Oracle Scratchpad

November 18, 2007

FBI problem

Filed under: Statistics,Troubleshooting — Jonathan Lewis @ 10:50 pm BST Nov 18,2007

Here’s an anomaly I came across recently at a client site. The client had upgraded their database from 8i to 10g so that they could run a supported version of Oracle, but they had set the parameter optimizer_features_enable to ‘8.1.7’ to minimise the impact on their system of changes to the optimizer.

This is a reaonable strategy for minimising the time required for regression testing, of course, but it doesn’t work perfectly. A few bits of optimizer code probably fail to respond correctly to the parameter, some bits of optimizer code have bugs in 8i which are fixed in 10g and not ‘unfixed’ by changing the parameter and, occasionally, the change to 10g introduces new bugs that don’t go away when you set the parameter back to emulate 8i. It was one of the latter issues that affected this customer.

Stripped down to the bare minimum – from the original multi-table join – the problem can be demonstrated with the following script:

 
drop table t1;   

create table t1 
as 
with generator as ( 
	select	--+ materialize 
		rownum 	id 
	from	all_objects 
	where	rownum <= 3000 
) 
select 
	rownum			id, 
	'20'			c1, 
	'X'			c2, 
	lpad(rownum,10,'0')	small_vc, 
	rpad('x',100)		padding 
from 
	generator	v1, 
	generator	v2 
where 
	rownum <= 100000 
;   

update t1 
set 
	c1 = '40', 
	c2 = 'A' 
where 
	mod(id,1200) = 0 
;   

update t1 
set 
	c1 = '10', 
	c2 = 'Z' 
where 
	mod(id,950) = 0 
;   

commit;   

create index t1_f1 on t1(decode(c1, '40','40',null)); 
create index t1_f2 on t1(decode(c2,  'A', 'A',null)); 
create index t1_f3 on t1(decode(c1, '40', 'A',null));   

begin   

	dbms_stats.gather_table_stats( 
		ownname		 => user, 
		tabname		 =>'T1', 
		estimate_percent => null, 
		block_sample 	 => true, 
		method_opt 	 => 'for all columns size 10', 
		degree		 => null, 
		granularity      => 'default', 
		cascade		 => false 
	); 
end; 
/   

alter session set optimizer_features_enable='8.1.7';   

set autotrace traceonly explain   

select	* 
from	t1 
where	decode(c1, '40','40',null) = '40' 
;   

select	* 
from	t1 
where	decode(c2, 'A','A',null) = 'A' 
;   

select	* 
from	t1 
where	decode(c1, '40','A',null) = 'A' 
;   

set autotrace off 

All I’ve done with this script is define a table with two critical columns which are character based. They both have a highly skewed distribution, and only a few distinct values. Note, however, that one of the columns only holds values that look like numbers.

In the client’s original code, they wanted to select just those rows which showed the value ’40’ in a column with this type of distribution, so they had created the obvious function-based index (corresponding to t1_f1 in my example) and eveything worked perfectly – until they upgraded to 10g.

As you can see, the three queries are effectively identical (because of the way I’ve created the data and defined the indexes) and should return 80 rows. Here, with a little cosmetic tidying, are the execution plans I got from autotrace

 
--------------------------------------------------------------------- 
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  | 
--------------------------------------------------------------------- 
|   0 | SELECT STATEMENT            |       |   100K|    11M|     2 | 
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   100K|    11M|     2 | 
|*  2 |   INDEX RANGE SCAN          | T1_F1 |    79 |       |     1 | 
---------------------------------------------------------------------          

Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - access(DECODE("C1",'40','40',NULL)='40')          

--------------------------------------------------------------------- 
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  | 
--------------------------------------------------------------------- 
|   0 | SELECT STATEMENT            |       |    79 |  9638 |     2 | 
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    79 |  9638 |     2 | 
|*  2 |   INDEX RANGE SCAN          | T1_F2 |    79 |       |     1 | 
---------------------------------------------------------------------          

Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - access(DECODE("C2",'A','A',NULL)='A')          

--------------------------------------------------------------------- 
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  | 
--------------------------------------------------------------------- 
|   0 | SELECT STATEMENT            |       |    79 |  9638 |     2 | 
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    79 |  9638 |     2 | 
|*  2 |   INDEX RANGE SCAN          | T1_F3 |    79 |       |     1 | 
---------------------------------------------------------------------          

Predicate Information (identified by operation id): 
--------------------------------------------------- 
   2 - access(DECODE("C1",'40','A',NULL)='A')          

Notice how the estimated cardinality when querying against t1_f1 (the client’s existing index structure) is much higher than it should be.

The index section correctly identifies the number of rows expected, but the table section reports the number of rows in the table, not the number of rows that will be found by using the index. With this massive error in cardinality, the rest of the client’s execution plan was totally inappropriate, taking the query run time from 0.2 seconds to 80 seconds.

The results for the other two indexes, however, are correct; and here’s the execution plan for the first query when run under a real 8i database (with the necessary settings for the “query rewrite” parameters) – again it’s correct:

 
Execution Plan 
---------------------------------------------------------- 
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=79 Bytes=9401) 
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T1' (Cost=2 Card=79 Bytes=9401) 
   2    1     INDEX (RANGE SCAN) OF 'T1_F1' (NON-UNIQUE) (Cost=1 Card=79) 

As far as the customer was concerned we could fix the performance problem quite easily. Function-based indexes are often very localised in their effect, and there were only a dozen or so statements coded to take advantage of the t1_f1 index. So we were able to change the code and index to match the t1_f3 example, run a regression test and move the fix into production fairly quickly.

My best guess is that this is a bug introduced by a cunning trick that came into the optimizer code in 10.2 to do some damage limitation on columns that were declared as character but stored only numeric data. This trick seems to have been back-ported to Oracle 9.2.0.8 – but I don’t think it was in 9.2.0.6, so I’d be interested in hearing (in the comment section) how the test case behaves under 9.2.0.6 and 10.1.0.4.

(P.S. if you are kind enough to run the test on those versions, don’t worry about cutting and pasting the plans into the comment boxes – code snippets usually end up looking a total mess if you do).

6 Comments »

  1. I’ve tested this on 9.2.0.8 with optimizer_features_enable=’9.2.0′ and optimizer_features_enable=’8.1.7′ and on 9.2.0.6 with optimizer_features_enable=’9.2.0′ ( optimizer_features_enable is static in 9i, and I can’t stop this db for test)
    Here the results ( only first, wrong plan)
    9.2.0.8 ( optimizer_features_enable=’9.2.0′)
    Execution Plan
    ———————————————————-
    0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=99367 Bytes=12122774)
    1 0 TABLE ACCESS (BY INDEX ROWID) OF ‘T1′ (Cost=3 Card=99367 Bytes=12122774)
    2 1 INDEX (RANGE SCAN) OF ‘T1_F1′ (NON-UNIQUE) (Cost=2 Card=79)

    9.2.0.8 ( optimizer_features_enable=’8.1.7′)
    Execution Plan
    ———————————————————-
    0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=100000 Bytes=12200000)
    1 0 TABLE ACCESS (BY INDEX ROWID) OF ‘T1′ (Cost=2 Card=100000 Bytes=12200000)
    2 1 INDEX (RANGE SCAN) OF ‘T1_F1′ (NON-UNIQUE) (Cost=1 Card=79)

    9.2.0.6 ( optimizer_features_enable=’9.2.0′)
    Execution Plan
    ———————————————————-
    0 SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=100000 Bytes=12200000)
    1 0 TABLE ACCESS (BY INDEX ROWID) OF ‘T1′ (Cost=2 Card=100000 Bytes=12200000)
    2 1 INDEX (RANGE SCAN) OF ‘T1_F1′ (NON-UNIQUE) (Cost=1 Card=79)

    So, 9.2.0.6 make mistake too, but there is a little difference in numbers between 9.2.0.6 and 9.2.0.8 with optimizer_features_enable=’9.2.0′. And 9.2.0.6 and 9.2.0.8 (optimizer_features_enable=’8.1.7′) shows the same numbers for cardinality.

    Two others selects works fine in all cases.
    PS: Sorry for my English, it’s not native.

    Comment by Pavel Slepushkin — November 19, 2007 @ 6:31 am BST Nov 19,2007 | Reply

  2. I get the same result as your client on 10.1.0.4 and 10.1.0.5

    Comment by Niall Litchfield — November 19, 2007 @ 9:52 am BST Nov 19,2007 | Reply

  3. Pavel, Niall,

    Thanks for your assistance.

    Comment by Jonathan Lewis — November 19, 2007 @ 9:58 am BST Nov 19,2007 | Reply

  4. I wonder:

    Would it make any difference when redefining indexes like this?

    create index t1_f1 on t1(decode(c1, ’40’, c1 ,null));
    create index t1_f2 on t1(decode(c2, ‘A’, c2, null));

    and doing

    select *
    from t1
    where decode(c1, ’40’, c1, null) = ’40’
    ;

    select *
    from t1
    where decode(c2, ‘A’, c2, null) = ‘A’
    ;

    And does the same error surface when using case, e.g.

    create index t1_f1 on t1(case c1 when ’40’ then ’40’ else null end);
    create index t1_f2 on t1(case c2 when ‘A’ then ‘A’ else null end);
    create index t1_f3 on t1(case c1 when ’40’ then ‘A’ else null end);

    and consequently

    select *
    from t1
    where (case c1 when ’40’ then ’40’ else null end) = ’40’
    ;

    select *
    from t1
    where (case c2 when ‘A’ then ‘A’ else null end) = ‘A’
    ;

    select *
    from t1
    where (case c1 when ’40’ then ‘A’ else null end) = ‘A’
    ;

    Unfortunately I don’t have any of the mentioned Oracle versions to test this for myself…

    Comment by Robert — December 17, 2007 @ 4:24 pm BST Dec 17,2007 | Reply

  5. Robert,
    Good idea. But I’ve just tried it on 10.2.0.3 and it didn’t make any difference to the critical case.

    Comment by Jonathan Lewis — December 17, 2007 @ 8:14 pm BST Dec 17,2007 | Reply

  6. Thanks for checking, Jonathan! One more argument for using proper column types (i.e. don’t make it a VARCHAR2 if data is numeric only). :-) I assume the miscalculation does not appear when using numeric types either for the column or the function return value.

    Comment by Robert — December 17, 2007 @ 9:29 pm BST Dec 17,2007 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Theme: Rubric. Get a free blog at WordPress.com

Follow

Get every new post delivered to your Inbox.

Join 4,268 other followers