Oracle Scratchpad

January 14, 2008

Trouble-shooting

Filed under: CBO, Execution plans — Jonathan Lewis @ 1:32 pm GMT Jan 14,2008

A recent item on one of the Oracle forums posed a question about copying statistics from one database to another (typically that would be development to test) to check the sanity of execution plans on a database that didn’t really have the proper scale of data.

In this case the poster was looking at a test system which had 150% more data than a development system.  After copying the statistics from one system to the other, a simple query showed different execution plans on the two systems, as follows (I’ve trimmed the bytes and time columns at the right-hand side to get a better page fit).

Test

------------------------------------------------------------------------------- 
| Id | Operation                | Name          | Rows | Cost | Pstart| Pstop | 
------------------------------------------------------------------------------- 
|  0 | SELECT STATEMENT         |               |   30 |    2 |       |       | 
|  1 |  PARTITION RANGE ITERATOR|               |   30 |    2 |    12 | 14    | 
|* 2 |   TABLE ACCESS FULL      | TBL_TX_REPORT |   30 |    2 |    12 | 14    | 
-------------------------------------------------------------------------------    

Development

 
-------------------------------------------------------------------------------------------------- 
| Id | Operation                            | Name                 | Rows | Cost | Pstart| Pstop | 
-------------------------------------------------------------------------------------------------- 
|  0 | SELECT STATEMENT                     |                      |    1 |    1 |       |       | 
|  1 |  PARTITION RANGE ITERATOR            |                      |    1 |    1 |    12 |    14 | 
|* 2 |   TABLE ACCESS BY LOCAL INDEX ROWID  | TBL_TX_REPORT        |    1 |    1 |    12 |    14 | 
|* 3 |    INDEX RANGE SCAN                  | IX_TBL_TX_LONDONTIME |    1 |    1 |    12 |    14 | 
--------------------------------------------------------------------------------------------------    

The question is: what else (apart from the object statistics) could the plans depend on?

The first thing to note is that the cardinality (rows) estimates are different – so the difference in plans is not fundamentally a matter of costing; it’s a matter of how the selectivity is calculated. So we need to find out why the selectivity can be different when the object statistics are supposed to be the same.

Investigation point one – we don’t have the query and we don’t have the full execution plan; so we are a little deprived of clues. The execution plan should always be a proper report from dbms_xplan, showing the access_predicates and filter_predicates. In this case, we may find that (for reasons as yet unknown) the predicates reported in the plans are different – and don’t necessarily match the predicates in the original query.
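
(As a sketch only, since we don’t have the original statement, something like the following on 9.2 or later reports the predicate sections; the query here is just a stand-in for the real one.)

rem     stand-in for the real query, which we don't have

explain plan for
select count(*) from tbl_tx_report;

select * from table(dbms_xplan.display(null, null, 'ALL'));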

It’s possible that there is a difference in (say) some of the NLS parameters that causes a concealed implicit conversion that blocks an index.  (You can’t help noticing that the index used has the words “London Time” in the name – maybe there’s a funny time zone conversion going on that has been hidden.)
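
(A quick check, sketched below, is to compare the session NLS settings on the two systems; a concealed conversion would normally also betray itself in the filter_predicates, which is another reason for wanting the full dbms_xplan report.)

select parameter, value
from   nls_session_parameters
order by parameter;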

Investigation point one and a half – on the system where the plan is a tablescan, re-run the query with a hint to use the target index: maybe it simply can’t use the index at all.
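
(Something along these lines, purely as an illustration; the column name and predicate below are guesses since we don’t have the real query. If the plan still shows a full tablescan despite the hint, the index can’t legally be used for those predicates.)

rem     "londontime" and the predicate are guesses - substitute the real ones

select  /*+ index(t ix_tbl_tx_londontime) */
        count(*)
from    tbl_tx_report t
where   londontime >= sysdate - 1
;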

Investigation point two – assuming the predicates are identical, are all the optimizer-related parameters the same (and don’t forget to check the Oracle version as a starting point)? The simplest option is to enable event 10053 and run a simple query like “select ‘rubbish’ from dual”.  There is a section in the trace file that lists all the optimizer environment details. If you find a difference, check whether the name (and description) makes it look like the guilty parameter.
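
(The mechanics, as a sketch, would be something like the following on each system; the tracefile_identifier is only there to make the trace file easy to find.)

alter session set tracefile_identifier = 'env_check';
alter session set events '10053 trace name context forever, level 1';

select 'rubbish' from dual;

alter session set events '10053 trace name context off';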

Investigation point three – tedious, but if the predicates are the same and the environments are the same, you have to query the statistics and check that they really are the same. Maybe the export/import simply didn’t work properly. This is really tedious because it means querying user_tables, user_indexes, user_tab_partitions, user_ind_partitions, user_tab_cols, user_part_col_statistics, user_tab_histograms and user_part_histograms for the table and all its indexes and any columns in the “where clause” to see that all the statistical information is identical.
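
(As a starting point, and only a sketch since the available columns vary a little with version, you could run queries like these on both systems and compare the output; the partition-level and histogram views need the same treatment.)

select  num_rows, blocks, avg_row_len, last_analyzed
from    user_tables
where   table_name = 'TBL_TX_REPORT';

select  index_name, blevel, leaf_blocks, distinct_keys,
        clustering_factor, num_rows, last_analyzed
from    user_indexes
where   table_name = 'TBL_TX_REPORT';

select  column_name, num_distinct, num_nulls, density,
        num_buckets, low_value, high_value
from    user_tab_cols
where   table_name = 'TBL_TX_REPORT'
order by column_name;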

If you get to the end of point three and still can’t spot the issue, you might want to look at the 10053 trace which, in this case, is likely to be fairly short, and see if you can spot where the difference first appears.

12 Comments »

  1. If the statistics were truly “copied” using the SQL*Plus copy command, the number values would be converted to decimals (for portability). This would round your density values to 0 or 1 when they would likely be some fraction in between.

    I’ll defer to you on how a change in density could impact the plans if num_distinct and num_nulls are still accurate.

    Comment by Anthony Cordell — January 14, 2008 @ 9:13 pm GMT Jan 14,2008 | Reply

  2. If the statistics are cloned from dev to test, which is 150% bigger, there is a good chance that the test system has data outside the known range (low – high) of the development (!) statistics.

    If the predicates are not identical but take their cue from the larger data distribution of the test system, you can easily run into the situation where the predicate value is outside the low-high column value range. And if the system is a 10g system, this leads to the optimizer “Using prorated density: [ 1/(2*num_rows) ] of col #n as selectivity of out-of-range value pred”. The 10053 trace will have the actual number for 1/(2*num_rows) in place of “[ 1/(2*num_rows) ]”.

    This has the nasty effect that every index which contains this column has an index range scan cost of blevel+2, which of course is almost guaranteed to be less than any other access path. And since index levels are rarely anything but two or three, the tie-breaker of the index name will likely be applied, often leading to an absurd index range scan plan.

    A lot of ifs, but not an unrealistic scenario, especially given the cardinality estimate of 1 (selectivity * num_rows would equal 0.5, rounded to 1).

    Comment by Wolfgang Breitling — January 14, 2008 @ 9:21 pm GMT Jan 14,2008 | Reply

  3. Anthony,
    I infer from the forum posting that the ‘copy’ was done through the dbms_stats package – which would be using the various export_xxx_stats and import_xxx_stats procedures.

    Wolfgang,
    I discounted the out-of-range idea when jotting my notes – without explaining it – because the partition start and stop were the same. But your comment is still valid – I may have assumed too much.

    In passing, the alphabetic tie-breaker has been delayed by one more test – introduced very recently in 9.2, I think. When the costs for using two different indexes match, the index with the larger number of distinct keys is used. I haven’t investigated this in all circumstances, though.

    Comment by Jonathan Lewis — January 14, 2008 @ 10:42 pm GMT Jan 14,2008 | Reply

  4. I only just now realized that my explanation doesn’t hold water because the plans are reversed. I focused entirely on the plans and worked from the idea that the first was from development and the second from test, but it’s the other way around.

    Comment by Wolfgang Breitling — January 15, 2008 @ 11:03 am GMT Jan 15,2008 | Reply

  5. mumble…

    as you point out, the number of estimated rows is very different.
    So…

    If the query is exactly the same, can we say that the TEST system is not analyzed?
    Or, in this case, that the statistics were not (well) copied? (Even if the programmer states the opposite…)

    Bye,
    Antonio

    Comment by Antonio — January 15, 2008 @ 3:02 pm GMT Jan 15,2008 | Reply

  6. Antonio, it is possible that the statistics have not been well-copied, but there may be a number of ‘surprise’ reasons why the statistics could be as accurately copied as possible and still give an odd difference in execution plan. Here’s a little gem – no tricks: it’s cut and pasted from the output of a single script running non-stop on a single database, but with a little cosmetic cleaning:

    (This isn’t the same as the original poster’s problem, of course, just a demonstration that changing a couple of parameters can have a surprising side effect).

    select
            padding
    from
            t1
    where
            n1      = 2
    and     ind_pad = rpad('x',40,'x')
    and     n2      = 2
    /
    
    
    SQL> set autotrace traceonly explain
    SQL> alter session set optimizer_features_enable = '11.1.0.6';
    SQL> @afiedt.buf
    
    ----------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost  |
    ----------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |   200 | 49400 |    23 |
    |*  1 |  TABLE ACCESS FULL| T1   |   200 | 49400 |    23 |
    ----------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - filter("N1"=2 AND "IND_PAD"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' AND "N2"=2)
    
    
    SQL> alter session set optimizer_features_enable = '9.2.0';
    SQL> @afiedt.buf
    
    ---------------------------------------------------------------------
    | Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
    ---------------------------------------------------------------------
    |   0 | SELECT STATEMENT            |       |    10 |  2470 |    12 |
    |   1 |  TABLE ACCESS BY INDEX ROWID| T1    |    10 |  2470 |    12 |
    |*  2 |   INDEX RANGE SCAN          | T1_I1 |    10 |       |     4 |
    ---------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       2 - access("N1"=2 AND "IND_PAD"='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' AND "N2"=2)
    

    Comment by Jonathan Lewis — January 15, 2008 @ 4:34 pm GMT Jan 15,2008 | Reply

  7. Got the point!

    Better if we test our new Oracle installation a bit more! :lol:

    Comment by Antonio — January 16, 2008 @ 10:40 am GMT Jan 16,2008 | Reply

  8. Investigation point two
    Can we use OPTIMIZER_ENV_HASH_VALUE instead of using the tedious 10053?
    regards

    Comment by srivenu — January 17, 2008 @ 10:41 am GMT Jan 17,2008 | Reply

  9. If this is a 10g system, could the automatic stats collection job have stepped in and re-gathered the stats on some or all of the segments? I’m not sure off the top of my head if importing stats resets the values in dba_tab_modifications. A quick check on the last_analyzed column could help identify that.
    Another possibility is the presence or otherwise of system statistics. I’d check those before delving into 10053.

    Comment by Martin W — January 17, 2008 @ 3:37 pm GMT Jan 17,2008 | Reply

  10. Srivenu,

    Yes, you could assume that a change in the value of v$sql.optimizer_env_hash_value indicated a change in the optimizer environment, and you could then query v$sql_optimizer_env to find out which parameters had changed. I’m not sure that you would see every possible change, though, because of the number of hidden parameters that don’t show up in the view.

    Martin,
    In this case, you can discount system statistics (at least in the first case) because the first significant difference is that the cardinality has changed. System statistics could have affected the operations used in the plan, but shouldn’t be able to produce a different cardinality. Your point about the stats being changed behind your back by an automatic stats collection is, of course, valid – as is your suggestion for checking the last_analyzed column. There’s scope for endless entertainment on a cold winter’s evening trying to find out what each variation of dbms_stats does in different versions to different columns.

    Comment by Jonathan Lewis — January 17, 2008 @ 8:00 pm GMT Jan 17,2008 | Reply

  11. A follow-up to my response in note 3 – regarding the ‘number of distinct keys’ tie-break. I got an email from Wolfgang Breitling to say that he had never seen this appear; and I had to reply that I hadn’t either: I was simply going on the notes in the 9.2.0.7 patch set, which listed, under “Code enhancement:”

    2720661 CBO enhancement to break tie based on NDK when cost is same

    Following Wolfgang’s email, I’ve tried a couple of experiments to see this enhancement in action in a 9.2.0.8 database – but haven’t succeeded yet.

    [Updated again:] I’ve constructed a demonstration that behaves – i.e. switches indexes – in 10.2.0.1 when CPU costing is disabled; but it doesn’t behave in 9.2.0.8.

    Comment by Jonathan Lewis — February 4, 2008 @ 7:07 pm GMT Feb 4,2008 | Reply

  12. […] We can see that once past the boundary value 20131230, the rows estimate declines slowly until it becomes 1. This decline is a gradual process, and the algorithm differs from 10g: when the value is out of range, 10g uses density = 1/(2*NUM_ROWS) (see Jonathan Lewis’s article), whereas the 11g decline is much more gradual. […]

    Pingback by » 在跨年时出现执行计划的突变 OracleBlog -天堂向左 DBA向右 — February 2, 2014 @ 9:52 am GMT Feb 2,2014 | Reply

