Strange Estimates.

October 30, 2019

Strange Estimates.

Filed under: CBO,extended stats,Oracle,Statistics — Jonathan Lewis @ 1:10 pm GMT Oct 30,2019

A question came up on the Oracle-L list server a couple of days ago expressing some surprise at the following execution plan:


--------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                 |       |       |   845 (100)|          |
|   1 |  SORT AGGREGATE                      |                 |     1 |     7 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ANY_TABLE       | 84827 |   579K|   845   (1)| 00:00:01 |
|   3 |    SORT CLUSTER BY ROWID             |                 | 68418 |       |    76   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                 | ANY_INDEX       | 68418 |       |    76   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("X"."ANY_COLUMN1"='J')
   4 - access("X"."ANY_COLUMN2"=89155)

You’ll notice that this is a very simple query accessing a table by index, yet the estimated table rows found exceeds the estimated number of index entries used to probe the table. How can this happen. The answer (most frequently) is that there’s a mismatch between the table (or, more commonly, column) statistics and the index statistics. This seems to happen very frequently when you start mixing partitioned tables with global (or globally partitioned) indexes but it can happen in very simple cases, especially since a call to gather_table_stats() with cascade set to true and using the auto_sample_size will take a small sample from the index while using a 100% “sample” from the table.

Here’s an example I engineered very quickly to demonstrate the point. There’s no particular reason for the choice of DML I’ve used on the data beyond a rough idea of setting up a percentage of nulls and deleting a non-uniform pattern of rows.


rem
rem     Script:         table_index_mismatch.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,1000)                n1,
        mod(rownum,1000)                n2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1, for columns (n1,n2) size 1'
        );
end;
/

create index t1_i1 on t1(n1);

delete from t1 where mod(trunc(sqrt(n1)),7) = 0;
update t1 set n1 = null where mod(n1,10) = 0;
delete from t1 where mod(n1,10) = trunc(dbms_random.value(0,10));

execute dbms_stats.gather_table_stats(user,'t1',estimate_percent=>1)
execute dbms_stats.gather_index_stats(null,'t1_i1',estimate_percent=> 0.01)

Of course you’re not supposed to collect stats with arbitrary samples in any recent version of Oracle, so going for a 1% and 0.01% sample seems a little daft but I’m just doing that to demonstrate the problem with a very small data set.

After generating the data and gathering the stats I ran a few queries to pick out some critical numbers.


select
        table_name, sample_size, num_rows
from
        user_tables
where
        table_name = 'T1'
/

select 
        index_name, sample_size, num_rows, distinct_keys
from
        user_indexes
where
        table_name = 'T1'
and     index_name = 'T1_I1'
/

select
        column_name, sample_size, num_nulls, num_distinct
from
        user_tab_cols
where
        table_name = 'T1'
and     (
            column_name = 'N1'
         or virtual_column = 'YES'
        )
order by
        column_name
/

You’ll notice that I’ve only picked one of my original columns and any virtual columns. My gather_table_stats() call had a method_opt that included the creation of extended stats for the column group (n1, n2) and I want to report the stats on the resulting virtual column.


TABLE_NAME           SAMPLE_SIZE   NUM_ROWS
-------------------- ----------- ----------
T1                          7865     786500


INDEX_NAME           SAMPLE_SIZE   NUM_ROWS DISTINCT_KEYS
-------------------- ----------- ---------- -------------
T1_I1                     385779     713292           714


COLUMN_NAME                      SAMPLE_SIZE  NUM_NULLS NUM_DISTINCT
-------------------------------- ----------- ---------- ------------
N1                                      7012      85300          771
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS          7865          0          855

A couple of observations on the stats

the table sample size is, as expected, 1% of the reported num_rows (the actual count is 778,154).
The index sample size is much bigger than expected – but that’s probably related to the normal “select 1,100 leaf blocks strategy”. Because of the skew in the pattern of deleted values it’s possible for the sample size in this model to vary between 694,154 and something in the region of 380,000.
The n1 sample size is about 10% smaller than the table sample size – but that’s because I set 10% of the column to null.
The column group sample size matches the table sample size because column group hash values are never null, even if an underlying column is null.

So let’s check the execution plan for a very simple query:


set autotrace on explain
select id from t1 where n1 = 140 and n2 = 140;
set autotrace off


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   920 | 11960 |   918   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   920 | 11960 |   918   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_I1 |   909 |       |     5   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=140)
   2 - access("N1"=140)

The estimate for relevant index rowids is smaller than the estimate for the number of table rows! The difference is not as extreme as the case reported on Oracle-l, but I’m only trying to demonstrate a principle, not reproduce the exact results.

There are several ways in which contradictory results like this can appear – but in this case we can see the following:

For the table access table.num_rows/column.num_distinct = 786,500 / 855 = 919.88 (using the column group num_distinct)
For the index range scan: (table.num_rows – column.num_nulls) / column.num_distinct = (786500 – 85300) / 771 = 909.47 (using the n1 statistics)

So the change in strategy as it becomes possible for the optimizer to take advantage of the column group means the index and table have been using incompatible sets of stats (in particular there’s that loss of information about NULLs) as their cardinalities are calculated. The question, then, is “how much is that likely to matter”, and the follow-up if it can matter is “in what circumstancs could the effect be large enough to cause problems”. But that’s a topic for another day.

Update / Footnote

In the case of the Oracle-l example, there was no column group, and in some cases the optimizer would produce a plan where the table estimate was much smaller than the index estimate, and in other cases (like the opening plan above) the table estimate was signficantly greater than the index estimate. This was a side effect of adaptive statistics: the low table estimate was due to the basic “multiply separate selectivities”; but the with adaptive statistics enabled Oracle started sampling the table to check the correlation between the two tables, and then produced an SQL Plan Directive to do so and got to the higher (and correct) result.

Oracle Scratchpad

October 30, 2019