Oracle Scratchpad

October 21, 2020

Column Groups

Filed under: Column groups,extended stats,Oracle,Statistics — Jonathan Lewis @ 12:14 pm BST Oct 21,2020

Here’s an odd little detail about the statistics of column groups. At first glance it’s counter-intuitive but it’s actually an “obvious” (once you’ve thought about it for a bit) consequence of the approximate_ndv() algorithm for gathering stats.

I’ll present it as a question:

I have a table with two columns: flag and v1. Although the column are not declared as non-null neither holds any nulls. If there are 26 distinct values for flag, and 1,000,000 distinct values for v1, what’s the smallest number of distinct values I could see if I create the column group (flag, v1) ?

The question is a little ambiguous – there’s the number of distinct values that the column (group) holds and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:

select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120

There are actually 1,000,000 distinct values for v1 (it’s a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle’s estimate here is a little bit off. (It’s interesting to note, though, that the same mechanism managed to produce exactly the right answer for the real numeric.)

The column group (represented by the internal column called SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values – but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be less than the number of distinct values for v1!

There’s not much difference in this case between actual and estimate but the test demonstrates the potential for a significant difference between the estimate and the arithmetic that Oracle would do if the column group didn’t exist. Nominally the optimizer would assume there were 26 million distinct values (though in this case I had only created 1M rows in the table and the optimizer would use the number of rows as a sanity check of that 26M).

So, although the difference between actual and estimate is small, we have to ask the question – are there any cases where the optimizer will ignore the column group stats because of a sanity check that “proves” the estimate is “wrong” – after all it must be wrong if the num_distinct of the column group is less than the num_distinct of one of the components. Then again maybe there’s a sanity check that allows for small variations and ignores the column group only if the estimate is “wrong enough”.

I mention this only because an odd optimizer estimate has shown up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition when this column group anomaly shows up in the stats and a good plan appears when the column group anomaly isn’t present.

Footnote:

If you want to recreate the results above, here’s the model I’ve used (tested on 19.3.0.0 and 11.2.0.4):

rem
rem     Script:         column_group_stats_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem             11.2.0.4
rem 

execute dbms_random.seed(0)

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        chr(65 + mod(rownum,26))        flag,
        rownum                          id,
        lpad(rownum,10,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6   -- > comment to avoid WordPress format issue
order by
        dbms_random.value
/


select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns(v1, flag) size 1'
        );
end;
/
 
select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

Footnote 2:

As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag) the estimate for the column group num_distinct was 1,000,000.

1 Comment »

  1. […] NDV oddity for groups (Oct 2020): A column group will have at least as many values as each of its components – but gathered stats don’t always report that. […]

    Pingback by Column Group Catalogue | Oracle Scratchpad — December 3, 2022 @ 5:50 pm GMT Dec 3,2022 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.