Here’s an odd little detail about the statistics of column groups. At first glance it’s counter-intuitive, but it’s actually an “obvious” (once you’ve thought about it for a bit) consequence of the **approximate_ndv()** algorithm for gathering stats.

I’ll present it as a question:

*I have a table with two columns: flag and v1. Although the columns are not declared as non-null, neither holds any nulls. If there are 26 distinct values for flag, and 1,000,000 distinct values for v1, what’s the smallest number of distinct values I should see if I create the column group (flag, v1)?*

The question is, of course, a little ambiguous – there’s the number of distinct values that the column (group) holds and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:

```
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120
```

There are actually 1,000,000 distinct values for **v1** (it’s a varchar2() representation of the **id** column), but the **approximate_ndv()** mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle’s estimate here is a little bit off.

The column group (represented by the internal column definition **SYS_STUQ#TO6BT1REX3P1BKO0ULVR9**) must hold (at least) 1,000,000 distinct values – but the error in this case is a little larger than the error in **v1**, with the effect that the number of combinations appears to be less than the number of distinct values for **v1**!
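Oracle doesn’t document the internals of **approximate_ndv()** in detail, but the small-percentage error is characteristic of one-pass, hash-based distinct-count synopses in general. As a purely illustrative sketch (not Oracle’s actual algorithm), here’s a K-Minimum-Values estimator in Python: it hashes each value, keeps only the k smallest hash fractions, and infers the distinct count from how densely packed those fractions are – and, just as in the stats above, two different inputs with the same true NDV can come back with slightly different estimates.

```python
import hashlib
import heapq

def kmv_ndv(values, k=1024):
    """Estimate the number of distinct values with a K-Minimum-Values sketch.

    Hash each value to a pseudo-uniform fraction in [0, 1) and keep only
    the k smallest fractions seen; if the k-th smallest is f, the stream
    contained roughly (k - 1) / f distinct values.
    """
    heap = []       # max-heap (via negation) of the k smallest fractions
    members = set() # the same k fractions, for O(1) duplicate checks
    for v in values:
        digest = hashlib.sha256(str(v).encode()).digest()
        frac = int.from_bytes(digest[:8], "big") / 2**64
        if frac in members:
            continue                    # duplicates must not perturb the sketch
        if len(heap) < k:
            heapq.heappush(heap, -frac)
            members.add(frac)
        elif frac < -heap[0]:           # smaller than the current k-th smallest
            evicted = -heapq.heappushpop(heap, -frac)
            members.discard(evicted)
            members.add(frac)
    if len(heap) < k:
        return len(heap)                # fewer than k distinct values: exact
    return round((k - 1) / -heap[0])    # density of the k smallest fractions

estimate = kmv_ndv(range(100_000))
print(estimate)   # close to, but rarely exactly, 100000
```

With k = 1024 the relative error is typically a few percent (roughly 1/√k), which is the same order of magnitude as the ~1.3% discrepancy seen in the gathered stats.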

There’s not much difference in this case between actual and estimate, but the test demonstrates the potential for a significant difference between the estimate and the arithmetic that Oracle would do if the column group didn’t exist. Nominally the optimizer would assume there were 26 million distinct values (though in this case I had created only 1M rows in the table and the optimizer would sanity-check that 26M down to 1M).
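The arithmetic above can be sketched in a few lines, using the figures from the test. The capping rule is the familiar “NDV can’t exceed the row count” sanity check, shown here as plain arithmetic rather than the optimizer’s actual code path:

```python
# Figures taken from the gathered stats shown earlier.
num_rows  = 1_000_000
ndv_flag  = 26
ndv_v1    = 999_040    # approximate_ndv() estimate for v1
ndv_group = 989_120    # estimate recorded for the column group

# Without the column group the optimizer multiplies the individual NDVs...
ndv_product = ndv_flag * ndv_v1
# ...but a table with 1M rows can't exhibit more than 1M distinct
# combinations, so the product is capped at the row count.
ndv_capped = min(ndv_product, num_rows)

print(ndv_product)   # 25975040
print(ndv_capped)    # 1000000
print(ndv_group)     # 989120 -- smaller than the estimate for v1 alone
```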

So, although the difference between actual and estimate is small, we have to ask the question – are there any cases where the optimizer will ignore the column group stats because of a sanity check that “proves” the estimate is “wrong” – after all it must be wrong if the **num_distinct** is less than the **num_distinct** of one of the components. Then again maybe there’s a sanity check that only ignores the column group if the estimate is “wrong enough”, but allows for small variations.

I mention this only because an odd optimizer estimate has shown up recently on the **Oracle-L mailing list**, and the only significant difference I can see (at present) is that a bad plan appears for a partition where this column group anomaly shows up in the stats, but a good plan appears when the column group anomaly isn’t present.

### Footnote:

If you want to recreate the results above, here’s the model I’ve used (tested on 19.3.0.0 and 11.2.0.4):

```
rem
rem     Script:         column_group_stats_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested
rem             19.3.0.0
rem             11.2.0.4
rem

execute dbms_random.seed(0)

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        chr(65 + mod(rownum,26))        flag,
        rownum                          id,
        lpad(rownum,10,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
        dbms_random.value
/

select  column_name, num_distinct
from    user_tab_cols
where   table_name = 'T1'
/

begin
        dbms_stats.gather_table_stats(
                ownname         => null,
                tabname         => 'T1',
                method_opt      => 'for all columns size 1 for columns(v1, flag) size 1'
        );
end;
/

select  column_name, num_distinct
from    user_tab_cols
where   table_name = 'T1'
/
```

### Footnote 2:

As an interesting little statistical quirk, if I defined the column group as **(flag, v1)** rather than **(v1, flag)** the estimate for the column group **num_distinct** was 1,000,000.

Comments and related questions are welcome.