Here’s an odd little detail about the statistics of column groups. At first glance it’s counter-intuitive but it’s actually an “obvious” (once you’ve thought about it for a bit) consequence of the approximate_ndv() algorithm for gathering stats.
I’ll present it as a question:
I have a table with two columns: flag and v1. Although the column are not declared as non-null neither holds any nulls. If there are 26 distinct values for flag, and 1,000,000 distinct values for v1, what’s the smallest number of distinct values I could see if I create the column group (flag, v1) ?
The question is a little ambiguous – there’s the number of distinct values that the column (group) holds and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
COLUMN_NAME NUM_DISTINCT
-------------------------------- ------------
FLAG 26
ID 1000000
V1 999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9 989120
There are actually 1,000,000 distinct values for v1 (it’s a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle’s estimate here is a little bit off. (It’s interesting to note, though, that the same mechanism managed to produce exactly the right answer for the real numeric.)
The column group (represented by the internal column called SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values – but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be less than the number of distinct values for v1!
There’s not much difference in this case between actual and estimate but the test demonstrates the potential for a significant difference between the estimate and the arithmetic that Oracle would do if the column group didn’t exist. Nominally the optimizer would assume there were 26 million distinct values (though in this case I had only created 1M rows in the table and the optimizer would use the number of rows as a sanity check of that 26M).
So, although the difference between actual and estimate is small, we have to ask the question – are there any cases where the optimizer will ignore the column group stats because of a sanity check that “proves” the estimate is “wrong” – after all it must be wrong if the num_distinct of the column group is less than the num_distinct of one of the components. Then again maybe there’s a sanity check that allows for small variations and ignores the column group only if the estimate is “wrong enough”.
I mention this only because an odd optimizer estimate has shown up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition when this column group anomaly shows up in the stats and a good plan appears when the column group anomaly isn’t present.
Footnote:
If you want to recreate the results above, here’s the model I’ve used (tested on 19.3.0.0 and 11.2.0.4):
rem
rem Script: column_group_stats_5.sql
rem Author: Jonathan Lewis
rem Dated: Oct 2020
rem
rem Last tested
rem 19.3.0.0
rem 11.2.0.4
rem
execute dbms_random.seed(0)
create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
chr(65 + mod(rownum,26)) flag,
rownum id,
lpad(rownum,10,'0') v1
from
generator v1,
generator v2
where
rownum <= 1e6 -- > comment to avoid WordPress format issue
order by
dbms_random.value
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
begin
dbms_stats.gather_table_stats(
ownname => null,
tabname => 'T1',
method_opt => 'for all columns size 1 for columns(v1, flag) size 1'
);
end;
/
select column_name, num_distinct
from user_tab_cols
where table_name = 'T1'
/
Footnote 2:
As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag) the estimate for the column group num_distinct was 1,000,000.