## October 21, 2020

### Column Groups

Filed under: Column groups,extended stats,Oracle,Statistics — Jonathan Lewis @ 12:14 pm BST Oct 21,2020

Here’s an odd little detail about the statistics of column groups. At first glance it’s counter-intuitive but it’s actually an “obvious” (once you’ve thought about it for a bit) consequence of the approximate_ndv() algorithm for gathering stats.

I’ll present it as a question:

I have a table with two columns: flag and v1. Although the column are not declared as non-null neither holds any nulls. If there are 26 distinct values for flag, and 1,000,000 distinct values for v1, what’s the smallest number of distinct values I should see if I create the column group (flag, v1) ?

The question is, of course, a little ambiguous – there’s the number of distinct values that the column (group) holds and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:

```select  column_name, num_distinct
from    user_tab_cols
where   table_name = 'T1'
/

COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120
```

There are actually 1,000,000 distinct values for v1 (it’s a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle’s estimate here is a little bit off.

The column group (represented by the internal column defonition SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values – but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be less than the number of distinct values for v1!

There’s not much difference in this case between actual and estimate, but there test demonstrates the potential for a significant difference between the estimate and the arithmetic that Oracle would do if the column group didn’t exist. Nominally the optimizer would assume there were 26 million distinct values (though in this case I had only created 1M rows in the table and the optimizer would sanity check that 26M).

So, although the difference between actual and estimate is small, we have to ask the question – are there any cases where the optimizer will ignore the column group stats because of a sanity check that “proves” the estimate is “wrong” – after all it must be wrong if the num_distinct is less than the num_distinct of one of the components. Then again maybe there’s a sanity check that only ignores the column group if the estimate is “wrong enough”, but allows for small variations.

I mention this only because an odd optimizer estimate has shown up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition where this column group anomaly shows up in the stats, but a good plan appears when the column group anomaly isn’t present.

### Footnote:

If you want to recreate the results above, here’s the model I’ve used (tested on 19.3.0.0 and 11.2.0.4):

```rem
rem     Script:         column_group_stats_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested
rem             19.3.0.0
rem             11.2.0.4
rem

execute dbms_random.seed(0)

create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4    -- > comment to avoid WordPress format issue
)
select
chr(65 + mod(rownum,26))        flag,
rownum                          id,
from
generator       v1,
generator       v2
where
rownum <= 1e6   -- > comment to avoid WordPress format issue
order by
dbms_random.value
/

select  column_name, num_distinct
from    user_tab_cols
where   table_name = 'T1'
/

begin
dbms_stats.gather_table_stats(
ownname     => null,
tabname     => 'T1',
method_opt  => 'for all columns size 1 for columns(v1, flag) size 1'
);
end;
/

select  column_name, num_distinct
from    user_tab_cols
where   table_name = 'T1'
/

```

### Footnote 2:

As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag) the estimate for the column group num_distinct was 1,000,000.