Here’s a little test that should have been the second test I ran when ** column groups** were first made available in Oracle. It’s a check on the second of the two basic algorithms that Oracle uses for combining predicates:

- sel(predA
predB) = sel(predA)*and******sel(predB) - sel(predA
predB) = sel(predA)*or***+**sel(predB) – sel(predApredB)*and*

As I pointed out many years ago in ** Cost Based Oracle – Fundamentals** these are the formulae for combining probabilities of

**events. But if the predicates are not independent the formulae make the optimizer produce bad cardinality estimates and that’s why the Oracle developers introduced column groups – virtual columns that gather statistics about combinations of columns.**

*independent*So let’s set up an extreme example and test the results (I’ve run it one 21c and 19c, but the issue is the same for all versions with column groups):

```
rem
rem Script: column_group_and_or.sql
rem Author: Jonathan Lewis
rem Dated: Dec 2021
rem
rem Last tested
rem 21.3.0.0
rem 19.11.0.0
rem
create table t1
as
with generator as (
select
rownum id
from dual
connect by
level <= 1e4 -- > comment to avoid WordPress format issue
)
select
mod(rownum,100) n1,
mod(rownum,100) n2,
lpad(rownum,10,'0') v1,
lpad('x',100,'x') padding
from
generator
;
prompt =========================================
prompt Run the script twice,
prompt first without gathering the column group
prompt then after gathering the column group
prompt =========================================
begin
dbms_stats.gather_table_stats(
ownname => null,
tabname => 'T1',
method_opt => 'for columns (n1, n2) size 1'
);
end;
/
column column_name format a32
select
column_name, num_distinct, data_default
from
user_tab_cols
where
table_name = 'T1'
/
```

This script gives you a table with 10,000 rows, where the ** n1 **and

**columns always have exactly the same values as each other, and each column has 100 rows each of 100 distinct values. If you allow the pl/sql block to run the column stats query will report the following:**

*n2*```
COLUMN_NAME NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ --------------------------------------------------------------------------------
N1 100
N2 100
V1 10000
PADDING 1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS 100 SYS_OP_COMBINED_HASH("N1","N2")
5 rows selected.
```

After creating the data (and statistics) we can execute the following simple statements:

```
set autotrace traceonly explain
select *
from t1
where
n1 = 50
/
select *
from t1
where
n1 = 50
and n2 = 50
/
select *
from t1
where
n1 = 50
or n2 = 50
/
set autotrace off
```

We know that the first query will return 100 rows, and that the optimizer has enough information to get the correct estimate: the selectivity of *“n1 = 50”* is 1/100, the number of rows in the table is 10,000, so the estimate should be 1/100 * 10,000 = 100.

For the second query: if we don’t create the column group we expect the optimizer to multiply the selectivities of the two predicates together before multiplying by the number of rows in the table to get an estimate of (1/100 * 1/100) * 10,000 = 1

If we’ve created the column group we expect the optimizer to use the column group selectivity to calculate its estimate (1/100 * 10,000) = 100.

Then we get to the third query: if we don’t create the column group we expect the optimizer to slot 1/100 into the 4 places it has to appear in the formula to get an estimate (1/100 + 1/100 – (1/100 * 1/100)) * 10,000 = 199.

If we have created the column group the third expression in the formula can use the column group 1/** num_distinct** for the third expression to get: (1/100 + 1/100 – (1/100)) * 10,000 = 100

Here are the three execution plans (with a little cosmetic work) when the column group exists:

```
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11800 | 26 (4)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 100 | 11800 | 26 (4)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50)
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11800 | 26 (4)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 100 | 11800 | 26 (4)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50 AND "N2"=50)
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 199 | 23482 | 27 (8)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 199 | 23482 | 27 (8)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50 OR "N2"=50)
```

As you can see from the results in the ** Rows **column, the optimizer has used the column group statistics for the

**estimate, but not for the**

*AND***estimate, resulting in an estimate that is out by a factor of nearly 2.**

*OR*It should be noted that failure to use the column group for AND’ed predicates can introduce a massive error in cardinality estimates but for OR’ed predicates the worst it can be is a factor of 2.

### Summary

There is a possible enhancement to the the optimizer’s use of column group statistics that could make the arithmetic consistent for AND and OR predicates and could allow the optimizer to produce better estimates for OR conditions.

Although any error in cardinality estimates can lead to undesirable execution plans the scale of the error due to the current inconsistency will be at worst a factor or 2, which is unlikely to introduce many surprise plans. The nature of the error means the estimate will be too big, which might push the optimizer into using tablescans and hash joins when indexed accesses and nested loops might be the better choice.

[…] A gap in the algorithm (Dec 2021): the optimizer doesn’t use the column group where it should for OR’ed predicates […]

Pingback by Column Group Catalog | Oracle Scratchpad — December 3, 2021 @ 10:56 am GMT Dec 3,2021 |

I wonder if Nigel Bayliss reads your posts… He should.

Comment by JAYT22 — December 4, 2021 @ 8:12 am GMT Dec 4,2021 |

[…] Lewis has written an interesting article talks about the OR predicate and column correlation. It is mentioned that if there are correlation […]

Pingback by Disjunctive predicates and column correlation | Chinar Aliyev`s blog — December 7, 2021 @ 7:32 am GMT Dec 7,2021 |

Hi,

Multi-column or extended statistics are relatively recent in pg world but every supported version has them and I think they should be considered more widely.

Here is a way to produce a similar test case with PostgreSQL 13 (it also works with pg 14 and pg 15devel) . I wanted to know if the enhancement/feature mentionned in this page is OK.

It also works with PostgreSQL 12 but PostgreSQL 13 has further improvements, e.g multi column stats can be used by the planner with IN lists

Last query with an older version of PostgreSQL (10 or 11, extended stats but without most common values) gives :

PostgreSQL 9.6 had no multi column statistics at all but this version is not supported anymore.

Best regards,

Phil

Comment by Phil — December 20, 2021 @ 11:06 am GMT Dec 20,2021 |

Phil,

It shouldn’t surprise you that I don’t know the meaning of all the bits and pieces you’ve shown me, but

a) Clearly the OR cardinality estimate it correct

b) It looks as if the “create statistics” statement has produced a correlation coefficient between n1 and n2 and applied it suitably to get this result.

So, bottom line – it looks as if Postgres has a more subtle mechanism for handling column groups (or the equivalent) than Oracle. The next feature to check, though is the cost of gathering the information that allows the Postgres column group to work. (And the cases where the feature is bypassed – Oracle has a lot of them)

If I were investigating the feature I think my next test would be to change the data pattern in the test to see how accurate Postgres was under fairly reasonable circumstances. I chose my example to make it easy to see that the Oracle estimate was clearly bad; but it’s also a boundary case that might have allowed an alternative strategy to produce the best possible result.

Regards

Jonathan Lewis

Comment by Jonathan Lewis — December 20, 2021 @ 11:24 am GMT Dec 20,2021 |

Hi Jonathan,

I always run the same test https://pgphil.ovh/extended_stats_12_beta_01.php and limitations are progressively lifted. I didn’t test but Oracle probably has no problem to avoid nested loops in such a case, perhaps it does not even need extended statistics for that.

At first I thought your test “works” with pg 13 because of some magic clever feature, I had read “Allow use of extended statistics objects for OR clauses and IN/ANY constant lists (Pierre Ducroquet, Tomas Vondra)” in PostgreSQL 13 release notes.

Wrong assumption and it’s simpler, it’s just the Multivariate MCV Lists feature introduced with pg 12 that makes the difference cf https://www.postgresql.org/docs/12/planner-stats.html#PLANNER-STATS-EXTENDED .

MCV is a very basic option, it stores the most common (n1,n2) combinations. Such a list cannot be unlimited. If it only knows the correlation coefficient between n1 and n2 and the number of distinct (n1,n2) combinations, no magic and postgres planner gives the same estimation than Oracle CBO.

SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),

pg_mcv_list_items(stxdmcv) m WHERE stxname = ‘t1_n1_n2’ and values = ‘{50,50}’;

index | values | nulls | frequency | base_frequency

——-+———+——-+———–+—————-

50 | {50,50} | {f,f} | 0.01 | 0.0001

(1 ligne)

Base frequency is based on single column stats and frequency is the actual frequency of the combination. It would not be difficult to trick the feature to obtain a bad estimation since it depends on a record size.

Best regards,

Phil

Comment by Phil — December 20, 2021 @ 2:13 pm GMT Dec 20,2021 |