Oracle Scratchpad

December 3, 2021

Column Groups

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 10:40 am GMT Dec 3,2021

Here’s a little test that should have been the second test I ran when column groups were first made available in Oracle. It’s a check on the second of the two basic algorithms that Oracle uses for combining predicates:

  1. sel(predA and predB) = sel(predA) * sel(predB)
  2. sel(predA or predB) = sel(predA) + sel(predB) – sel(predA and predB)

As I pointed out many years ago in Cost Based Oracle – Fundamentals these are the formulae for combining probabilities of independent events. But if the predicates are not independent the formulae make the optimizer produce bad cardinality estimates and that’s why the Oracle developers introduced column groups – virtual columns that gather statistics about combinations of columns.
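The two formulae can be sketched in a few lines (a minimal sketch of the independence arithmetic, not Oracle’s actual code):

```python
# The two basic combination formulae, assuming independent predicates.

def sel_and(sel_a, sel_b):
    """Selectivity of (predA AND predB) under independence."""
    return sel_a * sel_b

def sel_or(sel_a, sel_b):
    """Selectivity of (predA OR predB) under independence."""
    return sel_a + sel_b - sel_a * sel_b

# Two predicates, each matching 1% of the rows:
print(sel_and(0.01, 0.01))   # ~0.0001 -> about 1 row in 10,000
print(sel_or(0.01, 0.01))    # ~0.0199 -> about 199 rows in 10,000
```

If the two predicates are perfectly correlated (as in the example below) the true selectivity of both the AND and the OR is 0.01, so the independence assumption underestimates the AND and overestimates the OR.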

So let’s set up an extreme example and test the results (I’ve run it on 21c and 19c, but the issue is the same for all versions with column groups):

rem
rem     Script:         column_group_and_or.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2021
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        mod(rownum,100)                 n1,
        mod(rownum,100)                 n2,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator 
;

prompt  =========================================
prompt  Run the script twice, 
prompt  first without gathering the column group
prompt  then after gathering the column group
prompt  =========================================

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for columns (n1, n2) size 1'
        );
end;
/

column column_name format a32

select
        column_name, num_distinct, data_default
from
        user_tab_cols
where
        table_name = 'T1'
/

This script gives you a table with 10,000 rows, where the n1 and n2 columns always have exactly the same values as each other, and each column has 100 rows each of 100 distinct values. If you allow the PL/SQL block to run, the column stats query will report the following:

COLUMN_NAME                      NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ --------------------------------------------------------------------------------
N1                                        100
N2                                        100
V1                                      10000
PADDING                                     1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS            100 SYS_OP_COMBINED_HASH("N1","N2")

5 rows selected.

After creating the data (and statistics) we can execute the following simple statements:

set autotrace traceonly explain

select  * 
from    t1
where
        n1 = 50
/

select  * 
from    t1
where
        n1 = 50
and     n2 = 50
/

select  *
from    t1
where
        n1 = 50
or      n2 = 50
/

set autotrace off

We know that the first query will return 100 rows, and that the optimizer has enough information to get the correct estimate: the selectivity of “n1 = 50” is 1/100, the number of rows in the table is 10,000, so the estimate should be 1/100 * 10,000 = 100.

For the second query: if we don’t create the column group we expect the optimizer to multiply the selectivities of the two predicates together before multiplying by the number of rows in the table to get an estimate of (1/100 * 1/100) * 10,000 = 1.

If we’ve created the column group we expect the optimizer to use the column group selectivity to calculate its estimate (1/100 * 10,000) = 100.

Then we get to the third query: if we don’t create the column group we expect the optimizer to slot 1/100 into the 4 places it has to appear in the formula to get an estimate of (1/100 + 1/100 – (1/100 * 1/100)) * 10,000 = 199.

If we have created the column group the optimizer could use the column group’s 1/num_distinct for the third term of the formula to get: (1/100 + 1/100 – (1/100)) * 10,000 = 100
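The arithmetic above can be reproduced in a few lines (a sketch of the formulae, not the optimizer’s actual code), using the numbers from this test: 10,000 rows, 100 distinct values in each column, and a column group with 100 distinct combinations:

```python
num_rows  = 10000
sel_n1    = 1 / 100     # selectivity of n1 = 50
sel_n2    = 1 / 100     # selectivity of n2 = 50
sel_group = 1 / 100     # 1/num_distinct of the column group (n1, n2)

# Single predicate:
print(round(sel_n1 * num_rows))                                # 100

# AND, no column group (independence assumed):
print(round(sel_n1 * sel_n2 * num_rows))                       # 1

# AND, using the column group:
print(round(sel_group * num_rows))                             # 100

# OR, no column group:
print(round((sel_n1 + sel_n2 - sel_n1 * sel_n2) * num_rows))   # 199

# OR, if the optimizer used the column group for the AND term:
print(round((sel_n1 + sel_n2 - sel_group) * num_rows))         # 100
```

The last line is the estimate we would hope for; as the execution plans below show, it is not the one the optimizer produces.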

Here are the three execution plans (with a little cosmetic work) when the column group exists:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 AND "N2"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   199 | 23482 |    27   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   199 | 23482 |    27   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 OR "N2"=50)

As you can see from the results in the Rows column, the optimizer has used the column group statistics for the AND estimate, but not for the OR estimate, resulting in an estimate that is out by a factor of nearly 2.

It should be noted that failure to use the column group for AND’ed predicates can introduce a massive error in cardinality estimates but for OR’ed predicates the worst it can be is a factor of 2.
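The factor-of-2 bound is easy to check for the extreme case in this test (a sketch, assuming n1 and n2 are always equal, so the true selectivity of the OR is just s): the independence estimate is 2s – s², and the ratio of estimate to truth is 2 – s, which approaches but never reaches 2 as s shrinks:

```python
def or_ratio(s):
    """Ratio of the independence OR estimate to the true selectivity
    when the two columns are always equal (true selectivity = s)."""
    return (s + s - s * s) / s       # simplifies to 2 - s

for s in (0.5, 0.1, 0.01, 0.001):
    print(s, or_ratio(s))            # always below 2, approaching 2
```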

Summary

There is a possible enhancement to the optimizer’s use of column group statistics that could make the arithmetic consistent for AND and OR predicates and could allow the optimizer to produce better estimates for OR conditions.

Although any error in cardinality estimates can lead to undesirable execution plans, the scale of the error due to the current inconsistency will be at worst a factor of 2, which is unlikely to introduce many surprise plans. The nature of the error means the estimate will be too big, which might push the optimizer into using tablescans and hash joins when indexed accesses and nested loops might be the better choice.

6 Comments »

  1. […] A gap in the algorithm (Dec 2021): the optimizer doesn’t use the column group where it should for OR’ed predicates […]

    Pingback by Column Group Catalog | Oracle Scratchpad — December 3, 2021 @ 10:56 am GMT Dec 3,2021 | Reply

  2. I wonder if Nigel Bayliss reads your posts… He should.

    Comment by JAYT22 — December 4, 2021 @ 8:12 am GMT Dec 4,2021 | Reply

  3. […] Lewis has written an interesting article that talks about the OR predicate and column correlation. It is mentioned that if there are correlation […]

    Pingback by Disjunctive predicates and column correlation | Chinar Aliyev`s blog — December 7, 2021 @ 7:32 am GMT Dec 7,2021 | Reply

  4. Hi,

    Multi-column or extended statistics are relatively recent in the pg world but every supported version has them and I think they should be considered more widely.
    Here is a way to produce a similar test case with PostgreSQL 13 (it also works with pg 14 and pg 15devel). I wanted to know if the enhancement/feature mentioned on this page is OK.

    
    select version();
                                                              version
    ---------------------------------------------------------------------------------------------------------------------------
     PostgreSQL 13.5 (Debian 13.5-0+deb11u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
    (1 ligne)
    
    create table t1
    as
    with generator(rownum) as (
            select generate_series(1,1e4)
            )
    select
            mod(rownum,100)::integer                 n1,
            mod(rownum,100)::integer                 n2,
            lpad(rownum::text,10,'0')::integer             v1,
            lpad('x',100,'x')               padding
    from
            generator 
    ;
    SELECT 10000
    
    create statistics t1_n1_n2(dependencies, ndistinct, mcv) on n1,n2 from t1;
    CREATE STATISTICS
    
    analyze t1;
    ANALYZE
    
    SELECT reltuples FROM pg_class WHERE relname = 't1';
     reltuples
    -----------
         10000
    (1 ligne)
    
    select * from pg_statistic_ext_data;
    ...
    stxoid           | 16645
    stxdndistinct    | {"1, 2": 100}
    stxddependencies | {"1 => 2": 1.000000, "2 => 1": 1.000000}
    ...
    
    explain
    select  * 
    from    t1
    where
            n1 = 50
    ;
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t1  (cost=0.00..307.00 rows=100 width=113)
       Filter: (n1 = '50'::numeric)
    (2 lignes)
    
    explain
    select  * 
    from    t1
    where
            n1 = 50
    and     n2 = 50;
    
    
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t1  (cost=0.00..332.00 rows=100 width=113)
       Filter: ((n1 = 50) AND (n2 = 50))
    (2 lignes)
    
    explain
    select  *
    from    t1
    where
            n1 = 50
    or      n2 = 50
    ;
    
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t1  (cost=0.00..332.00 rows=100 width=113)
       Filter: ((n1 = 50) OR (n2 = 50))
    (2 lignes)
    

    It also works with PostgreSQL 12, but PostgreSQL 13 has further improvements, e.g. multi-column stats can be used by the planner with IN lists.

    The last query with an older version of PostgreSQL (10 or 11, extended stats but without most common values) gives:

    
    
            
    explain
    select  *
    from    t1
    where
            n1 = 50
    or      n2 = 50
    ;
                           QUERY PLAN
    --------------------------------------------------------
     Seq Scan on t1  (cost=0.00..332.00 rows=199 width=113)
       Filter: ((n1 = 50) OR (n2 = 50))
    (2 lignes)
    

    PostgreSQL 9.6 had no multi column statistics at all but this version is not supported anymore.

    Best regards,
    Phil

    Comment by Phil — December 20, 2021 @ 11:06 am GMT Dec 20,2021 | Reply

  5. Phil,

    It shouldn’t surprise you that I don’t know the meaning of all the bits and pieces you’ve shown me, but
    a) Clearly the OR cardinality estimate is correct
    b) It looks as if the “create statistics” statement has produced a correlation coefficient between n1 and n2 and applied it suitably to get this result.

    So, bottom line – it looks as if Postgres has a more subtle mechanism for handling column groups (or the equivalent) than Oracle. The next feature to check, though, is the cost of gathering the information that allows the Postgres column group to work. (And the cases where the feature is bypassed – Oracle has a lot of them.)

    If I were investigating the feature I think my next test would be to change the data pattern in the test to see how accurate Postgres was under fairly reasonable circumstances. I chose my example to make it easy to see that the Oracle estimate was clearly bad; but it’s also a boundary case that might have allowed an alternative strategy to produce the best possible result.

    Regards
    Jonathan Lewis

    Comment by Jonathan Lewis — December 20, 2021 @ 11:24 am GMT Dec 20,2021 | Reply

    • Hi Jonathan,
      I always run the same test https://pgphil.ovh/extended_stats_12_beta_01.php and limitations are progressively lifted. I didn’t test but Oracle probably has no problem to avoid nested loops in such a case, perhaps it does not even need extended statistics for that.
      At first I thought your test “works” with pg 13 because of some magic clever feature, I had read “Allow use of extended statistics objects for OR clauses and IN/ANY constant lists (Pierre Ducroquet, Tomas Vondra)” in PostgreSQL 13 release notes.
      Wrong assumption and it’s simpler, it’s just the Multivariate MCV Lists feature introduced with pg 12 that makes the difference cf https://www.postgresql.org/docs/12/planner-stats.html#PLANNER-STATS-EXTENDED .
      MCV is a very basic option: it stores the most common (n1,n2) combinations, and such a list cannot be unlimited. If the planner only knows the correlation coefficient between n1 and n2 and the number of distinct (n1,n2) combinations, there is no magic and the postgres planner gives the same estimate as the Oracle CBO.


      SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
      pg_mcv_list_items(stxdmcv) m WHERE stxname = 't1_n1_n2' and values = '{50,50}';
       index | values  | nulls | frequency | base_frequency
      -------+---------+-------+-----------+----------------
          50 | {50,50} | {f,f} |      0.01 |         0.0001
      (1 ligne)

      Base frequency is based on single column stats and frequency is the actual frequency of the combination. It would not be difficult to trick the feature to obtain a bad estimation since it depends on a record size.
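The two numbers in that MCV row can be reproduced directly (a sketch of the arithmetic, using the row counts from the test case above):

```python
# base_frequency is what single-column stats predict for the (50,50)
# combination: the product of the per-column selectivities.
freq_n1 = 100 / 10000        # 100 rows with n1 = 50 out of 10,000
freq_n2 = 100 / 10000        # 100 rows with n2 = 50

base_frequency = freq_n1 * freq_n2   # ~0.0001, as reported
frequency      = 100 / 10000         # 0.01: every matching row has both values
print(base_frequency, frequency)
```

The gap between frequency and base_frequency is exactly what lets the planner correct the independence estimate for this combination.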

      Best regards,
      Phil

      Comment by Phil — December 20, 2021 @ 2:13 pm GMT Dec 20,2021 | Reply



Comments and related questions are welcome.
