Oracle Scratchpad

October 22, 2018

Column Groups

Filed under: Bugs,CBO,Column groups,Indexing,Oracle,Statistics — Jonathan Lewis @ 5:36 pm BST Oct 22,2018

Sometimes a good thing becomes a bad thing when you hit some sort of special case – today’s post is an example of this that came up on the Oracle-L listserver a couple of years ago with a question about what the optimizer was doing. I’ll set the scene by creating some data to reproduce the problem:

rem
rem     Script:         distinct_key_prob.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2016
rem     Purpose:
rem
rem     Last tested
rem             23.3.0.0  (Part fixed)
rem             19.11.0.0
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

drop table t1 purge;

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        cast(mod(rownum-1,10) as number(8,0))   non_null,
        cast(null as number(8,0))               null_col,
        cast(lpad(rownum,10) as varchar2(10))   small_vc,
        cast(rpad('x',100) as varchar2(100))    padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

create index t1_i1 on t1(null_col, non_null);

begin

/*
        dbms_output.put_line(
                dbms_stats.create_extended_stats(user,'t1','(non_null, null_col)')
        );
*/

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

So I have a table with 1,000,000 rows; one of its columns is always null and another has a very small number of distinct values and is never null (though it hasn’t been declared as not null). I’ve created an index that starts with the “always null” column (in a production system we’d really be looking at a column that was “almost always” null, with a few special rows where the column was not null, so an index like this can make sense).

I’ve also got a few lines, commented out, to create extended stats on the column group (non_null, null_col) because any anomaly relating to the handling of the number of distinct keys in a multi-column index may also be relevant to column groups. I can run two variations of this code, one with the index, one without the index but with the column group, and see the same cardinality issue appearing in both cases.

So let’s execute a couple of queries – after setting up a couple of bind variables – and pull their execution plans from memory:


variable b_null    number
variable b_nonnull number

exec :b_null    := 5
exec :b_nonnull := 5

set serveroutput off

prompt  ===================
prompt  Query null_col only
prompt  ===================

select  count(small_vc)
from    t1
where
        null_col = :b_null
;

select * from table(dbms_xplan.display_cursor(null,null,'-plan_hash'));

prompt  =========================
prompt  Query (null_col,non_null)
prompt  =========================

select  count(small_vc)
from    t1
where
        null_col = :b_null
and     non_null = :b_nonnull
;

select * from table(dbms_xplan.display_cursor(null,null,'-plan_hash'));

The optimizer has statistics that tell it that null_col is always null so its estimate of rows where null_col = 5 should be zero (which will be rounded up to 1); and we have an index starting with null_col so we might expect the optimizer to use an index range scan on that index for these queries. Here are the plans that actually appeared:


SQL_ID  danj9r6rq3c7g, child number 0
-------------------------------------
select count(small_vc) from t1 where  null_col = :b_null

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE              |       |     1 |    24 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     1 |    24 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("NULL_COL"=:B_NULL)



SQL_ID  d8kbtq594bsp0, child number 0
-------------------------------------
select count(small_vc) from t1 where  null_col = :b_null and non_null =
:b_nonnull

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |       |  2189 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |    27 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |   100K|  2636K|  2189   (4)| 00:00:11 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("NULL_COL"=:B_NULL AND "NON_NULL"=:B_NONNULL))

Take a careful look at what we’ve got: the second query has to access exactly the same table rows as those identified by the first query and then apply a second predicate which may discard some of those rows – but the optimizer has changed the access path from a low-cost index-driven access to a high-cost tablescan. This is clearly idiotic – there has to be a flaw in the optimizer logic in this situation.

The defect revolves around a slight inconsistency in the handling of column groups – whether they are explicitly created, or simply inferred by reference to user_indexes.distinct_keys. The anomaly is most easily seen by explicitly creating the column group, gathering stats, and reporting from user_tab_cols.
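
If you’ve been running the “index only” variation of the script, a minimal sketch of the extra steps (effectively re-running the earlier anonymous block with the create_extended_stats() call uncommented) is:


begin
        dbms_output.put_line(
                dbms_stats.create_extended_stats(user,'t1','(non_null, null_col)')
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/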


select
        column_name, sample_size, num_distinct, num_nulls, density, histogram, data_default
from
        user_tab_cols
where
        table_name = upper('T1')
order by
        column_id
;

COLUMN_NAME                           Sample     Distinct  NUM_NULLS    DENSITY HISTOGRAM       DATA_DEFAULT
-------------------------------- ------------ ------------ ---------- ---------- --------------- --------------------------------------------
NON_NULL                            1,000,000           10          0         .1 NONE
NULL_COL                                                 0    1000000          0 NONE
SMALL_VC                            1,000,000      995,008          0 .000001005 NONE
PADDING                             1,000,000            1          0          1 NONE
SYS_STULC#01EE$DE1QB7UY1K4$PBI      1,000,000           10          0         .1 NONE            SYS_OP_COMBINED_HASH("NON_NULL","NULL_COL")

As you can see, the optimizer can note that “null_col” is always null so the arithmetic for “null_col = :bind1” is going to produce a very small cardinality estimate; on the other hand when the optimizer sees “null_col = :bind1 and non_null = :bind2” it’s going to transform this into the single predicate “SYS_STULC#01EE$DE1QB7UY1K4$PBI = sys_op_combined_hash(null_col, non_null)”, and the statistics say there are 10 distinct values for this (virtual) column with no nulls – hence the huge cardinality estimate and full tablescan.
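
Putting rough numbers to that (a sketch of the arithmetic, based only on the statistics reported above):

        null_col = :bind1 alone       : 1,000,000 rows - 1,000,000 nulls                    = 0 possible rows, rounded up to 1
        column group (combined hash)  : 1,000,000 rows * 1/10 distinct values, no nulls     = 100,000 rows  (the 100K in the tablescan plan)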

The “slight inconsistency” in handling that I mentioned above is that if you used a predicate like “null_col is null and non_null = :bind2” the optimizer would not use the column group because of the “is null” condition – even though it’s exactly the case where the column group statistics would be appropriate. (In the example I’ve constructed the optimizer’s estimate from ignoring the column group would actually be correct – and identical to the estimate it would get from using the column group – because the column is null for every single row.)
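
If you want to see that for yourself, a simple (sketch) addition to the earlier test script, re-using the bind variable that’s already in place, would be:


prompt  ==================================
prompt  Query (null_col is null, non_null)
prompt  ==================================

select  count(small_vc)
from    t1
where
        null_col is null
and     non_null = :b_nonnull
;

select * from table(dbms_xplan.display_cursor(null,null,'-plan_hash'));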

tl;dr

Column groups can give you some very bad estimates, and counter-intuitive behaviour, if any of the columns in the group has a significant percentage of nulls; this happens because the column group makes the optimizer lose sight of the number of nulls in the underlying data set.

Update Nov 2023:

Re-running the test on newer versions of Oracle I found that 19.11 [update: 19.20 – see comment #8] still gets the cardinality estimates wrong, but 23.3 produces the correct estimate. This change shows up in the 23c v$system_fix_control as (almost certainly) fix 27982637:

QKSFM_CARDINALITY_27982637         fix column group not null sel when one of the columns is all nul

The number doesn’t show up as a bug number on MOS, and there are no indications of patches or backports, but you might as well try asking for a backport if you’re running into the problem.
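
If you want to check whether a particular version or patch exposes the fix control, a simple query (a sketch against the standard v$system_fix_control view) is:


select  bugno, value, description
from    v$system_fix_control
where   bugno = 27982637
;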

Warning: note the importance of the penultimate word “all” in that description. If you have just one row with a non-null value then the cardinality error still shows up (assuming the stats reflect the reality).

11 Comments

  1. […] The optimizer’s inconsistency: Column groups lose information about the frequency of nulls in the underlying data. […]

    Pingback by Column Group Catalog | Oracle Scratchpad — October 22, 2018 @ 5:40 pm BST Oct 22,2018

  2. […] on its own. (The same problem appears with column groups – which is what I commented on in my previous post on this […]

    Pingback by Extreme Nulls | Oracle Scratchpad — December 14, 2018 @ 7:01 pm GMT Dec 14,2018

  3. Hi Jonathan – is there a way to fix the cardinality estimate for this query from your test case: select count(small_vc) from t1 where null_col = :b_null and non_null = :b_nonnull ?

    Comment by Bhavani P Dhulipalla — August 9, 2019 @ 3:27 am BST Aug 9,2019

  4. […] column group sample size matches the table sample size because column group hash values are never null, even if an underlying column is […]

    Pingback by Strange Estimates. | Oracle Scratchpad — October 30, 2019 @ 1:10 pm GMT Oct 30,2019

  5. […] colX is null is not an equality predicate, and column group stats will not apply but there can be unexpected side effects even for cases where you don’t use this “is null” […]

    Pingback by Optimizer Tip | Oracle Scratchpad — September 20, 2021 @ 9:04 am BST Sep 20,2021

  6. […] loop join, and the poor choice of plan was due an optimizer defect when handling column groups (fixed in 23c) when one or more of the columns involved is always, or nearly always, […]

    Pingback by Swap_Join_Inputs | Oracle Scratchpad — November 6, 2023 @ 2:47 pm GMT Nov 6,2023

  7. Dear Jonathan,

    Can you please help me to understand in which cases this “all null” column needs to be part of the index, and as a leading column?

    Comment by Anonymous — November 7, 2023 @ 3:26 am GMT Nov 7,2023

    • Anonymous,

      Sometimes I just answer questions that are basically about the arithmetic without pursuing the intention.
      It’s a fairly common strategy to have a “status” column where almost all the rows end up as “completed” and only a very few have some other status, and a good strategy for dealing with this is to use NULL as the “completed” status, or to create a function-based index that emulates this. The benefit of this strategy is that you can have a tiny index even though the table is huge.
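
      As a sketch of that second option (with purely hypothetical table and column names): rows where the indexed expression evaluates to NULL don’t appear in the index at all, so the index holds only the few rows that are not yet completed.

      create index ord_open_status_fbi on orders(
              case when status = 'COMPLETED' then null else status end
      );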

      Some 3rd party applications, however, will adopt this type of strategy but add extra columns after the “status” column. You may also see 3rd party applications which have “business-generic” tables where different customers use different subsets of columns and then have to deal with larger numbers of indexes that are only partially populated (with respect to which columns are used). Very often the business will resist dropping indexes that are irrelevant, leaving them with many “strange” indexes.

      Note, by the way, that the column doesn’t have to be the leading column in the index for the above arithmetic to be relevant.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — November 9, 2023 @ 1:27 pm GMT Nov 9,2023

  8. For what it’s worth, the cardinality estimate is still incorrect in 19.20

    |* 2 | TABLE ACCESS STORAGE FULL| T1 | 100K| 2636K| 4647 (1)| 00:00:01 |

    Comment by Anonymous — November 29, 2023 @ 5:18 pm GMT Nov 29,2023



