Oracle Scratchpad

May 20, 2022

Quiz Night

Filed under: Infrastructure,Oracle,Problem Solving,Statistics — Jonathan Lewis @ 6:09 pm BST May 20,2022

It’s a long time since I’ve done a quiz night – but here’s something that fooled me (briefly) when it appeared on the Oracle Developers’ Forum. Here’s a table definition – and I’m not going to make it easy by giving you a “create table” statement – but it’s just a simple heap table:

sql> desc interr_skuplannparam
 name                             null?    type
 -------------------------------- -------- -------------------
 atpdur                                    number(38)
 depdmdopt                                 number(38)
 externalskusw                             number(1)
 firstreplendate                           date
 lastfrzstart                              date
 lastplanstart                             date
 plandur                                   number(38)
 planleadtime                              number(38)
 planleadtimerule                          number(38)
 planshipfrzdur                            number(38)
 restrictdur                               number(38)
 allocbatchsw                              number(1)
 cmpfirmdur                                number(38)
 custservicelevel                          float(126)
 maxchangefactor                           float(126)
 mfgleadtime                               number(38)
 recschedrcptsdur                          number(38)
 cpppriority                               number(38)
 cpplocksw                                 number(1)
 criticalmaterialsw                        number(1)
 aggexcesssupplyrule                       number(38)
 aggundersupplyrule                        number(38)
 bufferleadtime                            number(38)
 maxoh                                     float(126)
 maxcovdur                                 number(38)
 drpcovdur                                 number(38)
 drpfrzdur                                 number(38)
 drprule                                   number(38)
 drptimefencedate                          date
 drptimefencedur                           number(38)
 incdrpqty                                 float(126)
 mindrpqty                                 float(126)
 mpscovdur                                 number(38)
 mfgfrzdur                                 number(38)
 mpsrule                                   number(38)
 mpstimefencedate                          date
 mpstimefencedur                           number(38)
 incmpsqty                                 float(126)
 minmpsqty                                 float(126)
 shrinkagefactor                           number(38)
 item                                      varchar2(50 char)
 loc                                       varchar2(50 char)
 expdate                                   date
 atprule                                   number(38)
 prodcal                                   varchar2(50 char)
 prodstartdate                             date
 prodstopdate                              date
 orderingcost                              float(126)
 holdingcost                               float(126)
 eoq                                       float(126)
 ff_trigger_control                        number(38)
 workingcal                                varchar2(50 char)
 lookaheaddur                              number
 orderpointrule                            number
 orderskudetailsw                          number(1)
 supsdmindmdcovdur                         number(38)
 orderpointminrule                         number(38)
 orderpointminqty                          float(126)
 orderpointmindur                          number(38)
 orderuptolevelmaxrule                     number(38)
 orderuptolevelmaxqty                      float(126)
 orderuptolevelmaxdur                      number(38)
 aggskurule                                number(38)
 fwdbuymaxdur                              number(38)
 costuom                                   number(38)
 cumleadtimedur                            number(38)
 cumleadtimeadjdur                         number(38)
 cumleadtimerule                           number(38)
 roundingfactor                            float(126)
 limitplanarrivpublishsw                   number(1)
 limitplanarrivpublishdur                  number
 maxohrule                                 number(1)
 integration_stamp                         date
 integration_jobid                not null varchar2(32 char)
 error_str                                 varchar2(2000 char)
 error_stamp                               date

The column integration_jobid (the single “not null” column) has been defined with the default value of “INT_JOB”, which takes 7 bytes to store. What’s the result of the query at the end of this little script:

truncate table interr_skuplannparam;

insert into interr_skuplannparam (atpdur) 
select   0 
from     all_objects 
where    rownum <= 10000
/

commit;

execute dbms_stats.gather_table_stats(user,'interr_skuplannparam')

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM' -- added later
;

Hint: the value zero is represented internally as a single byte holding the value 0x80 (decimal 128).
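
If you want to check that claim for yourself, the dump() function will show you the internal representation:

select dump(0,16) from dual;

DUMP(0,16)
----------------
Typ=2 Len=1: 80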

Answer

It’s fairly common knowledge that Oracle includes a “length byte” in the estimates for average column length and average row length. It’s also fairly well known that “trailing nulls take no space” in a row so don’t need a length byte. There are a couple of “less common knowledge” details to add, though:

  1. any column that is null in every row gets a zero for the avg_col_len even if it isn’t a “trailing null”.
  2. but point 1 doesn’t apply to date columns that are always null: they get a 1 for the avg_col_len even if the column is a “trailing null”. The same is true for the various timestamp and interval types. (See the sketch after this list.)
  3. for columns that hold at least one value the avg_col_len is the average over all rows of the actual space used by that column’s data, rounded up, plus 1 for the length byte.
  4. the avg_row_len is not the sum(avg_col_len); it is based on the average of the summed column lengths for each row, plus the count of the length bytes recorded.
  5. user-defined types, LOBs, varray types etc. introduce all sorts of other strange effects. (But that’s probably “more common” knowledge.)
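
Points 1 and 2 are easy to check with a throwaway model. Here’s a sketch (the table name is my own invention) where the always-null date d1 should report an avg_col_len of 1 while the always-null, trailing number n2 should report zero:

create table t_null_demo (n1 number, d1 date, n2 number);

insert into t_null_demo(n1)
select rownum from dual connect by level <= 100;

commit;

execute dbms_stats.gather_table_stats(user,'t_null_demo')

select  column_name, avg_col_len
from    user_tab_cols
where   table_name = 'T_NULL_DEMO'
order by
        column_id
;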

So what does that mean in my example where there’s a declared not null column near the end of the row, with two trailing columns and with every column except the first and the non-null column set to null for every single row in the table? The easy option is to create the model and show you the results of querying user_tab_cols.

break on report 
compute sum of avg_col_len on report

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM'
;

select 
        column_id, segment_column_id, column_name, data_type, avg_col_len 
from
         user_tab_cols 
where 
        table_name = 'INTERR_SKUPLANNPARAM'
and     avg_col_len != 0
order by
        column_id
;


============

AVG_ROW_LEN
-----------
         20

1 row selected.


 COLUMN_ID SEGMENT_COLUMN_ID COLUMN_NAME              DATA_TYPE                 AVG_COL_LEN
---------- ----------------- ------------------------ ------------------------- -----------
         1                 1 ATPDUR                   NUMBER                              2
         4                 4 FIRSTREPLENDATE          DATE                                1
         5                 5 LASTFRZSTART             DATE                                1
         6                 6 LASTPLANSTART            DATE                                1
        29                29 DRPTIMEFENCEDATE         DATE                                1
        36                36 MPSTIMEFENCEDATE         DATE                                1
        43                43 EXPDATE                  DATE                                1
        46                46 PRODSTARTDATE            DATE                                1
        47                47 PRODSTOPDATE             DATE                                1
        73                73 INTEGRATION_STAMP        DATE                                1
        74                74 INTEGRATION_JOBID        VARCHAR2                            8
        76                76 ERROR_STAMP              DATE                                1
                                                                                -----------
sum                                                                                      20

12 rows selected.


My query of user_tab_cols orders by column_id; technically it should order by segment_column_id to show the physical ordering in the data segment and allow for all the strange effects you can get in more complex scenarios, but in this very simple case the two values are the same.

As you can see, every date type (including the trailing error_stamp) has an avg_col_len of 1, even though all the dates are null in every row. Column atpdur has avg_col_len = 2, which is 1 byte for storing zero plus a length byte, and integration_jobid has avg_col_len = 8, which is 7 bytes for storing ‘INT_JOB’ plus a length byte.

In this case where every single row is identical there are no rounding effects due to the calculation of average column length (the column data stored is the same in every row) so the avg_row_len = sum(avg_col_len).

Conclusion

This was an extreme (though real) production case. I wrote this note because someone was asking why there was such a massive difference between their estimate of the space needed for a table and the actual space taken by the table.

There are many reasons why this could happen, of course, but in their case their estimate was based on multiplying the avg_row_len by the number of rows in the table – and they had a large number of columns that were always null but still required space for the length byte, so the avg_row_len was about half the actual stored space for the average row. (And then the avg_row_len doesn’t allow for the row overhead and row directory entry, which total another 5 bytes per row).
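
If you want a quick sanity check on an estimate like theirs, here’s a sketch (assuming a simple, non-partitioned heap table and an 8KB block size) that compares the avg_row_len arithmetic with the actual segment size:

select
        t.num_rows * t.avg_row_len      estimated_bytes,
        s.blocks * 8192                 segment_bytes
from
        user_tables     t,
        user_segments   s
where
        t.table_name    = 'INTERR_SKUPLANNPARAM'
and     s.segment_name  = t.table_name
;

The segment figure includes the block and row overheads and any free space, of course, so it will always be larger than the figure derived from avg_row_len.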

Lagniappe

It wasn’t part of the quiz, but once you’ve got a simple model you can always start playing with it to find out a little more, and to check how well your guesswork holds up as you make the scenario more complicated. For example, let’s update a few columns in just one row of the table, gather stats again, and report the row and column lengths.

update interr_skuplannparam
set 
        shrinkagefactor = 1234567890,
        item            = 'xxx',
        expdate         = sysdate
where
        rownum = 1
/

commit;

execute dbms_stats.gather_table_stats(user,'interr_skuplannparam')

select  avg_row_len 
from    user_tables
where   table_name = 'INTERR_SKUPLANNPARAM'
;

select 
        column_id, segment_column_id, column_name, data_type, avg_col_len 
from
        user_tab_cols 
where 
        table_name = 'INTERR_SKUPLANNPARAM'
and     avg_col_len != 0
order by
        column_id
;

============================

AVG_ROW_LEN
-----------
         22

1 row selected.


 COLUMN_ID SEGMENT_COLUMN_ID COLUMN_NAME              DATA_TYPE                 AVG_COL_LEN
---------- ----------------- ------------------------ ------------------------- -----------
         1                 1 ATPDUR                   NUMBER                              2
         4                 4 FIRSTREPLENDATE          DATE                                1
         5                 5 LASTFRZSTART             DATE                                1
         6                 6 LASTPLANSTART            DATE                                1
        29                29 DRPTIMEFENCEDATE         DATE                                1
        36                36 MPSTIMEFENCEDATE         DATE                                1
        40                40 SHRINKAGEFACTOR          NUMBER                              2
        41                41 ITEM                     VARCHAR2                            2
        43                43 EXPDATE                  DATE                                2
        46                46 PRODSTARTDATE            DATE                                1
        47                47 PRODSTOPDATE             DATE                                1
        73                73 INTEGRATION_STAMP        DATE                                1
        74                74 INTEGRATION_JOBID        VARCHAR2                            8
        76                76 ERROR_STAMP              DATE                                1
                                                                                -----------
sum                                                                                      25


The total of the avg_col_len has gone up from 20 to 25 – this is two bytes each for the shrinkagefactor and item columns (a tiny average for the stored data, plus 1 for a length byte), and one extra byte for the expdate column (a tiny average for the stored data). All three values rounded up from “1 and a bit” to 2.

The avg_row_len, however, has gone up by only 2 – which I am going to assume is the two new length bytes, with no allowance for the impact of the one row in 10,000 that is now a few bytes longer. It looks as if the rounding rules for the row length may be different from the rounding (up) rules for the column length.

April 11, 2022

Index Upgrade

Filed under: 19c,Indexing,Oracle,Statistics — Jonathan Lewis @ 4:55 pm BST Apr 11,2022

Sometimes wishes come true and in 19c – with fix_control QKSFM_DBMS_STATS_27268249 – one of mine did. The description of this fix (which is enabled by default) is: “use approximate ndv for computing leaf blocks and distinct keys”.

Here’s a key item in the output file from running tkprof against the trace file generated by a simple call to:

execute dbms_stats.gather_index_stats(user,'t1_i2')

The index is a two-column index on t1(x1, x2) with a size of roughly 16,000 blocks on a table of approximately 6 million rows.

select /*+ opt_param('_optimizer_use_auto_indexes' 'on')
  no_parallel_index(t, "T1_I2")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
  opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  no_expand
  index_ffs(t,"T1_I2") */ count(*) as nrw,
  approx_count_distinct(sys_op_lbid(106818,'L',t.rowid)) as nlb,
  approx_count_distinct(sys_op_combined_hash("X1","X2")) as ndk,
  sys_op_countchg(substrb(t.rowid,1,15),1) as clf
from
 "TEST_USER"."T1" t where "X1" is not null or "X2" is not null

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE APPROX (cr=15821 pr=0 pw=0 time=2812116 us starts=1)
   6018750    6018750    6018750   INDEX FAST FULL SCAN T1_I2 (cr=15821 pr=0 pw=0 time=894658 us starts=1 cost=2117 size=192000000 card=6000000)(object id 106818)


The first point of interest is the appearance of the approx_count_distinct() function calls used for the nlb (number of leaf blocks) and ndk (number of distinct keys) columns. It’s also worth noting that the ndk value is derived from a call to sys_op_combined_hash() applied to the two base columns, which means the number of distinct keys for a multi-column index is calculated in exactly the same way as the number of distinct values for a column group.
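
Since both functions can be called from ordinary SQL you can reproduce the ndk calculation for yourself – a sketch using the table and columns from the traced statement:

select
        approx_count_distinct(sys_op_combined_hash(x1, x2))    ndk
from
        t1
where
        x1 is not null or x2 is not null
;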

There are two more important details though: first, that the mechanism uses a fast full scan of the whole index; second, that the size of this index is about 16,000 blocks.

A final (unrelated) point is the little reminder in the hints that 19c includes an automatic indexing mechanism. It’s easy to forget such things when your overnight batch job takes longer than usual.

For comparison purposes, the following shows the effect of disabling the feature:

alter session set "_fix_control"='27268249:0';


select /*+ opt_param('_optimizer_use_auto_indexes' 'on')
  no_parallel_index(t, "T1_I2")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring xmlindex_sel_idx_tbl
  opt_param('optimizer_inmemory_aware' 'false') no_substrb_pad  no_expand
  index_ffs(t,"T1_I2") */ count(*) as nrw,count(distinct sys_op_lbid(106818,
  'L',t.rowid)) as nlb,count(distinct hextoraw(sys_op_descend("X1")
  ||sys_op_descend("X2"))) as ndk,sys_op_countchg(substrb(t.rowid,1,15),1) as
  clf
from
 "TEST_USER"."T1" sample block (  7.0114135742,1)  t where "X1" is not null
  or "X2" is not null

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT GROUP BY (cr=1132 pr=0 pw=0 time=460459 us starts=1)
    421761     421761     421761   INDEX SAMPLE FAST FULL SCAN T1_I2 (cr=1132 pr=0 pw=0 time=67203 us starts=1 cost=150 size=8413700 card=420685)(object id 106818)

The calculations for nlb and ndk are simple count()s, and the thing that ndk counts is a messy concatenation of the columns, hextoraw(sys_op_descend(“X1”) || sys_op_descend(“X2”)), that Oracle has used to ensure that values like ‘AB’ || ‘CD’ and ‘ABC’ || ‘D’ don’t get counted as the same key.

Perhaps most significant for some people is that the execution plan shows us that the index fast full scan was a SAMPLE which analyzed (a fairly typical) 1,132 blocks out of 16,000 and roughly 400,000 rows out of 6 million. By comparison the new full scan looks a bit of a threat, of course; but there may be a few critical indexes where the extra workload will stop random variations in execution plans when it really matters.

As with so many details of Oracle there are likely to be cases where the new method is hugely beneficial, and some where it’s a nuisance, so it’s good to know that you can be a little selective about when it gets used.

Footnote

Don’t forget that it’s a good idea to think about setting the table preference “table_cached_blocks” to allow Oracle to produce a better value for the clustering_factor. This is another mechanism that increases the CPU required to gather index stats.
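
For example – a sketch using the table and index from this test, and the commonly suggested starting value of 16 (there’s more about choosing a value in the “Optimizer Tip” note further down this page):

execute dbms_stats.set_table_prefs(user, 't1', 'table_cached_blocks', 16)
execute dbms_stats.gather_index_stats(user, 't1_i2')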

It’s an odd little detail that the fixed control appeared in 19.3.0.0 according to my history of v$system_fix_control and certainly wasn’t in 18.3.0.0 – but the entry in the view thinks that it was available from Oracle 8.0.0.0.

January 26, 2022

system_stats() hint

Filed under: 18c,19c,Hints,Oracle,Statistics,System Stats — Jonathan Lewis @ 9:46 am GMT Jan 26,2022

Starting from 18.3 Oracle introduced the system_stats() hint, apparently allowing you to set the system statistics for the duration of a query. However the hint didn’t seem to have any effect in that version of Oracle – even though the fix_control that seemed to be the most relevant (QKSFM_DBMS_STATS_24952618) was set to 1 – so maybe the hint was acting strictly according to the fix control description, which was: “turn on Exadata stats: MBRC,IOTFRSPEED,IOSEEKTIME” (or maybe the fix control had nothing to do with the hint).

According to my notes I had a test that showed it working on live SQL, which (in my notes) I said was running 19.2 at the time; however, I can’t get it to work on 19.11.0.0 or 21.3.0.0 on a Linux VM (or on the current Live SQL version) despite a load of fiddling with potentially relevant hidden parameters, fix controls, and numeric event numbers. So maybe it is only for Exadata.

It’s not documented, of course, but I’m fairly confident I’m using the correct syntax – which was quite easy to find (sometimes you get lucky) because a search through the binary for the hint text produced a perfect result:


[oracle@linux183 bin]$ strings -a oracle | grep -T -n  -i system_stats\(
1762556:BEGIN :1 := dbms_stats_internal.store_system_stats(:2, :3, :4); END;
1787190:system_stats(mbrc=%f ioseektim=%f iotfrspeed=%f)

So it would seem (from line 1787190) that we can override three of the system statistics: mbrc, ioseektim, and iotfrspeed. Thanks to the hint_report option that 19c introduced for the display_xxxxxx() calls in dbms_xplan it’s easy to see that this syntax is correct but unused. The report below, from 19.11.0.0, came from a dbms_xplan.display_cursor() call along the lines of the following sketch (the exact format options may vary):
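
select * from table(
        dbms_xplan.display_cursor(format => 'typical +hint_report')
);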

select  /*+ system_stats(mbrc=128 ioseektim=1 iotfrspeed=262144) */ count(*) from t1

Plan hash value: 3724264953

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |       |  2732 (100)|          |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50000 |  2732   (1)| 00:00:01 |
-------------------------------------------------------------------

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   0 -  STATEMENT
         U -  system_stats(mbrc=128 ioseektim=1 iotfrspeed=262144)

Other tests reported shorter versions of the hint (e.g. /*+ system_stats(mbrc=128) */ ) as errors:


Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (E - Syntax error (1))
---------------------------------------------------------------------------
   1 -  SEL$1
         E -  system_stats

In passing, it’s interesting to note that the text was reported as a “query block” hint (sel$1) when it had a syntax error despite being a “statement-level” hint when it was recognised. Presumably the generic parsing rule is: “it’s a query block hint unless proved otherwise”.

The call to dbms_stats_internal.store_system_stats() that also dropped out of the scan of the executable looks as if it’s the function that sets the “noworkload” statistics – the three parameters are, in order: ioseektim, iotfrspeed, cpuspeednw – but possibly it’s the internal call used when you use the ‘EXADATA’ option for gathering system stats.
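
Of course, if all you want to do is change the stored system statistics there’s no need for an undocumented hint – the documented dbms_stats.set_system_stats() call will do it, with the obvious difference that it applies to every statement rather than a single query. A sketch, echoing the values from my hint:

begin
        dbms_stats.set_system_stats('mbrc',       128);
        dbms_stats.set_system_stats('ioseektim',  1);
        dbms_stats.set_system_stats('iotfrspeed', 262144);
end;
/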

Bottom line:

Maybe there’s a way to switch this hint on to override the default system stats; maybe it just needs to be run on Exadata; and maybe – if it can be switched on – it could be attached as an SQL_Patch.  Experimentation is left to readers who have access to an Exadata system; any results are welcome.

December 3, 2021

Column Groups

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 10:40 am GMT Dec 3,2021

Here’s a little test that should have been the second test I ran when column groups were first made available in Oracle. It’s a check on the second of the two basic algorithms that Oracle uses for combining predicates:

  1. sel(predA and predB) = sel(predA) * sel(predB)
  2. sel(predA or predB) = sel(predA) + sel(predB) – sel(predA and predB)

As I pointed out many years ago in Cost Based Oracle – Fundamentals these are the formulae for combining probabilities of independent events. But if the predicates are not independent the formulae make the optimizer produce bad cardinality estimates and that’s why the Oracle developers introduced column groups – virtual columns that gather statistics about combinations of columns.

So let’s set up an extreme example and test the results (I’ve run it on 21c and 19c, but the issue is the same for all versions with column groups):

rem
rem     Script:         column_group_and_or.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2021
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        mod(rownum,100)                 n1,
        mod(rownum,100)                 n2,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator 
;

prompt  =========================================
prompt  Run the script twice, 
prompt  first without gathering the column group
prompt  then after gathering the column group
prompt  =========================================

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for columns (n1, n2) size 1'
        );
end;
/

column column_name format a32

select
        column_name, num_distinct, data_default
from
        user_tab_cols
where
        table_name = 'T1'
/

This script gives you a table with 10,000 rows, where the n1 and n2 columns always have exactly the same values as each other, and each column has 100 rows for each of 100 distinct values. If you allow the pl/sql block to run, the column stats query will report the following:

COLUMN_NAME                      NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ --------------------------------------------------------------------------------
N1                                        100
N2                                        100
V1                                      10000
PADDING                                     1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS            100 SYS_OP_COMBINED_HASH("N1","N2")

5 rows selected.

After creating the data (and statistics) we can execute the following simple statements:

set autotrace traceonly explain

select  * 
from    t1
where
        n1 = 50
/

select  * 
from    t1
where
        n1 = 50
and     n2 = 50
/

select  *
from    t1
where
        n1 = 50
or      n2 = 50
/

set autotrace off

We know that the first query will return 100 rows, and that the optimizer has enough information to get the correct estimate: the selectivity of “n1 = 50” is 1/100, the number of rows in the table is 10,000, so the estimate should be 1/100 * 10,000 = 100.

For the second query: if we don’t create the column group we expect the optimizer to multiply the selectivities of the two predicates together before multiplying by the number of rows in the table to get an estimate of (1/100 * 1/100) * 10,000 = 1

If we’ve created the column group we expect the optimizer to use the column group selectivity to calculate its estimate (1/100 * 10,000) = 100.

Then we get to the third query: if we don’t create the column group we expect the optimizer to slot 1/100 into the 4 places it has to appear in the formula to get an estimate of (1/100 + 1/100 – (1/100 * 1/100)) * 10,000 = 199.

If we have created the column group the optimizer can use the column group’s 1/num_distinct for the third term in the formula to get: (1/100 + 1/100 – (1/100)) * 10,000 = 100

Here are the three execution plans (with a little cosmetic work) when the column group exists:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   100 | 11800 |    26   (4)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   100 | 11800 |    26   (4)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 AND "N2"=50)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   199 | 23482 |    27   (8)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   199 | 23482 |    27   (8)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=50 OR "N2"=50)

As you can see from the results in the Rows column, the optimizer has used the column group statistics for the AND estimate, but not for the OR estimate, resulting in an estimate that is out by a factor of nearly 2.

It should be noted that failure to use the column group for AND’ed predicates can introduce a massive error in cardinality estimates, but for OR’ed predicates the worst it can be is a factor of 2.

Summary

There is a possible enhancement to the optimizer’s use of column group statistics that could make the arithmetic consistent for AND and OR predicates and could allow the optimizer to produce better estimates for OR conditions.

Although any error in cardinality estimates can lead to undesirable execution plans, the scale of the error due to the current inconsistency will be at worst a factor of 2, which is unlikely to introduce many surprise plans. The nature of the error means the estimate will be too big, which might push the optimizer into using tablescans and hash joins when indexed accesses and nested loops might be the better choice.

September 20, 2021

Optimizer Tip

Filed under: CBO,extended stats,Indexing,Oracle,Statistics — Jonathan Lewis @ 9:04 am BST Sep 20,2021

This is a note I drafted in March 2016, starting with a comment that it had been about the same time the previous year that I had written:

I’ve just responded to the call for items for the “IOUG Quick Tips” booklet for 2015 – so it’s probably about time to post the quick tip that I put into the 2014 issue. It’s probably nothing new to most readers of the blog, but sometimes an old thing presented in a new way offers fresh insights or better comprehension.

I keep finding ancient drafts like this and if they still seem relevant – even if they are a little behind the times – I’ve taken to dusting them down and publishing. (There are still more than 730 drafts on my blog at present – which gives me scope for one per day for the next 2 years!)

With the passing of time, though, new information becomes available, algorithms change and (occasionally) I discover I’ve made a significant error in a hypothesis (or guess). In this case there are a couple of important additions that I’ve added to the end of the original note.

Optimizer Tips (IOUG Quick Tips 2015)

There are two very common reasons why the optimizer picks a bad execution plan. The first is that its estimate of the required data volume is bad, the second is that it has a misleading impression of how scattered that data is.

The first issue is often due to problems with the selectivity of complex predicates, the second to unsuitable values for the clustering_factor of potentially useful indexes. Recent [ed: i.e. as at 2015] versions of the Oracle software have given us features that try to address both these issues and I’m going to comment on them in the following note.

As always any change can have side effects; introducing a new feature might have no effect on 99% of what we do, a beneficial effect on 99% of the remainder, and a hideous effect on the 1% of 1% that’s left, so I will be commenting on both the pros and cons of both features.

Column Group Stats

The optimizer assumes that the data in two different columns of a single table are independent – for example the registration number on your car (probably) has nothing to do with the account number of your bank account. So when we execute queries like:

     colX = 'abcd'
and  colY = 'wxyz'

the optimizer’s calculations will be something like:

“one row in 5,000 is likely to have colX = ‘abcd’ and one row in 2,000 is likely to have colY = ‘wxyz’, so the combination will probably appear in roughly one row in ten million”.

On the other hand we often find tables that do things like storing post codes (zipcodes) in one column and city names in another, and there’s a strong correlation between post codes and city – for example the district code (first part of the post code) “OX1” will be in the city of Oxford (Oxfordshire, UK). So if we query a table of addresses for rows where:

     district_code = 'OX1'
and  city          = 'Oxford'

there’s a degree of redundancy, but the optimizer will multiply the total number of distinct district codes in the UK by the total number of distinct city names in the UK as it tries to work out the number of addresses that match the combined predicate and will come up with a result that is far too small.

In cases like this we can define “column group” statistics about combinations of columns that we query together, using the function dbms_stats.create_extended_stats(). This function will create a type of virtual column for a table and report the system-generated name back to us, and we will be able to see that name in the view user_tab_cols, and the definition in the view user_stat_extensions. If we define a column group in this way we then need to gather stats on it, which we can do in one of two ways, either by using the generated name or by using the expression that created it.


SQL> create table addresses (district_code varchar2(8), city varchar2(40));

Table created.

SQL> execute dbms_output.put_line( -
>        dbms_stats.create_extended_stats( -
>            user,'addresses','(district_code, city)'))

SYS_STU12RZM_07240SN3V2667EQLW

PL/SQL procedure successfully completed.

begin
        dbms_stats.gather_table_stats(
                user, 'addresses',
                method_opt => 'for columns SYS_STU12RZM_07240SN3V2667EQLW size 1'
        );
        dbms_stats.gather_table_stats(
                user, 'addresses',
                method_opt => 'for columns (district_code, city) size 1'
        );
end;
/

I’ve included both options in the anonymous pl/sql block, but you only need one of them. In fact if you use the second one without calling create_extended_stats() first Oracle will create the column group implicitly, but you won’t know what it’s called until you query user_stat_extensions.
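
For example – a sketch against the addresses table above:

select  extension_name, extension, creator, droppable
from    user_stat_extensions
where   table_name = 'ADDRESSES'
;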

I’ve limited the stats collection to basic stats with the “size 1” option. You can collect a histogram on a column group but since the optimizer can only use a column group with equality predicates you should only create a histogram in the special cases where you know that you’re going to get a frequency histogram or “Top-N” histogram.

You can also define extended stats on expressions (e.g. trunc(delivery_date) – trunc(collection_date)) rather than column groups, but since you’re only allowed 20 column groups per table [but see update 1] it would be better to use virtual columns for expressions since you can have as many virtual columns as you like on a table provided the total column count stays below the limit of 1,000 columns per table.

Warnings

  • Column group statistics are only used for equality expressions. [see also update 2]
  • Column group statistics will not be used if you’ve created a histogram on any of the underlying columns unless there’s also a histogram on the column group itself.
  • Column group statistics will not be used if you query any of the underlying columns with an “out of range” value. This, perhaps, is the biggest instability threat with column groups. As time passes and new data appears you may find people querying the new data. If you haven’t kept the column stats up to date then plans can change dramatically as the optimizer switches from using column group stats to multiplying the selectivities of underlying columns.
  • The final warning arrives with 12c. If you have all the adaptive optimizer options enabled the optimizer will keep a look out for tables that it thinks could do with column group stats, and automatically creates them the next time you gather stats on the table. In principle this shouldn’t be a problem – the optimizer should only do this when it has seen that column group stats should improve performance – but you might want to monitor your system for the arrival of new automatic columns (see the sketch after this list).
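
If you do want to check for (and possibly remove) automatic column groups, the creator column of the xxx_stat_extensions views distinguishes them from your own, and dbms_stats.drop_extended_stats() will get rid of one. A sketch, reusing the earlier example:

select  table_name, extension_name, extension
from    user_stat_extensions
where   creator = 'SYSTEM'
;

begin
        dbms_stats.drop_extended_stats(user, 'addresses', '(district_code, city)');
end;
/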

Preference: table_cached_blocks

Even when the cardinality estimates are correct we may find that we get an inefficient execution plan because the optimizer doesn’t want to use an index that we think would be a really good choice. A common reason for this failure is that the clustering_factor on the index is unrealistically large.

The clustering_factor of an index is a measure of how randomly you will jump around the table as you do an index range scan through the index and the algorithm Oracle uses to calculate this measure has a serious flaw in it: it can’t tell the difference between a little bit of localised jumping and constant random leaps across the entire width of the table.

To calculate the clustering_factor Oracle basically walks the entire index in order using the rowid at the end of each index entry to check which table block it would have to visit, and every time it has to visit a “new” table block it increments a counter. The trouble with this approach is that, by default, it doesn’t remember its recent history so, for example, it can’t tell the difference in quality between the following two sequences of table block visits:

Block 612, block 87, block 154, block  3, block 1386, block 834, block 237
Block  98, block 99, block  98, block 99, block   98, block  99, block  98

In both cases Oracle would say that it had visited seven different blocks and the data was badly scattered. This has always been a problem, but it became much more of a problem when Oracle introduced ASSM (automatic segment space management). The point of ASSM is to ensure that concurrent inserts from different sessions tend to use different table blocks, the aim being to reduce contention due to buffer busy waits. As we’ve just seen, though, the clustering_factor doesn’t differentiate between “a little bit of scatter” and “a totally random disaster area”.

Oracle finally addressed this problem by introducing a “table preference” which allows you to tell it to “remember history” when calculating the clustering_factor. So, for example, a call like this:

execute dbms_stats.set_table_prefs(user,'t1','table_cached_blocks',16)

would tell Oracle that the next time you collect statistics on any indexes on table t1 the code to calculate the clustering_factor should remember the last 16 table blocks it had “visited” and not increment the counter if the next block to visit was already in that list.

If you look at the two samples above, this means the counter for the first list of blocks would reach 7 while the counter for the second list would only reach 2. Suddenly the optimizer will be able to tell the difference between data that is “locally” scattered and data that really is randomly scattered. You and the optimizer may finally agree on what constitutes a good index.

It’s hard to say whether there’s a proper “default” value for this preference. If you’re using ASSM (and there can’t be many people left who aren’t) then the obvious choice for the parameter would be 16 since ASSM tends to format 16 consecutive blocks at a time when a segment needs to make more space available for users [but see Update 3]. However, if you know that the real level of insert concurrency on a table is higher than 16 then you might be better off setting the value to match the known level of concurrency.

Are there any special risks to setting this preference to a value like 16? I don’t think so; it’s going to result in plans changing, of course, but indexes which should have a large clustering_factor should still end up with a large clustering_factor after setting the preference and gathering statistics; the indexes that ought to have a low clustering_factor are the ones most likely to change, and change in the right direction.
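
A before-and-after check is easy, of course – a sketch:

select  index_name, clustering_factor
from    user_indexes
where   table_name = 'T1'
;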

Footnote: “Danger, Will Robinson”.

I’ve highlighted two features that are incredibly useful as tools to give the optimizer better information about your data and allow it to get better execution plans with less manual intervention. The usual warning applies, though: “if you want to get there, you don’t want to start from here”. When you manipulate the information the optimizer is using it will give you some new plans; better information will normally result in better plans but it is almost inevitable that some of your current queries are running efficiently “by accident” (possibly because of bugs) and the new code paths will result in some plans changing for the worse.

Clearly it is necessary to do some thorough testing but fortunately both features are incremental and any changes can be backed out very quickly and easily. We can change the table_cached_blocks one table at a time (or even, with a little manual intervention, one index at a time) and watch the effects; we can add column groups one at a time and watch for side effects. All it takes to back out of a change is a call to gather index stats, or a call to drop extended stats. It’s never nice to live through change – especially change that can have a dramatic impact – but if we find after going to production that we missed a problem with our testing we can reverse the changes very quickly.

Updates

Update 1 – 20 sets of extended stats. In fact the limit is the larger of 20 and ceiling(column count/10), and the way the arithmetic is applied is a little odd so there are ways to hack around the limit.

Update 2 – Column groups and equality. It’s worth a special mention that the predicate “colX is null” is not an equality predicate, so column group stats will not apply; there can also be unexpected side effects even in cases where you don’t use this “is null” predicate. (More articles here about column groups.)

Update 3 – table_cached_blocks = 16. This suggestion doesn’t cater for systems running RAC.

August 9, 2021

Preferences

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 12:58 pm BST Aug 9,2021

I’ve made a few comments in the past about setting “table preferences” for stats collection – most significantly the table_cached_blocks preference that affects the calculation of the clustering_factor of all the indexes on the table, the incremental preference for dictating the strategy used for dealing with partitioned tables, and the method_opt preference for dictating precise requirements for histograms.

If you want to check the current preferences set for a table you can query the XXXX_tab_stat_prefs views. For some reason in the dim and distant past (perhaps in a beta release before the views had been created, or perhaps because the views show only the preferences that have been set) I wrote a little script to report all the possible table preferences showing both the table value and the current global value.
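
For completeness, here’s a sketch of the simple check that the views now allow:

select  preference_name, preference_value
from    user_tab_stat_prefs
where   table_name = 'T1'
;

Remember, though, that this reports only the preferences that have been explicitly set for the table – which is why the script below goes to the underlying sys objects.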

rem
rem     Script: get_table_prefs.sql
rem     Dated:  ???
rem     Author: Jonathan Lewis
rem
rem     Last tested
rem             19.11.0.0
rem
rem     Notes
rem     Report the table preferences for a given
rem     owner and table.
rem
rem     Needs to find a list of all legal preferences.
rem             Global prefs are in:    optstat_hist_control$ (sname, spare4)
rem             Table prefs are in:     optstat_user_prefs$ (valchar / valnum)
rem
rem     The public view is dba_tab_stat_prefs / user_tab_stat_prefs.
rem     But if a table has no prefs set there are no rows in the view
rem
rem     This script currently has to be run by sys or a user with 
rem     the select privileges on sys.optstat_hist_control$ (and
rem     execute on dbms_stats).
rem

define m_owner = '&enter_schema'
define m_table = '&enter_tablename'


<<anon_block>>
declare
        pref_count      number(2,0) := 0;
begin
        dbms_output.new_line;
        dbms_output.put_line(
                        rpad('Preference',32) || ' ' ||
                        rpad('Table value',32) || ' ' ||
                        '[Global value]'
        );
        dbms_output.put_line(
                        rpad('=',32,'=') || ' ' ||
                        rpad('=',32,'=') || ' ' ||
                        '================================'
        );
        for c1 in (
                select  sname, spare4 
                from    sys.optstat_hist_control$
                where   spare4 is not null
        ) loop
                anon_block.pref_count := anon_block.pref_count + 1;
                
                dbms_output.put_line(
                        rpad(c1.sname,32) || ' ' ||
                        rpad(dbms_stats.get_prefs(c1.sname,'&m_owner','&m_table'),32) || ' ' 
                        || '[' || c1.spare4 || ']'
                );      

        end loop;
        dbms_output.new_line;
        dbms_output.put_line('Preferences reported: ' || anon_block.pref_count);
end;
/

While I’ve hardly ever used the script – and so haven’t considered reviewing the strategy it uses – the benefit of having it around means that when I have run it I’ve occasionally discovered new preferences that I hadn’t previously noticed (and ought to investigate).

Here’s a sample of the output – from a table with no special settings for preferences:

Preference                       Table value                      [Global value]
================================ ================================ ================================
TRACE                            0                                [0]
DEBUG                            0                                [0]
SYS_FLAGS                        1                                [1]
SPD_RETENTION_WEEKS              53                               [53]
CASCADE                          DBMS_STATS.AUTO_CASCADE          [DBMS_STATS.AUTO_CASCADE]
ESTIMATE_PERCENT                 DBMS_STATS.AUTO_SAMPLE_SIZE      [DBMS_STATS.AUTO_SAMPLE_SIZE]
DEGREE                           NULL                             [NULL]
METHOD_OPT                       FOR ALL COLUMNS SIZE AUTO        [FOR ALL COLUMNS SIZE AUTO]
NO_INVALIDATE                    DBMS_STATS.AUTO_INVALIDATE       [DBMS_STATS.AUTO_INVALIDATE]
GRANULARITY                      AUTO                             [AUTO]
PUBLISH                          TRUE                             [TRUE]
STALE_PERCENT                    10                               [10]
APPROXIMATE_NDV                  TRUE                             [TRUE]
APPROXIMATE_NDV_ALGORITHM        REPEAT OR HYPERLOGLOG            [REPEAT OR HYPERLOGLOG]
ANDV_ALGO_INTERNAL_OBSERVE       FALSE                            [FALSE]
INCREMENTAL                      FALSE                            [FALSE]
INCREMENTAL_INTERNAL_CONTROL     TRUE                             [TRUE]
AUTOSTATS_TARGET                 AUTO                             [AUTO]
CONCURRENT                       OFF                              [OFF]
JOB_OVERHEAD_PERC                1                                [1]
JOB_OVERHEAD                     -1                               [-1]
GLOBAL_TEMP_TABLE_STATS          SESSION                          [SESSION]
ENABLE_TOP_FREQ_HISTOGRAMS       3                                [3]
ENABLE_HYBRID_HISTOGRAMS         3                                [3]
TABLE_CACHED_BLOCKS              1                                [1]
INCREMENTAL_LEVEL                PARTITION                        [PARTITION]
INCREMENTAL_STALENESS            ALLOW_MIXED_FORMAT               [ALLOW_MIXED_FORMAT]
OPTIONS                          GATHER                           [GATHER]
GATHER_AUTO                      AFTER_LOAD                       [AFTER_LOAD]
STAT_CATEGORY                    OBJECT_STATS, REALTIME_STATS     [OBJECT_STATS, REALTIME_STATS]
SCAN_RATE                        0                                [0]
GATHER_SCAN_RATE                 HADOOP_ONLY                      [HADOOP_ONLY]
PREFERENCE_OVERRIDES_PARAMETER   FALSE                            [FALSE]
AUTO_STAT_EXTENSIONS             OFF                              [OFF]
WAIT_TIME_TO_UPDATE_STATS        15                               [15]
ROOT_TRIGGER_PDB                 FALSE                            [FALSE]
COORDINATOR_TRIGGER_SHARD        FALSE                            [FALSE]
MAINTAIN_STATISTICS_STATUS       FALSE                            [FALSE]
AUTO_TASK_STATUS                 OFF                              [OFF]
AUTO_TASK_MAX_RUN_TIME           3600                             [3600]
AUTO_TASK_INTERVAL               900                              [900]

Preferences reported: 41

As the notes that I’ve left in-line say: this version of the script has to be run by SYS or a DBA because of the privileges required.

You might notice, by the way, that this is one of those rare cases where I’ve remembered to use a label to name the PL/SQL block, and then used the label to qualify the name of a variable I’ve used inside the block.

February 16, 2021

Adaptive error

Filed under: CBO,dbms_xplan,Oracle,Statistics — Jonathan Lewis @ 5:41 pm GMT Feb 16,2021

There’s a thread on the Oracle Database Forum at present where someone has supplied a script to create some data that’s guaranteed to reproduce wrong results (provided your system stats and optimizer parameters are at their default values). They’ve even supplied a link to the script on LiveSQL (opens in new window) – which is running 19.8 – to demonstrate the problem.

I’ve tested on 12.2.0.1 and 19.3.0.0 and the problem occurs in both versions – though with my setup the initial plan that returned the wrong results didn’t re-optimize to a plan with the correct results in 12.2.0.1.

I’ve included a script at the end of the note to create the data set but I’ll describe some of the objects as we go along – starting with a query that gives the correct result, apparently because it’s been hinted to do so:

execute dbms_stats.delete_system_stats

set linesize 255
set pagesize  60
set trimspool on

alter session set statistics_level='all';
set serveroutput off

select 
        /*+ use_hash(dwf) */ 
        count(*) count_hash 
from 
        test_dwf_sapfi  dwf
where
         exists (
                select  1 
                from    test_sapfi_coicar_at5dat11      coi
                where   coi.datumzprac = 20200414
                and     to_char(coi.datuct,'yyyymmdd') = dwf.datumucetnipom_code
        );

select * from table(dbms_xplan.display_cursor(format=>'cost outline allstats last partition hint_report adaptive'));


test_dwf_sapfi is a table with a single numeric column, datumucetnipom_code; the table is list partitioned by that column with 61 partitions. Each partition is defined to hold a single value. The number is designed to look like a date in the format YYYYMMDD.
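
The full creation script appears at the end of the note, but as a sketch that description implies a definition of roughly this shape (partition names invented, only two of the 61 partitions shown):

create table test_dwf_sapfi (
        datumucetnipom_code     number
)
partition by list (datumucetnipom_code) (
        partition p_20200414 values (20200414),
        partition p_20200415 values (20200415)
        -- ... and so on, one partition per value, 61 in all
);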

test_sapfi_coicar_at5dat11 is a table with two columns (datuct, datumzprac). The first column is a date column with data covering a range of 60 dates, the second column is a numeric column and the table is list partitioned on that column. All the data in the table is in one partition of that table and the column holds the same value for every row (again it’s a number that looks like a date).

There are 15,197 rows in each table, and the test_dwf_sapfi data has been created as a copy (with a suitable to_number(to_char()) formatting change) from the test_sapfi_coicar_at5dat11 table.

Here’s the execution plan from 19c:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                            |      1 |        |   328 (100)|       |       |      1 |00:00:00.02 |     155 |     69 |       |       |          |
|   1 |  SORT AGGREGATE              |                            |      1 |      1 |            |       |       |      1 |00:00:00.02 |     155 |     69 |       |       |          |
|*  2 |   HASH JOIN RIGHT SEMI       |                            |      1 |    253 |   328   (1)|       |       |  15197 |00:00:00.02 |     155 |     69 |  2352K|  2352K| 2110K (0)|
|   3 |    PART JOIN FILTER CREATE   | :BF0000                    |      1 |    152 |    13   (0)|       |       |  15197 |00:00:00.01 |      25 |      0 |       |       |          |
|   4 |     PARTITION LIST SINGLE    |                            |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |      0 |       |       |          |
|   5 |      TABLE ACCESS FULL       | TEST_SAPFI_COICAR_AT5DAT11 |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |      0 |       |       |          |
|   6 |    PARTITION LIST JOIN-FILTER|                            |      1 |  15197 |   314   (1)|:BF0000|:BF0000|  15197 |00:00:00.01 |     130 |     69 |       |       |          |
|   7 |     TABLE ACCESS FULL        | TEST_DWF_SAPFI             |     60 |  15197 |   314   (1)|:BF0000|:BF0000|  15197 |00:00:00.01 |     130 |     69 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$5DA710D3")
      UNNEST(@"SEL$2")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$2")
      FULL(@"SEL$5DA710D3" "DWF"@"SEL$1")
      FULL(@"SEL$5DA710D3" "COI"@"SEL$2")
      LEADING(@"SEL$5DA710D3" "DWF"@"SEL$1" "COI"@"SEL$2")
      USE_HASH(@"SEL$5DA710D3" "COI"@"SEL$2")
      SWAP_JOIN_INPUTS(@"SEL$5DA710D3" "COI"@"SEL$2")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))


Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (N - Unresolved (1))
---------------------------------------------------------------------------
   7 -  SEL$5DA710D3 / DWF@SEL$1
         U -  use_hash(dwf)



You’ll notice there’s no “adaptive” information in the report, and there’s no “Note” section saying it’s an adaptive plan. You might also note that the plan looks as if it’s doing a hash join into “dwf” but the “Hint Report” tells us that the hint has not been used and the “Outline Information” tells us that the plan has actually arrived as the result of the combination /*+ use_hash(coi) swap_join_inputs(coi) */. In fact this is the default plan (on my system) that would have appeared in the complete absence of hints.

The result of the count(*) should be 15,197 – and you can see that this plan has produced the right answer when you check the A-Rows value for operation 2 (the hash join right semi that generates the rowsource for the sort aggregate).

The adaptive anomaly

So now we try again but with a hint to generate a nested loop join and it gives us the wrong result (8) and an oddity in the plan. I’ve reported the body of the plan twice, the first version includes the adaptive information the second is the tidier plan we get by omitting the ‘adaptive’ format option:

select 
        count(*) count_nl 
from 
        test_dwf_sapfi  dwf
where 
        exists (
                select
                        /*+
                                use_nl (coi)
                        */
                        1
                from    test_sapfi_coicar_at5dat11      coi
                where   coi.datumzprac = 20200414
                and     to_char(coi.datuct,'yyyymmdd') = dwf.datumucetnipom_code
        )
;

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   Id  | Operation                    | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT             |                            |      1 |        |   329 (100)|       |       |      1 |00:00:00.01 |     154 |       |       |          |
|     1 |  SORT AGGREGATE              |                            |      1 |      1 |            |       |       |      1 |00:00:00.01 |     154 |       |       |          |
|- *  2 |   HASH JOIN                  |                            |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.01 |     154 |  3667K|  1779K|          |
|     3 |    PART JOIN FILTER CREATE   | :BF0000                    |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.01 |     154 |       |       |          |
|     4 |     NESTED LOOPS             |                            |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.01 |     154 |       |       |          |
|-    5 |      STATISTICS COLLECTOR    |                            |      1 |        |            |       |       |     60 |00:00:00.01 |      25 |       |       |          |
|     6 |       SORT UNIQUE            |                            |      1 |    152 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|     7 |        PARTITION LIST SINGLE |                            |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|     8 |         TABLE ACCESS FULL    | TEST_SAPFI_COICAR_AT5DAT11 |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|     9 |      PARTITION LIST ITERATOR |                            |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.01 |     129 |       |       |          |
|  * 10 |       TABLE ACCESS FULL      | TEST_DWF_SAPFI             |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.01 |     129 |       |       |          |
|-   11 |    PARTITION LIST JOIN-FILTER|                            |      0 |  15197 |   314   (1)|:BF0000|:BF0000|      0 |00:00:00.01 |       0 |       |       |          |
|-   12 |     TABLE ACCESS FULL        | TEST_DWF_SAPFI             |      0 |  15197 |   314   (1)|:BF0000|:BF0000|      0 |00:00:00.01 |       0 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |      1 |        |   329 (100)|       |       |      1 |00:00:00.01 |     154 |       |       |          |
|   1 |  SORT AGGREGATE            |                            |      1 |      1 |            |       |       |      1 |00:00:00.01 |     154 |       |       |          |
|   2 |   PART JOIN FILTER CREATE  | :BF0000                    |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.01 |     154 |       |       |          |
|   3 |    NESTED LOOPS            |                            |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.01 |     154 |       |       |          |
|   4 |     SORT UNIQUE            |                            |      1 |    152 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|   5 |      PARTITION LIST SINGLE |                            |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   6 |       TABLE ACCESS FULL    | TEST_SAPFI_COICAR_AT5DAT11 |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   7 |     PARTITION LIST ITERATOR|                            |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.01 |     129 |       |       |          |
|*  8 |      TABLE ACCESS FULL     | TEST_DWF_SAPFI             |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.01 |     129 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$5DA710D3" "DWF"@"SEL$1")
      USE_NL(@"SEL$5DA710D3" "DWF"@"SEL$1")
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$5DA710D3")
      UNNEST(@"SEL$2")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$2")
      FULL(@"SEL$5DA710D3" "COI"@"SEL$2")
      LEADING(@"SEL$5DA710D3" "COI"@"SEL$2" "DWF"@"SEL$1")
      SEMI_TO_INNER(@"SEL$5DA710D3" "COI"@"SEL$2")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))
  10 - filter("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   8 -  SEL$5DA710D3 / COI@SEL$2
         U -  use_nl (coi)

Note
-----
   - this is an adaptive plan (rows marked '-' are inactive)

Points to note here:

  • The most important item to note is that at operation 3 (of the tidy plan) we can see that the nested loop reports A-Rows as 8 – the wrong result.
  • Then there’s the oddity that operation 2 is a “part join filter create”, which shouldn’t be there for a nested loop: that’s a hash join feature that allows the Pstart/Pstop columns to report partition pruning by Bloom filter (“:BFnnnn”), but we’re running a nested loop join, which can pass in the partition key, so we see KEY/KEY as the Pstart/Pstop.
  • The third thing we can pick up is that the 8 rows in our nested loop rowsource are echoed in the A-Rows for the 60 executions of the partition table scans of test_dwf_sapfi at operations 7 and 8 in the reduced plan – it’s probably not a complete coincidence that the nested loop join is passing the partition keys in partition key order (sort unique at operation 4) and there are 8 rows in the last populated partition of test_dwf_sapfi.
  • Finally we note from the Hint Report that the hint, as supplied, was not used, and the outline shows us that the path was actually “leading(coi dwf) use_nl(dwf)”.

The really fascinating thing about this execution plan is that it contains a hint that was not used – but the plan changed from the default plan to a slightly more expensive plan.

If at first you don’t succeed

There’s just one more surprise to reveal – we had an adaptive plan, which tends to mean the optimizer plays towards a nested loop join but hedges its bets so that it can swing to a hash join in mid-plan. This suggests that the real-time stats collector thought there wasn’t much data and a nested loop was good – but what happens when I run exactly the same query again? In my 12c system nothing changed, but in my 19c system a new plan appeared:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                            |      1 |        |   331 (100)|       |       |      1 |00:00:00.01 |     154 |       |       |          |
|   1 |  SORT AGGREGATE              |                            |      1 |      1 |            |       |       |      1 |00:00:00.01 |     154 |       |       |          |
|*  2 |   HASH JOIN                  |                            |      1 |    120K|   331   (2)|       |       |  15197 |00:00:00.01 |     154 |  2171K|  2171K| 1636K (0)|
|   3 |    PART JOIN FILTER CREATE   | :BF0000                    |      1 |  15197 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |       |       |          |
|   4 |     SORT UNIQUE              |                            |      1 |  15197 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|   5 |      PARTITION LIST SINGLE   |                            |      1 |  15197 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   6 |       TABLE ACCESS FULL      | TEST_SAPFI_COICAR_AT5DAT11 |      1 |  15197 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   7 |    PARTITION LIST JOIN-FILTER|                            |      1 |  15197 |   314   (1)|:BF0000|:BF0000|  15197 |00:00:00.01 |     129 |       |       |          |
|   8 |     TABLE ACCESS FULL        | TEST_DWF_SAPFI             |     60 |  15197 |   314   (1)|:BF0000|:BF0000|  15197 |00:00:00.01 |     129 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      DB_VERSION('19.1.0')
      ALL_ROWS
      OUTLINE_LEAF(@"SEL$5DA710D3")
      UNNEST(@"SEL$2")
      OUTLINE(@"SEL$1")
      OUTLINE(@"SEL$2")
      FULL(@"SEL$5DA710D3" "COI"@"SEL$2")
      FULL(@"SEL$5DA710D3" "DWF"@"SEL$1")
      LEADING(@"SEL$5DA710D3" "COI"@"SEL$2" "DWF"@"SEL$1")
      USE_HASH(@"SEL$5DA710D3" "DWF"@"SEL$1")
      SEMI_TO_INNER(@"SEL$5DA710D3" "COI"@"SEL$2")
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   6 -  SEL$5DA710D3 / COI@SEL$2
         U -  use_nl (coi)

Note
-----
   - statistics feedback used for this statement

This is the output with the ‘adaptive’ format in place – but the plan isn’t adaptive: the optimizer has used statistics feedback (formerly known as cardinality feedback) to work out a better plan. The hint is still unused, of course, but when we check the plan we can see that:

  • it has got the right answer – the hash join at operation 2 reports 15,197 rows
  • the “partition join” Bloom filter created at operation 3 has been used for the Pstart/Pstop at operations 7 and 8
  • even though the hint has not been used the plan is (again) not the same as the default plan – we’ve got a hash join with Bloom filter pruning, while the default plan had a hash join right semi after a sort unique of the test_sapfi_coicar_at5dat11 data, with an overall lower cost.
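
If you want to confirm that it really was statistics feedback that triggered the re-optimisation, a minimal check (12c onwards) is to query v$sql for the statement – the sql_id below is a placeholder you’d have to fill in, not a value from my test:

select
        child_number, is_reoptimizable, is_resolved_adaptive_plan
from
        v$sql
where
        sql_id = '&m_sql_id'
;

The first child should show is_reoptimizable = ‘Y’ (and a resolved adaptive plan); the second child – the statistics feedback plan – isn’t adaptive, so its is_resolved_adaptive_plan should be null.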

What Happened ?

Clearly there is a bug. It’s a slightly sensitive bug, and all I had to do to eliminate it was to gather stats on the underlying tables. (You’ll find in the table creation script at the end of this note that there are basically no object stats on the “big” partitioned table, which is presumably why the adaptive stuff came into play and allowed the bug to surface, and why 19c statistics feedback produced a new plan on the second execution.)

It may be rather difficult for an outsider to pin down what’s going wrong and bypass the bug. One of the first ideas that appeared on the forum was that the Bloom filter pruning was breaking something – but when I added the hint /*+ opt_param('_bloom_pruning_enabled','false') */ to the query all I got was basically the same nested loop plan without the Bloom filter creation, and still ended up with the wrong result.

Finally, here’s a plan I got when I hinted the query correctly to force the nested loop join with test_dwf_sapfi as the inner (second) table in the join (in other words I hinted the plan that had been giving me the wrong results):

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                            |      1 |        |   405 (100)|       |       |      1 |00:00:00.01 |     154 |       |       |          |
|   1 |  SORT AGGREGATE           |                            |      1 |      1 |            |       |       |      1 |00:00:00.01 |     154 |       |       |          |
|   2 |   NESTED LOOPS            |                            |      1 |  38491 |   405   (1)|       |       |  15197 |00:00:00.01 |     154 |       |       |          |
|   3 |    SORT UNIQUE            |                            |      1 |    152 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|   4 |     PARTITION LIST SINGLE |                            |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   5 |      TABLE ACCESS FULL    | TEST_SAPFI_COICAR_AT5DAT11 |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   6 |    PARTITION LIST ITERATOR|                            |     60 |    253 |     5   (0)|   KEY |   KEY |  15197 |00:00:00.01 |     129 |       |       |          |
|*  7 |     TABLE ACCESS FULL     | TEST_DWF_SAPFI             |     60 |    253 |     5   (0)|   KEY |   KEY |  15197 |00:00:00.01 |     129 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - filter("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   1 -  SEL$5DA710D3
           -  leading(@sel$5da710d3 coi@sel$2 dwf@sel$1)

   7 -  SEL$5DA710D3 / DWF@SEL$1
           -  use_nl(@sel$5da710d3 dwf@sel$1)

Compare this with the plan I got by using the wrong hint, resulting in the adaptive plan, but with Bloom filter pruning disabled:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name                       | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |                            |      1 |        |   329 (100)|       |       |      1 |00:00:00.05 |     154 |       |       |          |
|   1 |  SORT AGGREGATE           |                            |      1 |      1 |            |       |       |      1 |00:00:00.05 |     154 |       |       |          |
|   2 |   NESTED LOOPS            |                            |      1 |  38491 |   329   (1)|       |       |      8 |00:00:00.05 |     154 |       |       |          |
|   3 |    SORT UNIQUE            |                            |      1 |    152 |    13   (0)|       |       |     60 |00:00:00.01 |      25 |  4096 |  4096 | 4096  (0)|
|   4 |     PARTITION LIST SINGLE |                            |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   5 |      TABLE ACCESS FULL    | TEST_SAPFI_COICAR_AT5DAT11 |      1 |    152 |    13   (0)|     2 |     2 |  15197 |00:00:00.01 |      25 |       |       |          |
|   6 |    PARTITION LIST ITERATOR|                            |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.05 |     129 |       |       |          |
|*  7 |     TABLE ACCESS FULL     | TEST_DWF_SAPFI             |     60 |    253 |   314   (1)|   KEY |   KEY |      8 |00:00:00.05 |     129 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - filter("DWF"."DATUMUCETNIPOM_CODE"=TO_NUMBER(TO_CHAR(INTERNAL_FUNCTION("COI"."DATUCT"),'yyyymmdd')))

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2 (U - Unused (1))
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  opt_param('_bloom_pruning_enabled','false')

   5 -  SEL$5DA710D3 / COI@SEL$2
         U -  use_nl (coi)

It’s the same plan (with the same plan hash value, though I haven’t shown that) – it has the same predicates, and does the same amount of work. But when the optimizer gets to this plan through the adaptive pathway the run-time engine produces the wrong results (note A-Rows = 8 at operation 2), while if the plan is forced by a correct set of hints the run-time engine produces the right results.

As you might guess, another way to bypass the problem was to disable adaptive plans – but when I did that the only way to get the nested loop path was through correct hinting anyway.
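
For reference, here’s a sketch of the session-level switch I mean – optimizer_adaptive_plans is the documented parameter in 12.2 and later (in 12.1 the single parameter optimizer_adaptive_features covered the whole adaptive mechanism):

alter session set optimizer_adaptive_plans = false;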

Test it yourself

Here’s a script to create the test data:

rem
rem     Script:         bloom_bug_02.sql
rem     Author:         Michal Telensky / Jonathan Lewis
rem     Dated:          Feb 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
rem     See also:
rem     https://community.oracle.com/tech/developers/discussion/4480469/reproducible-testcase-for-wrong-results
rem     https://livesql.oracle.com/apex/livesql/s/jzc2uyw6ecf2z2ul35nyrxelv
rem

drop table test_dwf_sapfi;
drop table test_sapfi_coicar_at5dat11;
purge recyclebin;

--
-- Don't do this unless it's a private system
-- Many sites seem to have the defaults anyway
--

execute dbms_stats.delete_system_stats

create table test_sapfi_coicar_at5dat11(
        datuct date,
        datumzprac number(8,0)
 ) 
row store compress advanced 
partition by list (datumzprac) (
        partition p20000101 values (20000101)
)
;

alter table test_sapfi_coicar_at5dat11 add partition p20200414 values (20200414);

insert /*+ append */ into test_sapfi_coicar_at5dat11
select date'2019-11-20' datuct, 20200414 datumzprac from dual connect by level <   2 union all
select date'2019-12-20' datuct, 20200414 datumzprac from dual connect by level <   2 union all
select date'2019-12-29' datuct, 20200414 datumzprac from dual connect by level <   4 union all
select date'2020-01-01' datuct, 20200414 datumzprac from dual connect by level <  55 union all
select date'2020-01-08' datuct, 20200414 datumzprac from dual connect by level <   3 union all
select date'2020-01-13' datuct, 20200414 datumzprac from dual connect by level <   8 union all
select date'2020-01-14' datuct, 20200414 datumzprac from dual connect by level <  117 union all
select date'2020-01-15' datuct, 20200414 datumzprac from dual connect by level <  65 union all
select date'2020-01-30' datuct, 20200414 datumzprac from dual connect by level <   2 union all
select date'2020-01-31' datuct, 20200414 datumzprac from dual connect by level <  12 union all
select date'2020-02-01' datuct, 20200414 datumzprac from dual connect by level <  20 union all
select date'2020-02-05' datuct, 20200414 datumzprac from dual connect by level <   4 union all
select date'2020-02-10' datuct, 20200414 datumzprac from dual connect by level <   5 union all
select date'2020-02-12' datuct, 20200414 datumzprac from dual connect by level <   2 union all
select date'2020-02-17' datuct, 20200414 datumzprac from dual connect by level <   2 union all
select date'2020-02-21' datuct, 20200414 datumzprac from dual connect by level <   16 union all
select date'2020-02-29' datuct, 20200414 datumzprac from dual connect by level <   37 union all
select date'2020-03-01' datuct, 20200414 datumzprac from dual connect by level < 1851 union all
select date'2020-03-02' datuct, 20200414 datumzprac from dual connect by level <  227 union all
select date'2020-03-03' datuct, 20200414 datumzprac from dual connect by level <   75 union all
select date'2020-03-04' datuct, 20200414 datumzprac from dual connect by level <   19 union all
select date'2020-03-05' datuct, 20200414 datumzprac from dual connect by level <  107 union all
select date'2020-03-06' datuct, 20200414 datumzprac from dual connect by level <  163 union all
select date'2020-03-07' datuct, 20200414 datumzprac from dual connect by level <   72 union all
select date'2020-03-08' datuct, 20200414 datumzprac from dual connect by level <   78 union all
select date'2020-03-09' datuct, 20200414 datumzprac from dual connect by level <  187 union all
select date'2020-03-10' datuct, 20200414 datumzprac from dual connect by level <  124 union all
select date'2020-03-11' datuct, 20200414 datumzprac from dual connect by level <   92 union all
select date'2020-03-12' datuct, 20200414 datumzprac from dual connect by level <  137 union all
select date'2020-03-13' datuct, 20200414 datumzprac from dual connect by level <  397 union all
select date'2020-03-14' datuct, 20200414 datumzprac from dual connect by level <   52 union all
select date'2020-03-15' datuct, 20200414 datumzprac from dual connect by level <   16 union all
select date'2020-03-16' datuct, 20200414 datumzprac from dual connect by level <  622 union all
select date'2020-03-17' datuct, 20200414 datumzprac from dual connect by level <  215 union all
select date'2020-03-18' datuct, 20200414 datumzprac from dual connect by level <  299 union all
select date'2020-03-19' datuct, 20200414 datumzprac from dual connect by level <  265 union all
select date'2020-03-20' datuct, 20200414 datumzprac from dual connect by level <  627 union all
select date'2020-03-21' datuct, 20200414 datumzprac from dual connect by level <   52 union all
select date'2020-03-22' datuct, 20200414 datumzprac from dual connect by level <   60 union all
select date'2020-03-23' datuct, 20200414 datumzprac from dual connect by level <  168 union all
select date'2020-03-24' datuct, 20200414 datumzprac from dual connect by level <  255 union all
select date'2020-03-25' datuct, 20200414 datumzprac from dual connect by level <  185 union all
select date'2020-03-26' datuct, 20200414 datumzprac from dual connect by level <  240 union all
select date'2020-03-27' datuct, 20200414 datumzprac from dual connect by level <  663 union all
select date'2020-03-28' datuct, 20200414 datumzprac from dual connect by level <   88 union all
select date'2020-03-29' datuct, 20200414 datumzprac from dual connect by level <  771 union all
select date'2020-03-30' datuct, 20200414 datumzprac from dual connect by level <  328 union all
select date'2020-03-31' datuct, 20200414 datumzprac from dual connect by level < 1675 union all
select date'2020-04-01' datuct, 20200414 datumzprac from dual connect by level <  641 union all
select date'2020-04-02' datuct, 20200414 datumzprac from dual connect by level <  251 union all
select date'2020-04-03' datuct, 20200414 datumzprac from dual connect by level <   84 union all
select date'2020-04-06' datuct, 20200414 datumzprac from dual connect by level <  325 union all
select date'2020-04-07' datuct, 20200414 datumzprac from dual connect by level <  366 union all
select date'2020-04-08' datuct, 20200414 datumzprac from dual connect by level <  459 union all
select date'2020-04-09' datuct, 20200414 datumzprac from dual connect by level < 2470 union all
select date'2020-04-10' datuct, 20200414 datumzprac from dual connect by level <   16 union all
select date'2020-04-11' datuct, 20200414 datumzprac from dual connect by level <   16 union all
select date'2020-04-12' datuct, 20200414 datumzprac from dual connect by level <   24 union all
select date'2020-04-13' datuct, 20200414 datumzprac from dual connect by level <  130 union all
select date'2020-04-14' datuct, 20200414 datumzprac from dual connect by level <    9  -- > change this value and the final (wrong) result changes in synch
/

commit
/

--
-- There are no indexes, so this method_opt collects fewer stats than expected
-- No column stats on the partition(s), only partition row and block stats
-- It does get basic column stats at the table level.
--

declare
        schema_name varchar2(128);
begin
        select sys_context('userenv', 'current_schema') into schema_name from dual;

        dbms_stats.gather_table_stats(
                ownname          => schema_name,
                tabname          => 'test_sapfi_coicar_at5dat11',
                partname         => 'p20200414',
                estimate_percent => dbms_stats.auto_sample_size,
                method_opt       => 'for all indexed columns size auto'
        );
end;
/

create table test_dwf_sapfi (
        datumucetnipom_code number(8,0) not null enable
) 
row store compress advanced 
partition by list (datumucetnipom_code) (
        partition p20000101 values (20000101) 
)
/

begin
        for i in (
                select  distinct to_char(datuct, 'yyyymmdd') datumucetnipom_code 
                from    test_sapfi_coicar_at5dat11 
                order by 
                        1
        ) loop
                execute immediate 
                        'alter table test_dwf_sapfi add partition p' || 
                                i.datumucetnipom_code || 
                                ' values (' || i.datumucetnipom_code || ')'
                ;
        end loop;
end;
/


insert  /*+ append */ into test_dwf_sapfi 
select  to_number(to_char(datuct, 'yyyymmdd')) 
from    test_sapfi_coicar_at5dat11 
where   datumzprac = 20200414
;

commit;

--
--  The problems (seem to) go away if you collect stats
--

-- execute dbms_stats.gather_table_stats(user,'test_dwf_sapfi',granularity=>'global')


set serveroutput off
set linesize 255
set pagesize 60
set trimspool on

alter session set statistics_level='all';

prompt  ===================================
prompt  plan with incorrect use_hash() hint
prompt  ===================================

select 
        /*  use_hash(dwf) */ 
        count(*) count_hash 
from 
        test_dwf_sapfi  dwf
where
         exists (
                select  1 
                from    test_sapfi_coicar_at5dat11      coi
                where   coi.datumzprac = 20200414
                and     to_char(coi.datuct,'yyyymmdd') = dwf.datumucetnipom_code
        );

select * from table(dbms_xplan.display_cursor(format=>'cost outline allstats last partition hint_report adaptive'));

set serveroutput on
spool off

Update (Sept 2021)

The wrong results still appear in 21.3.0.0

January 20, 2021

CBO Example

Filed under: CBO,Execution plans,Oracle,Statistics — Jonathan Lewis @ 10:01 am GMT Jan 20,2021

A little case study based on an example just in on the Oracle-L list server. This was supplied with a complete, working, test case that was small enough to understand and explain very quickly.

The user created a table, and used calls to dbms_stats to fake some statistics into place. Here, with a little cosmetic editing, is the code they supplied.

set serveroutput off
set linesize 180
set pagesize 60
set trimspool on

drop table t1 purge;

create table t1 (id number(20), v varchar2(20 char));
create unique index pk_id on t1(id);
alter table t1 add (constraint pk_id primary key (id) using index pk_id enable validate);
exec dbms_stats.gather_table_stats(user, 't1');
 
declare
        srec               dbms_stats.statrec;
        numvals            dbms_stats.numarray;
        charvals           dbms_stats.chararray;
begin
  
        dbms_stats.set_table_stats(
                ownname => user, tabname => 't1', numrows => 45262481, numblks => 1938304, avgrlen => 206
        );

        numvals := dbms_stats.numarray (1, 45262481);
        srec.epc:=2;
        dbms_stats.prepare_column_values (srec, numvals);
        dbms_stats.set_column_stats (
                ownname => user, tabname => 't1', colname => 'id', 
                distcnt => 45262481, density => 1/45262481,
                nullcnt => 0, srec => srec, avgclen => 6
        );

        charvals := dbms_stats.chararray ('', '');
        srec.epc:=2;
        dbms_stats.prepare_column_values (srec, charvals);
        dbms_stats.set_column_stats(
                ownname => user, tabname => 't1', colname => 'v', 
                distcnt => 0,  density => 0, 
                nullcnt => 45262481, srec => srec, avgclen => 0
        );
        dbms_stats.set_index_stats( 
                ownname => user, indname =>'pk_id', numrows => 45607914, numlblks => 101513,
                numdist => 45607914, avglblk => 1, avgdblk => 1, clstfct => 33678879, indlevel => 2
        );
end;
/
 
variable n1 nvarchar2(32)
variable n2 number

begin
        :n1 := 'D';
        :n2 := 50;
end;
/
 

select 
        /*+ gather_plan_statistics */ 
        * 
from    ( 
        select  a.id col0,a.id col1
        from    t1 a
        where   a.v = :n1 
        and     a.id > 1
        order by 
                a.id 
        ) 
where 
        rownum <= :n2 
;
 
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost peeked_binds '));

From Oracle’s perspective the table has 45M rows, with a unique sequential key starting at 1 in the id column. The query looks like a pagination query, asking for 50 rows, ordered by id. But the in-line view asks for rows where id > 1 (which, initially, means all of them) and applies a filter on the v column.

Of course we know that v is always null, so in theory the predicate a.v = :n1 is always going to return false (or null, but not true) – so the query will never return any data. However, if you read the code carefully you’ll notice that the bind variable n1 has been declared as an nvarchar2() not a varchar2().

Here’s the execution plan I got on an instance running 19.3 – and it’s very similar to the plan supplied by the OP:

----------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |  3747 (100)|      0 |00:00:00.01 |
|*  1 |  COUNT STOPKEY                |       |      1 |        |            |      0 |00:00:00.01 |
|   2 |   VIEW                        |       |      1 |     50 |  3747   (1)|      0 |00:00:00.01 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| T1    |      1 |    452K|  3747   (1)|      0 |00:00:00.01 |
|*  4 |     INDEX RANGE SCAN          | PK_ID |      0 |   5000 |    14   (0)|      0 |00:00:00.01 |
----------------------------------------------------------------------------------------------------

Peeked Binds (identified by position):
--------------------------------------
   2 - :2 (NUMBER): 50

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=:N2)
   3 - filter(SYS_OP_C2C("A"."V")=:N1)
   4 - access("A"."ID">1)

The question we were asked was this: “Why does the optimizer estimate that it will return 5,000 entries from the index range scan at operation 4?”

The answer is the result of combining two observations.

First: In the Predicate Information you can see that Oracle has applied a character-set conversion to the original predicate “a.v = :n1” to produce filter(SYS_OP_C2C("A"."V")=:N1). The selectivity of “function of something = bind value” is one of those cases where Oracle uses one of its guesses, in this case 1%. Note that the E-Rows estimate for operation 3 (table access) is 452K, which is 1% of the 45M rows in the table.
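
If you want to see that conversion appear for yourself, here’s a minimal sketch (my query, not the OP’s) against the same t1 – comparing the varchar2 column with an nvarchar2 literal should be enough to make the SYS_OP_C2C() wrapper show up in the predicate section:

explain plan for
select  count(*)
from    t1
where   v = N'D'
;

select * from table(dbms_xplan.display(null, null, 'basic +predicate'));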

In real life if you had optimizer_dynamic_sampling set at level 3, or had added the hint /*+ dynamic_sampling(3) */ to the query, Oracle would sample some rows to avoid the need for guessing at this point.

Secondly: the optimizer has peeked the bind variable for the rownum predicate, so it is optimizing for 50 rows (basically doing the arithmetic of first_rows(50) optimisation). The optimizer “knows” that the filter predicate at the table will eliminate all but 1% of the rows acquired, and it “knows” that it has to do enough work to find 50 rows in total – so it can calculate that (statistically speaking) it has to walk through 5,000 (= 50 * 100) index entries to visit enough rows in the table to end up with 50 rows.
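
Putting the two observations together, the arithmetic behind the estimates is simply:

        table filter (1% guess):  45,262,481 * 0.01 ≈ 452,625    -- E-Rows 452K at operation 3
        rows required (peeked):   50
        index entries to walk:    50 / 0.01 = 5,000              -- E-Rows 5000 at operation 4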

Next Steps (left as exercise)

Once you’ve got the answer to the question “Why is this number 5,000?”, you might go back and point out that the estimate for the table access was roughly 90 times larger than the estimate for the number of rowids selected from the index and wonder how that could be possible. (Answer: that’s just one of the little defects in the code for first_rows(n).)

You might also wonder what would have happened in this model if the bind variable n1 had been declared as a varchar2() rather than an nvarchar2() – and that might have taken you on to ask yet another question about what the optimizer was playing at.

Once you’ve modelled something that is a little puzzle there’s always scope for pushing the model a little further and learning a little bit more before you file the model away for testing on the next version of Oracle.

November 11, 2020

Indexing partitions

Filed under: Indexing,Oracle,Partitioning,Statistics — Jonathan Lewis @ 11:35 am GMT Nov 11,2020

This is one of those notes on the “thinking about the data / indexes” theme that I like to write; it’s a draft I wrote over two and a half years ago that I’ve just rediscovered and polished slightly and refers to a “recent” question that came up on the ODC Forum. It’s not intended as “The Answer” to the question, it’s a just a list of ideas and observations you’d want to think about if you had to address the problem:

There is a table t1 that is range partitioned by date, each partition represents one day, and the partitioning column c3_dt is constrained to be “date-only” (i.e. the time component is always 00:00:00) so there’s only one value per partition for the column. The table is a transaction table, and currently a single partition holds about 200 million rows. [Ed: And there seem to be a little over 200 partitions, with some variations in rows per partition.]

There are two local indexes on the table: idx1 is a unique index defined on (c1, c2, c3_dt) and idx2 is non-unique, defined on (c1, c2).

Could we drop one of these two indexes, and if so which one ?

There are three immediate considerations here:

  • One index is a prefix (i.e. same columns in the same order) of the other so, from a purely mechanical perspective, we could drop the shorter index. We would, however, have to be a little cautious because the optimizer might not want to use the longer index in places where it would use the shorter because (a) the index would physically be larger and (b) in the general case the clustering_factor of the shorter index is likely to be smaller than the clustering_factor of a longer index with a matching prefix, and the estimated cost of using an index is based largely on the leaf_block count and the clustering_factor.
  • The c3_dt column adds no value in terms of precision for accessing the table; any predicate that allows the optimizer to apply partition elimination will have the same effect whether or not the column is part of the index. The only queries that could have better performance because of the extra column would be queries involving only the columns in the index – even then you’d have to ensure that at least one column in the index was declared not null (and that might have been the c3_dt column). On the down-side the date column is adding 8 bytes per row (7 data bytes, 1 length byte) to the size of the index. So it would be nicer to drop the longer index if possible. On the up-side, since the date column is always the same value within a partition its presence at the end of the index should mean that the clustering_factor of idx1 should match that of idx2 (with, perhaps, some small variation due to random sampling).
  • You can’t drop the unique index unless it has been created “by accident”. It seems reasonably likely that the index is supporting a unique (or primary key) constraint. For a local index to be unique (or, more accurately, for a unique/primary key constraint to be supported by a local index) the constraint/index has to contain the partition key, and c3_dt is the partitioning column.

Bottom line, then, is that you (probably) can’t drop idx1 because you need the uniqueness that comes with the c3_dt column but it should be possible (and would be sensible and probably fairly safe) to drop idx2 – the two-column index. Of course it would be sensible to make the index invisible for a while to see if any plans that were using idx2 refused to use idx1 instead; and it’s probably a good idea to create (or be prepared to create) column group statistics on the (c1, c2) combination so that the optimizer doesn’t lose the information about the number of distinct combinations.
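
As a sketch of that “safety first” approach – the index, table and column names below are the OP’s, so treat every identifier as an assumption:

alter index idx2 invisible;

select dbms_stats.create_extended_stats(user, 't1', '(c1, c2)') from dual;

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 't1',
                method_opt => 'for columns (c1, c2) size 1'
        );
end;
/

-- only after a decent period of monitoring:
-- drop index idx2;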

It’s a pity that the index that has to be kept is the one that has a “useless” column at the end, adding 8 bytes per row (totalling roughly 1.6GB per partition) to the size of the index – however the fact that the column holds the same value for every row in the partition offers an interesting possibility. If you re-arrange the column order to (c3_dt, c1, c2) and compress the index on the first column, would all the queries that currently use the index still be happy to use it with an “index skip scan”? Before pursuing this idea, of course, you would want to check that the leading c1 column wasn’t there to protect a foreign key constraint from the “foreign key locking” problem.
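
A hypothetical version of that rearranged index might look like the following – whether the existing queries would then use it via an index skip scan is exactly what you’d have to test (and, since c1 also repeats many times within a partition, compress 2 may be worth testing as well as compress 1):

create unique index t1_pk_new on t1(c3_dt, c1, c2)
local
compress 1
;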

The OP supplied some statistics about the three columns, which prompt a few more thoughts:

 
Partition level stats:- 
column_name data_type num_distinct                density
----------- --------- ------------   --------------------
C1          VARCHAR2          8754    0.00127388535031847 
c2          NUMBER         1398016    7.1529939571507E-7 
C3_dt       DATE                 1    2.43469886548297E-9 

Global statistics:-  
column_name data_type num_distinct                density 
----------- --------- ------------   --------------------
c1          VARCHAR2       1597649    6.25919710775020E-7 
c2          NUMBER         1996800    5.00801282051282E-7 
C3_dt       DATE               211    0.004739336492891 

Looking at these numbers we can make the following observations:

  • C3_DT: global num_distinct = 211 suggests 211 partitions and, at the global level, we see 1/num_distinct = density so there’s no global histogram. But at the partition level we see density = 2.4e-9, which suggests there’s a (silly) histogram and there are about 205 million rows in the partition (reverse-engineering the density = 1/(2*num_rows) formula for “non-existent values”; the arithmetic is sketched after this list). We can’t be sure without doing some experimentation what effect this inconsistency might have on the optimizer’s choice of access path when we have to handle range-based (cross-partition) queries.
  • C1: num_distinct = 8754 for the current partition and 1,597,649 for the table, and the global figure happens to be fairly close to 1,847,094 (= 211 * 8754): so perhaps there’s very little (or NO) overlap in c1 values between partitions – and that’s probably going to mess up the optimizer’s estimates for queries that cross partitions anyway. It’s also an indication that whichever index we keep could be compressed on the first column, and if we move to an index (c3_dt, c1, c2) the index could be compressed on the first two columns.
  • C2: has similar values for num_distinct at both the partition and the global level, which is nice – any anomalies introduced by variation between partitions are likely to be relatively small and may be nearly harmless. I often say that the optimizer likes boring data and can do a good job with it – c2 looks like it might be a nice boring column; though it’s possible that it’s hiding a nasty secret in the low / high values that we haven’t been told.
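
For the record, here’s the arithmetic behind the first two bullets:

        C3_dt (partition level): num_rows ≈ 1 / (2 * density) = 1 / (2 * 2.43469886548297e-9) ≈ 205,365,000
        C1 (global level):       211 partitions * 8,754 = 1,847,094, compared to the reported 1,597,649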

As I said at the outset, I’m not trying to answer the question “which index should I drop?”, I’m supplying some more questions that need answers before the original question can be answered sensibly. Along the way I’ve also demonstrated the way in which we can extract further information by looking at a few numbers and doing a little arithmetic.

I’d also like to point out that it’s a draft from 30 months ago, and I haven’t spent a lot of time trying to complete it and polish it, so E. and O.E. – but do feel free to chip in with other points that you think should be considered when making this decision.

 

October 21, 2020

Column Groups

Filed under: Column groups,extended stats,Oracle,Statistics — Jonathan Lewis @ 12:14 pm BST Oct 21,2020

Here’s an odd little detail about the statistics of column groups. At first glance it’s counter-intuitive but it’s actually an “obvious” (once you’ve thought about it for a bit) consequence of the approximate_ndv() algorithm for gathering stats.

I’ll present it as a question:

I have a table with two columns: flag and v1. Although the columns are not declared as non-null neither holds any nulls. If there are 26 distinct values for flag, and 1,000,000 distinct values for v1, what’s the smallest number of distinct values I should see if I create the column group (flag, v1) ?

The question is, of course, a little ambiguous – there’s the number of distinct values that the column (group) holds and the number that a fresh gather of statistics reports it as holding. Here are the stats from a test run of a simple script that creates, populates and gathers stats on my table:

select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

COLUMN_NAME                      NUM_DISTINCT
-------------------------------- ------------
FLAG                                       26
ID                                    1000000
V1                                     999040
SYS_STUQ#TO6BT1REX3P1BKO0ULVR9         989120

There are actually 1,000,000 distinct values for v1 (it’s a varchar2() representation of the id column), but the approximate_ndv() mechanism can have an error of (I believe) up to roughly 1.3%, so Oracle’s estimate here is a little bit off.

The column group (represented by the internal column definition SYS_STUQ#TO6BT1REX3P1BKO0ULVR9) must hold (at least) 1,000,000 distinct values – but the error in this case is a little larger than the error in v1, with the effect that the number of combinations appears to be less than the number of distinct values for v1!
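
To put numbers on those errors:

        v1:            (1,000,000 - 999,040) / 1,000,000 = 0.096%
        column group:  (1,000,000 - 989,120) / 1,000,000 = 1.088%

Both are comfortably inside the roughly 1.3% bound mentioned above – they just happen to fall short of the true value by different amounts.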

There’s not much difference in this case between actual and estimate, but the test demonstrates the potential for a significant difference between the estimate and the arithmetic that Oracle would do if the column group didn’t exist. Nominally the optimizer would assume there were 26 million distinct values (though in this case I had only created 1M rows in the table, so the optimizer would sanity-check that 26M down to 1M).

So, although the difference between actual and estimate is small, we have to ask the question – are there any cases where the optimizer will ignore the column group stats because of a sanity check that “proves” the estimate is “wrong” – after all it must be wrong if the num_distinct is less than the num_distinct of one of the components. Then again maybe there’s a sanity check that only ignores the column group if the estimate is “wrong enough”, but allows for small variations.

I mention this only because an odd optimizer estimate has shown up recently on the Oracle-L mailing list, and the only significant difference I can see (at present) is that a bad plan appears for a partition where this column group anomaly shows up in the stats, but a good plan appears when the column group anomaly isn’t present.

Footnote:

If you want to recreate the results above, here’s the model I’ve used (tested on 19.3.0.0 and 11.2.0.4):

rem
rem     Script:         column_group_stats_5.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem             11.2.0.4
rem 

execute dbms_random.seed(0)

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        chr(65 + mod(rownum,26))        flag,
        rownum                          id,
        lpad(rownum,10,'0')             v1
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6   -- > comment to avoid WordPress format issue
order by
        dbms_random.value
/


select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns(v1, flag) size 1'
        );
end;
/
 
select  column_name, num_distinct 
from    user_tab_cols 
where   table_name = 'T1'
/

Footnote 2:

As an interesting little statistical quirk, if I defined the column group as (flag, v1) rather than (v1, flag) the estimate for the column group num_distinct was 1,000,000.

October 6, 2020

Index FFS Cost 2

Filed under: Bugs,CBO,Indexing,Oracle,Statistics — Jonathan Lewis @ 12:58 pm BST Oct 6,2020

Here’s a little puzzle, highlighting a “bug that’s not a bug” that was “fixed but not fixed” some time in the 10.2 timeline. (If you want specifics about exactly when the fix became available and what patches might be available they’re in MOS – Bug 5099019 : DBMS_STATS DOESN’T COUNT LEAF_BLOCKS CORRECTLY.)

Running 19.3.0.0, with the system statistics as shown:


begin
        dbms_stats.set_system_stats('MBRC',16);
        dbms_stats.set_system_stats('MREADTIM',10);
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',500);
end;
/

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly 

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off

For my test run there are no uncommitted transactions, the gathering of table stats with cascade to indexes means all the blocks are clean so there’s no undo activity needed to produce an answer for the final query, and all that the query is going to do is run an index fast full scan to count the number of rows in the table because there’s an index on just (small_vc).

The system stats tell us that Oracle is going to cost the index fast full scan by taking the leaf block count, dividing by 16 (the MBRC) and multiplying by 2 ( mreadtim / sreadtim) and adding a bit for CPU usage. So here are the results from running the query and generating the plan with autotrace:


PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                        153

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |    22   (5)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|    22   (5)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        425  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Points to notice:

  • The cost is (as predicted): ceiling(153 leaf_blocks / 16 MBRC) * 2 + a bit = 20 + a bit (the arithmetic is spelled out after this list).
  • The number of physical blocks read is 1522, not the 153 leaf blocks reported in index stats.
  • No recursive SQL – and I did say that there were no undo / read consistency issues to worry about.
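
In full, the cost arithmetic is:

        ceil(153 / 16)      = 10  multiblock reads
        10 * (10 / 5)       = 20  (mreadtim / sreadtim converts them to equivalent single block reads)
        20 + CPU component  = 22  (the plan reports cost 22, of which 5% is CPU)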

There is clearly an inconsistency between the size of the index and the (100% gathered) leaf_block count that the optimizer is using, and this is a point I made many years ago in “Cost Based Oracle – Fundamentals”. To gather the leaf block count Oracle has looked at the number of leaf blocks that, after cleanout, hold any index entries – and here’s how I prepared this demo to confuse the issue:


rem
rem     Script:         index_ffs_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2012
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        rownum                  id,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid wordpress format issue
;

create index t1_v1 on t1(small_vc) pctfree 80;

delete
        from t1
where
        id between 5000 and 95000
;

commit;

When I gathered stats on the index I had just deleted 90% of the data from the table – so 90% of the leaf blocks in the index were empty but still in the index structure, and only 10% of the leaf blocks held any current index entries. This is why Oracle reported leaf_blocks = 153 while the index fast full scan had to read through nearly 1530 blocks.

This is one of those contradictory problems – for an index fast full scan you need to know the size of the segment that will be scanned because that’s the number of blocks you will have to examine; but in most cases the number of populated index leaf blocks is the number you need to know about when you’re trying to cost an index range scan. Of course in most cases the nature of Oracle’s implementation of B-tree indexes will mean that the two counts will be within one or two percent of each other. But there are a few extreme cases where you could get an index into a state where the segment size is large and the data set is small and you don’t want the optimizer to think that an index fast full scan will be low-cost.

Oracle produced a mechanism for getting the segment block count captured as the leaf_blocks statistic late in 10.2, but it’s not implemented by default, and it’s not something you can tweak into a query with the opt_param() hint. Fix control 5099019 has the description: “set leaf blocks to the number of blocks in the index extent map”, but it’s not a run-time fix, it’s a fix that has to be in place when you gather the index stats – thus:

alter session set "_fix_control"='5099019:1';

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly 

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off

===================================================

Session altered.

PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                       1555

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   201   (3)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|   201   (3)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        439  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Session altered.

As you can see Oracle has now set leaf_blocks = 1555. This is actually the number of blocks below the highwater mark in the segment. A further check with the dbms_space and dbms_space_usage packages showed that the number of blocks in the index structure was actually 1,523 with a further 32 blocks that we could infer were space management blocks. Of the 1,523 blocks in the index structure 157 were reported as “FULL” while 1,364 were reported as FS2, which possibly ought to mean “available for re-use” (though still in the structure), although this didn’t quite seem to be the case a few years ago.
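
If you want to reproduce that check, here’s a minimal sketch using dbms_space.space_usage() (run with serveroutput on; the segment name is the index from the script above):

declare
        l_unf   number; l_unf_b  number;
        l_fs1   number; l_fs1_b  number;
        l_fs2   number; l_fs2_b  number;
        l_fs3   number; l_fs3_b  number;
        l_fs4   number; l_fs4_b  number;
        l_full  number; l_full_b number;
begin
        dbms_space.space_usage(
                segment_owner      => user,
                segment_name       => 'T1_V1',
                segment_type       => 'INDEX',
                unformatted_blocks => l_unf,  unformatted_bytes => l_unf_b,
                fs1_blocks         => l_fs1,  fs1_bytes         => l_fs1_b,
                fs2_blocks         => l_fs2,  fs2_bytes         => l_fs2_b,
                fs3_blocks         => l_fs3,  fs3_bytes         => l_fs3_b,
                fs4_blocks         => l_fs4,  fs4_bytes         => l_fs4_b,
                full_blocks        => l_full, full_bytes        => l_full_b
        );
        dbms_output.put_line('FULL blocks: ' || l_full || '   FS2 blocks: ' || l_fs2);
end;
/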

Although Oracle has supplied a fix to a problem I highlighted in CBO-F, I can understand why it’s not enabled by default, and I don’t think I’d want to take advantage of it in a production system given the way it’s a session setting at stats gathering time. On the few occasions it’s likely to matter I’d probably add hints to the SQL to stop the optimizer from using the index incorrectly, or do something at a stats-gathering moment to call dbms_stats.set_index_stats().
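
The call I have in mind would be a sketch like this – the 1555 comes from the fix-controlled gather above, and you’d have to repeat the call after any subsequent gather that overwrote it:

begin
        dbms_stats.set_index_stats(
                ownname  => user,
                indname  => 'T1_V1',
                numlblks => 1555
        );
end;
/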

 

July 9, 2020

Execution Plans

Filed under: Execution plans,extended stats,Histograms,Oracle,Performance,Problem Solving,Statistics,Troubleshooting — Jonathan Lewis @ 4:54 pm BST Jul 9,2020

This is an example from the Oracle Developer Community of using the output of SQL Monitor to detect a problem with object statistics that resulted in an extremely poor choice of execution plan.

A short time after posting the original statement of the problem the OP identified where he thought the problem was and the general principle of why he thought he had a problem – so I didn’t have to read the entire execution plan to work out a strategy that would be (at least) a step in the right direction of solving the performance problem.

This note, then, is just a summary of the five minutes I spent confirming the OP’s hypothesis and explaining how to work around the problem he had identified. It does, however, give a little lead-in to the comments I made to the OP in order to give a more rounded picture of what his execution plan was telling us.

So here’s the top half of the plan (which is the first subquery of a “concatenation”) with the first few predicates:

===============================================================================================================================================================================================================================  
| Id |                 Operation                  |            Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Cell   |  Mem  | Activity |             Activity Detail             |  
|    |                                            |                            | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload | (Max) |   (%)    |               (# samples)               |  
===============================================================================================================================================================================================================================  
|  0 | SELECT STATEMENT                           |                            |         |      |     12086 |     +5 |     1 |     864K |      |       |         |       |     0.01 | Cpu (1)                                 |  
|  1 |   CONCATENATION                            |                            |         |      |     12086 |     +5 |     1 |     864K |      |       |         |       |          |                                         |  
|  2 |    FILTER                                  |                            |         |      |     12191 |     +4 |     1 |     864K |      |       |         |       |     0.03 | Cpu (4)                                 |  
|  3 |     FILTER                                 |                            |         |      |     12191 |     +4 |     1 |      26M |      |       |         |       |     0.01 | Cpu (1)                                 |  
|  4 |      NESTED LOOPS                          |                            |     241 | 251K |     12191 |     +4 |     1 |      26M |      |       |         |       |     0.02 | Cpu (3)                                 |  
|  5 |       NESTED LOOPS                         |                            |     241 | 251K |     12191 |     +4 |     1 |      26M |      |       |         |       |     0.07 | Cpu (8)                                 |  
|  6 |        NESTED LOOPS                        |                            |     241 | 251K |     12232 |     +4 |     1 |      26M |      |       |         |       |     0.05 | Cpu (6)                                 |  
|  7 |         NESTED LOOPS                       |                            |    5407 | 233K |     12242 |     +4 |     1 |      86M |      |       |         |       |          |                                         |  
|  8 |          MERGE JOIN CARTESIAN              |                            |       1 |   35 |     12242 |     +4 |     1 |     1000 |      |       |         |       |          |                                         |  
|  9 |           TABLE ACCESS BY INDEX ROWID      | REF1                       |       1 |    3 |         1 |     +4 |     1 |        1 |      |       |         |       |          |                                         |  
| 10 |            INDEX RANGE SCAN                | REF1_PK                    |       1 |    2 |     12242 |     +4 |     1 |        1 |      |       |         |       |          |                                         |  
| 11 |           BUFFER SORT                      |                            |      84 |   32 |     12242 |     +4 |     1 |     1000 |      |       |         |  104K |          |                                         |  
| 12 |            TABLE ACCESS BY INDEX ROWID     | STAGE                      |      84 |   32 |         1 |     +4 |     1 |     1000 |      |       |         |       |          |                                         |  
| 13 |             INDEX RANGE SCAN               | STAGE_IDX1                 |      84 |    4 |         1 |     +4 |     1 |     1000 |      |       |         |       |          |                                         |  
| 14 |          PARTITION RANGE ITERATOR          |                            |    8292 | 232K |     12232 |     +4 |  1000 |      86M |      |       |         |       |          |                                         |  
| 15 |           TABLE ACCESS STORAGE FULL        | TAB1                       |    8292 | 232K |     12245 |     +1 |  1000 |      86M | 103M | 521GB |   1.96% |    7M |    51.81 | gc buffer busy acquire (1)              |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | latch: cache buffers chains (1)         |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | Cpu (1196)                              |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | gcs drm freeze in enter server mode (2) |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | reliable message (5)                    |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | cell single block physical read (2827)  |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | cell smart table scan (1977)            |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | read by other session (304)             |  
| 16 |         PARTITION RANGE ITERATOR           |                            |       1 |   12 |     12191 |     +4 |   86M |      26M |      |       |         |       |     0.42 | Cpu (51)                                |  
| 17 |          TABLE ACCESS BY LOCAL INDEX ROWID | TAB2                       |       1 |   12 |     12191 |     +4 |   86M |      26M |   4M |  28GB |         |       |    32.14 | gc cr grant 2-way (20)                  |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | gc cr request (2)                       |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | gc remaster (6)                         |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | Cpu (319)                               |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | gcs drm freeze in enter server mode (4) |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | latch: gc element (2)                   |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | cell single block physical read (3563)  |  
| 18 |           INDEX RANGE SCAN                 | TAB2_IX1                   |     166 |    3 |     12210 |     +2 |   86M |      26M |   1M |  11GB |         |       |    15.17 | Cpu (292)                               |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | cell single block physical read (1557)  |  
| 19 |        INDEX UNIQUE SCAN                   | MTD_PK                     |       1 |    1 |     12242 |     +4 |   26M |      26M |  292 |   2MB |         |       |     0.17 | Cpu (20)                                |  
|    |                                            |                            |         |      |           |        |       |          |      |       |         |       |          | cell single block physical read (1)     |  
| 20 |       TABLE ACCESS BY INDEX ROWID          | REF2                       |       1 |    2 |     12191 |     +4 |   26M |      26M |    7 | 57344 |         |       |     0.11 | Cpu (13)                                |  
| 21 |      TABLE ACCESS BY INDEX ROWID           | CONTROLTAB                 |       1 |    1 |         1 |     +4 |     1 |        1 |      |       |         |       |          |                                         |  
| 22 |       INDEX UNIQUE SCAN                    | CONTROLTAB_PK              |       1 |      |         1 |     +4 |     1 |        1 |      |       |         |       |          |                                         |  
| 23 |     MINUS                                  |                            |         |      |       102 |     +4 |    25 |        3 |      |       |         |       |          |                                         |  
| 24 |      TABLE ACCESS BY INDEX ROWID           | CUST_ORG_PK                |       1 |    3 |       942 |     +4 |    25 |       10 |      |       |         |       |          |                                         |  
| 25 |       INDEX UNIQUE SCAN                    | MC_PK                      |       1 |    2 |       942 |     +4 |    25 |       25 |      |       |         |       |          |                                         |  
| 26 |      SORT UNIQUE NOSORT                    |                            |       1 |    4 |         8 |     +4 |    25 |        9 |      |       |         |       |          |                                         |  
| 27 |       TABLE ACCESS BY INDEX ROWID          | REF1                       |       1 |    3 |         8 |     +4 |    25 |        9 |      |       |         |       |          |                                         |  
| 28 |        INDEX RANGE SCAN                    | REF1_PK                    |       1 |    2 |         8 |     +4 |    25 |        9 |      |       |         |       |          |                                         |  


Predicate Information (identified by operation id):  
---------------------------------------------------  
   2 - filter( EXISTS ( (SELECT /*+ INDEX_RS_ASC ("CUST_ORG_PK" "MC_PK") */ "CUST_ID" FROM "USER1"."CUST_ORG_PK"  "CUST_ORG_PK" 
               WHERE "CUST_ID"=:B1 AND "CUST_ORG_PK"."INDICATR"='Y') MINUS (SELECT /*+ INDEX_RS_ASC ("REF1" "REF1_PK") */ 
               TO_NUMBER("VAL") FROM "USER1"."REF1" "REF1" WHERE "PUSER"='ZZZ' AND "EDATE">=TRUNC(SYSDATE@!) AND TO_NUMBER("VAL")=:B2  
               AND "SDATE"<=TRUNC(SYSDATE@!))))  
   3 - filter( EXISTS (SELECT /*+ INDEX_RS_ASC ("CONTROLTAB" "CONTROLTAB_PK") */ 0 FROM  "USER2"."CONTROLTAB" "CONTROLTAB" WHERE
              "CONTROLTAB"."CNTRLID"=9999 AND  NVL("CONTROLTAB"."STATUS",'F')='S'))  
   9 - filter("REF1"."EDATE">=TRUNC(SYSDATE@!))  
  10 - access("REF1"."PUSER"='XXX' AND  "REF1"."NAME"='CODE' AND "REF1"."SDATE"<=TRUNC(SYSDATE@!))  
  13 - access("STAGE"."NAME"='XXX' AND "STAGE"."STAT"='I')  

The first thing to note is that the “Time (active)” shown at the top line is about 12,000 seconds – so it’s a long running query. Your next observation – before you look at the shape of the plan – might be to note that operations 15, 17 and 18 between them record thousands of seconds of time, mostly I/O time but including 1,200 seconds of CPU time. This might draw your eye to the part of the plan that tells you what you are doing at these operations and why you are doing it.

Looking at the detail – operation 15 is a full tablescan that is the child of a partition range iterator (operation 14), and that iterator is the second child of a nested loop join (operation 7). Why is the optimizer so out of balance that it thinks a tablescan of (possibly) multiple partitions of a partitioned table is a good candidate for the second child of a nested loop?! The answer comes from the first child – the Merge Join Cartesian at operation 8 has been given a cardinality estimate of just one row. When the cardinality estimate is one for the first table in a join it doesn’t matter whether Oracle uses a nested loop join or a hash join: whatever happens next is only supposed to happen once.

Unfortunately when we compare “Rows (Estim)” and “Rows (Actual)” for the operation we see that the Merge Join Cartesian produced 1,000 rows – so the partition tablescan was going to happen 1,000 times (which we can confirm from the “Execs” column of operation 14). As a first step, then, we need to ensure that the optimizer gets a better estimate of the number of rows produced by the Cartesian join. So let’s look at its child operations.

  • Operation 9 (Table Access of REF1) is predicted to return one row – with “Rows (actual)” = 1.
  • Operation 11 (Buffer Sort of data from STAGE) is predicted to return 84 rows – with “Rows (actual)” = 1,000

Since the number of rows produced by a Cartesian join should be the product of the number of rows of the two inputs this tells us that the optimizer’s estimate of the number of rows from REF1 has been rounded up to 1 from a very small fraction (less than 1/84). If we can correct this then we may get Oracle to change the awful nested loop to an acceptable hash join. Even if we can’t fix this mis-estimate we may be able to do something that improves the estimate for STAGE to something sufficiently large that it will trigger the switch to a hash join. So let’s look at the predicates for these two tables.

REF1 predicates

   9 - filter("REF1"."EDATE">=TRUNC(SYSDATE@!))  
  10 - access("REF1"."PUSER"='XXX' AND  "REF1"."NAME"='CODE' AND "REF1"."SDATE"<=TRUNC(SYSDATE@!))  

The index range scan is based on an access predicate (with no filter predicate), so it looks like there’s a nearly perfect three-column index on REF1, but the optimizer can’t use the number of distinct keys in the index to get a good estimate of cardinality because one of the predicates is range-based. So the arithmetic will look at the three predicates separately and multiply up their selectivities. (It’s possible, of course, that this might be the first three columns of a 4, or more, column index.)

It’s a reasonable guess that the number of distinct combinations of (puser, name) will be much smaller than num_distinct(puser) * num_distinct(name) – so one strategy that might help increase the table’s cardinality estimate is to create extended statistics on the column group (puser, name).

Another reasonable guess is that the number of distinct values for the two columns is (relatively) small, with some skew to the distribution (name = ‘CODE’ looks particularly susceptible to being a commonly occurring value) – so perhaps we need a histogram on one or both of the columns (which would then require a histogram to be created on the column group as well if we wanted the optimizer to use the column group). We’d also have to make sure that the queried values didn’t fall outside the known low/high values for the columns if we wanted the column group to be used.

STAGE Predicates

13 - access("STAGE"."NAME"='XXX' AND "STAGE"."STAT"='I')

This is the access(-only) predicate for the index stage_idx1, and there are no filter predicates when we reach the table. So stage_idx1 might be a two-column index on the table that we are using completely, or it may be an index with more columns that we are using only incompletely. We can see that the cardinality estimate is out by a factor of 12 (84 predicted, 1,000 actual) so if this is the complete index (which would allow Oracle to use the distinct_keys value to estimate cardinality) there must be an uneven data distribution in the values; but if this is just the first two columns of a longer index then we might benefit from extended stats (viz: another column group) on this pair of columns.

Again, even if we create a column group, or take automatic advantage of the distinct_keys figure, the predicate STAT=’I’ (is that state, or status?) looks a little dangerous – status columns tend to have a small number of distinct values with a significant skew to the distribution of values – so we may want to create a histogram on the STAT column, which would then require us to create a histogram on the column group if we also wanted the column group stats to have an effect.
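As a rough sketch of both suggestions combined (schema omitted, and using the maximum histogram size – which is what the OP eventually chose), the calls might look something like this; note that a gather with a column-group expression in method_opt will create the extension if it doesn’t already exist:

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'REF1',
                method_opt => 'for columns (puser, name) size 254'
        );
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'STAGE',
                method_opt => 'for columns (name, stat) size 254'
        );
end;
/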

What Happened Next?

I made the suggestions about column groups and histograms to the OP – without the level of detail that I’ve given in the explanations above – and got the following response:

You are spot on.

There does exist a frequency histogram on column NAME (having 14 distinct values) and STAT (having 7 distinct values) of table STAGE. Also there already exists a frequency histogram on column PUSER and a height-balanced histogram on column NAME of table REF1. But still they were not helping the existing query.

By creating a column group both on ref1 (puser, name) and stage (name, stat) with histograms for both (i.e. ‘FOR COLUMNS SIZE 254’), the merge Cartesian was removed automatically from the path by the optimizer and it’s doing a hash join now with TAB1, finishing the query in quick time.

Summary

When the cardinality (rows) estimate for an operation drops to one (which might mean much less than one and rounded up) then all hell can break loose and the choice of parent operation – and its cardinality estimate – might be insanely stupid, leading to a catastrophically poor execution plan.

Very low cardinality estimates are often the result of multiplying individual column selectivities to produce an estimated selectivity that is unrealistic (much too small) when compared with the actual number of distinct combinations that exist in the table. In such cases creating a column group (possibly with a histogram) may be all you need to do to get a better cardinality estimate and a vastly improved execution plan.

 

April 10, 2020

raw timestamp

Filed under: Histograms,Oracle,Troubleshooting — Jonathan Lewis @ 6:59 pm BST Apr 10,2020

Many years ago I wrote a note with some comments about converting the low_value and high_value columns of user_tab_columns to see what values they actually held. At the time the dbms_stats package held a few procedures to convert raw values, and it was necessary to wrap these procedures in user-defined functions if you wanted a function to do the job.

In recent versions of Oracle (18c onwards) the package has finally acquired several functions to do a proper job. These are:

  • convert_raw_to_bin_double()
  • convert_raw_to_bin_float()
  • convert_raw_to_date()
  • convert_raw_to_number()
  • convert_raw_to_nvarchar()
  • convert_raw_to_rowid()
  • convert_raw_to_varchar()

You might note that these functions do not appear in the PL/SQL Packages and Types Reference, but they are documented in the script $ORACLE_HOME/rdbms/admin/dbmsstat.sql that creates the package:


-- Convert the internal representation of a minimum or maximum value
-- into a datatype-specific value. The minval and maxval fields
-- of the StatRec structure as filled in by get_column_stats or
-- prepare_column_values are appropriate values for input.

One thing you’ll notice about the list is that there’s no convert_raw_to_timestamp(), and a question came up recently on the oracle-l list server asking how to do this. This note answers that question and, in passing, demonstrates the typical use of the other functions.
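As a quick illustration of the typical usage pattern before we get to the timestamp problem, here’s how you might decode the limits of the numeric columns of a table (the table name here is just a placeholder):

select
        column_name,
        dbms_stats.convert_raw_to_number(low_value)     low_num,
        dbms_stats.convert_raw_to_number(high_value)    high_num
from
        user_tab_cols
where
        table_name = 'T1'               -- placeholder table name
and     data_type  = 'NUMBER'
order by
        column_name
/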

As ever we start by creating some data:


rem
rem     Script:         raw_to_timestamp.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem

create table t1(
        t0 timestamp(0),
        t3 timestamp(3),
        t6 timestamp(6),
        t9 timestamp(9),
        ts timestamp
);

insert into t1
with t_tab as (
        select to_timestamp('01-Apr-2020 21:15:10.123456789','dd-mon-yyyy hh24:mi:ss.ff9') ts
        from dual
)
select  t_tab.ts, t_tab.ts, t_tab.ts, t_tab.ts, t_tab.ts
from    t_tab
;

insert into t1
with t_tab as (
        select to_timestamp('01-Apr-2020 21:15:10.987654321','dd-mon-yyyy hh24:mi:ss.ff9') ts
        from dual
)
select  t_tab.ts, t_tab.ts, t_tab.ts, t_tab.ts, t_tab.ts
from    t_tab
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 2'
        );
end;
/

I’ve created a table to hold timestamps constrained to different levels of precision. The maximum precision allowed is 9 decimal places for the seconds; the default (column ts) is 6. I’ve then created two rows in the table using slightly different timestamps for the rows but giving all the columns in a single row the same value. Then I’ve gathered stats – including a histogram – on the table and all its columns.

I can now query user_tab_cols to pick up the low and high values:


select
        column_name, low_value, high_value
from
        user_tab_cols
where
        table_name = 'T1'
order by
        column_name
/

COLUMN_NAME          LOW_VALUE                  HIGH_VALUE
-------------------- -------------------------- --------------------------
T0                   7878040116100B             7878040116100C
T3                   7878040116100B0754D4C0     7878040116100B3AE3AF00
T6                   7878040116100B075BCDE8     7878040116100B3ADE6770
T9                   7878040116100B075BCD15     7878040116100B3ADE68B1
TS                   7878040116100B075BCDE8     7878040116100B3ADE6770


And here’s a dump of the columns so that you can see the actual values held in the table in their internal representation.


select
        dump(t0,16), 
        dump(t3,16), 
        dump(t6,16), 
        dump(t9,16), 
        dump(ts,16) 
from 
        t1;


DUMP(T0,16)
-------------------------------------------------------------------------------------
DUMP(T3,16)
-------------------------------------------------------------------------------------
DUMP(T6,16)
-------------------------------------------------------------------------------------
DUMP(T9,16)
-------------------------------------------------------------------------------------
DUMP(TS,16)
-------------------------------------------------------------------------------------
Typ=180 Len=7: 78,78,4,1,16,10,b
Typ=180 Len=11: 78,78,4,1,16,10,b,7,54,d4,c0
Typ=180 Len=11: 78,78,4,1,16,10,b,7,5b,cd,e8
Typ=180 Len=11: 78,78,4,1,16,10,b,7,5b,cd,15
Typ=180 Len=11: 78,78,4,1,16,10,b,7,5b,cd,e8

Typ=180 Len=7: 78,78,4,1,16,10,c
Typ=180 Len=11: 78,78,4,1,16,10,b,3a,e3,af,0
Typ=180 Len=11: 78,78,4,1,16,10,b,3a,de,67,70
Typ=180 Len=11: 78,78,4,1,16,10,b,3a,de,68,b1
Typ=180 Len=11: 78,78,4,1,16,10,b,3a,de,67,70

You’ll notice that, despite the attempt to limit the precision, all the columns other than the first use 11 bytes. The first column is different: with zero precision we use only 7 bytes, and if you look closely (and know your internal storage formats) you’ll realise that those 7 bytes are using the standard format for stored dates. The last 4 bytes hold the fractional seconds, stored as a number of nanoseconds in a simple 32-bit binary number. (Warning: it’s possible that you will see a difference in this set of bytes if your machine uses a different endianness from mine – I can’t test that for myself at present.)
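You can verify the nanosecond arithmetic for yourself – taking the last 4 bytes of the t6 low value (7,5b,cd,e8, i.e. hex 075BCDE8):

select
        to_number('075BCDE8','XXXXXXXX') nanoseconds    -- 123,457,000 ns = 0.123457 seconds
from
        dual
/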

So let’s see what we get if we try to convert the raw values using the call to dbms_stats.convert_raw_to_date(). I’ll start by setting the nls_XXX formats to get the full content of a date column or a timestamp column reported from SQL*Plus.


alter session set nls_date_format = 'dd-Mon-yyyy hh24:mi:ss';
alter session set nls_timestamp_format = 'dd-Mon-yyyy hh24:mi:ss.ff9';

set linesize 165

column t0 format a32
column t3 format a32
column t6 format a32
column t9 format a32
column ts format a32

select  t0, t3, t6, t9, ts from t1;


select
        column_name, 
        dbms_stats.convert_raw_to_date(low_value)       date_low,
        dbms_stats.convert_raw_to_date(high_value)      date_high
from 
        user_tab_cols
where
        table_name = 'T1'
order by 
        column_name
/



T0                               T3                               T6                               T9                               TS
-------------------------------- -------------------------------- -------------------------------- -------------------------------- --------------------------------
01-Apr-2020 21:15:10.000000000   01-Apr-2020 21:15:10.123000000   01-Apr-2020 21:15:10.123457000   01-Apr-2020 21:15:10.123456789   01-Apr-2020 21:15:10.123457000
01-Apr-2020 21:15:11.000000000   01-Apr-2020 21:15:10.988000000   01-Apr-2020 21:15:10.987654000   01-Apr-2020 21:15:10.987654321   01-Apr-2020 21:15:10.987654000

2 rows selected.


COLUMN_NAME          DATE_LOW             DATE_HIGH
-------------------- -------------------- --------------------
T0                   01-Apr-2020 21:15:10 01-Apr-2020 21:15:11
T3                   01-Apr-2020 21:15:10 01-Apr-2020 21:15:10
T6                   01-Apr-2020 21:15:10 01-Apr-2020 21:15:10
T9                   01-Apr-2020 21:15:10 01-Apr-2020 21:15:10
TS                   01-Apr-2020 21:15:10 01-Apr-2020 21:15:10

5 rows selected.

Since I’ve inserted only two rows, with each row holding a single value at different precisions, we expect those values to be reflected in the low_value and high_value for the columns – and we’re not disappointed (probably). But, as you might have expected, Oracle has processed the timestamp data type into a date data type by simply ignoring the last 4 bytes and dealing with the first 7 bytes as if they were a date. We need to do a little more work to check that we can convert the low and high values into exactly the values that appear in the table itself.

So we can write a messy bit of SQL that extracts the first 7 bytes, converts them using the convert_raw_to_date() call, then extracts the last 4 bytes and converts them to a number using a simple to_number() call (with an nvl() thrown in to deal with the special case of there being no nanosecond component), divides by 1e9, converts the result to an interval in seconds using the numtodsinterval() function, and adds that to the date. To keep things clean I’ll only apply the mess to the low_value column; the comments embedded in the SQL explain each step.


select 
        column_name, 
        low_value,
--
--      convert bytes 8 onwards to numeric with an nvl()
--      to handle the case of zero precision when there 
--      are no bytes to show the number of nanoseconds
--
--      Note - for the substr() calls, bytes 8 onwards turn
--      into characters 15 - 22 of the hex string.
--
        nvl(
                to_number(
                        hextoraw(substr(low_value,15,8)),'XXXXXXXX'
                ),
                0
        ) nanoseconds,
--
--      repeat the conversion, and wrap with numtodsinterval()
--      to show the resulting interval in seconds.
--
        numtodsinterval(
                nvl(
                        to_number(
                                hextoraw(substr(low_value,15,8)),'XXXXXXXX'
                        )/1e9,
                        0
                )+0,
                'SECOND'
        ) interval_val,
--
--      Convert the 1st 7 bytes (14 characters) to a date,
--      coerce the result to a timestamp, then convert the 
--      nanoseconds bytes to an interval and add
--
        to_timestamp(
                dbms_stats.convert_raw_to_date(hextoraw(substr(low_value,1,14)))
        ) + 
        numtodsinterval(
                nvl(
                        to_number(
                                hextoraw(substr(low_value,15,8)),'XXXXXXXX'
                        )/1e9,
                        0
                ),
                'SECOND'
        ) timestamp_val
from 
        user_tab_cols
where
        table_name = 'T1'
order by
        column_name
/

COLUMN_NAME          LOW_VALUE                  NANOSECONDS INTERVAL_VAL                     TIMESTAMP_VAL
-------------------- -------------------------- ----------- -------------------------------- --------------------------------
T0                   7878040116100B                       0 +000000000 00:00:00.000000000    01-Apr-2020 21:15:10.000000000
T3                   7878040116100B0754D4C0       123000000 +000000000 00:00:00.123000000    01-Apr-2020 21:15:10.123000000
T6                   7878040116100B075BCDE8       123457000 +000000000 00:00:00.123457000    01-Apr-2020 21:15:10.123457000
T9                   7878040116100B075BCD15       123456789 +000000000 00:00:00.123456789    01-Apr-2020 21:15:10.123456789
TS                   7878040116100B075BCDE8       123457000 +000000000 00:00:00.123457000    01-Apr-2020 21:15:10.123457000

5 rows selected.

If you go back to the original output we got from selecting the actual values from the table you can see that the timestamp_val column we’ve generated correctly matches the column values for the lower of the two original values we inserted into the table.
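If you needed to do this conversion regularly you could, of course, wrap the logic in a user-defined function. Here’s a minimal sketch – raw_to_timestamp() is my own (hypothetical) name, not something supplied by dbms_stats:

create or replace function raw_to_timestamp(p_raw in raw)
return timestamp
as
        m_hex   varchar2(64) := rawtohex(p_raw);
begin
        --
        --      First 7 bytes (14 hex characters) are a standard date;
        --      bytes 8 onwards (if present) are the nanoseconds.
        --
        return
                to_timestamp(
                        dbms_stats.convert_raw_to_date(hextoraw(substr(m_hex,1,14)))
                ) +
                numtodsinterval(
                        nvl(to_number(substr(m_hex,15,8),'XXXXXXXX'), 0) / 1e9,
                        'SECOND'
                );
end;
/

-- e.g. select column_name, raw_to_timestamp(low_value) from user_tab_cols where table_name = 'T1';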

Q.E.D.

October 30, 2019

Strange Estimates.

Filed under: CBO,extended stats,Oracle,Statistics — Jonathan Lewis @ 1:10 pm GMT Oct 30,2019

A question came up on the Oracle-L list server a couple of days ago expressing some surprise at the following execution plan:


--------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                 |       |       |   845 (100)|          |
|   1 |  SORT AGGREGATE                      |                 |     1 |     7 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| ANY_TABLE       | 84827 |   579K|   845   (1)| 00:00:01 |
|   3 |    SORT CLUSTER BY ROWID             |                 | 68418 |       |    76   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                 | ANY_INDEX       | 68418 |       |    76   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("X"."ANY_COLUMN1"='J')
   4 - access("X"."ANY_COLUMN2"=89155)

You’ll notice that this is a very simple query accessing a table by index, yet the estimated number of table rows found exceeds the estimated number of index entries used to probe the table. How can this happen? The answer (most frequently) is that there’s a mismatch between the table (or, more commonly, column) statistics and the index statistics. This seems to happen very frequently when you start mixing partitioned tables with global (or globally partitioned) indexes, but it can happen in very simple cases, especially since a call to gather_table_stats() with cascade set to true and using the auto_sample_size will take a small sample from the index while using a 100% “sample” from the table.

Here’s an example I engineered very quickly to demonstrate the point. There’s no particular reason for the choice of DML I’ve used on the data beyond a rough idea of setting up a percentage of nulls and deleting a non-uniform pattern of rows.


rem
rem     Script:         table_index_mismatch.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,1000)                n1,
        mod(rownum,1000)                n2,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid WordPress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1, for columns (n1,n2) size 1'
        );
end;
/

create index t1_i1 on t1(n1);

delete from t1 where mod(trunc(sqrt(n1)),7) = 0;
update t1 set n1 = null where mod(n1,10) = 0;
delete from t1 where mod(n1,10) = trunc(dbms_random.value(0,10));

execute dbms_stats.gather_table_stats(user,'t1',estimate_percent=>1)
execute dbms_stats.gather_index_stats(null,'t1_i1',estimate_percent=> 0.01)

Of course you’re not supposed to collect stats with arbitrary samples in any recent version of Oracle, so going for a 1% and 0.01% sample seems a little daft but I’m just doing that to demonstrate the problem with a very small data set.

After generating the data and gathering the stats I ran a few queries to pick out some critical numbers.


select
        table_name, sample_size, num_rows
from
        user_tables
where
        table_name = 'T1'
/

select 
        index_name, sample_size, num_rows, distinct_keys
from
        user_indexes
where
        table_name = 'T1'
and     index_name = 'T1_I1'
/

select
        column_name, sample_size, num_nulls, num_distinct
from
        user_tab_cols
where
        table_name = 'T1'
and     (
            column_name = 'N1'
         or virtual_column = 'YES'
        )
order by
        column_name
/

You’ll notice that I’ve only picked one of my original columns and any virtual columns. My gather_table_stats() call had a method_opt that included the creation of extended stats for the column group (n1, n2) and I want to report the stats on the resulting virtual column.


TABLE_NAME           SAMPLE_SIZE   NUM_ROWS
-------------------- ----------- ----------
T1                          7865     786500


INDEX_NAME           SAMPLE_SIZE   NUM_ROWS DISTINCT_KEYS
-------------------- ----------- ---------- -------------
T1_I1                     385779     713292           714


COLUMN_NAME                      SAMPLE_SIZE  NUM_NULLS NUM_DISTINCT
-------------------------------- ----------- ---------- ------------
N1                                      7012      85300          771
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS          7865          0          855

A couple of observations on the stats

  • the table sample size is, as expected, 1% of the reported num_rows (the actual count is 778,154).
  • The index sample size is much bigger than expected – but that’s probably related to the normal “sample 1,100 leaf blocks” strategy. Because of the skew in the pattern of deleted values it’s possible for the sample size in this model to vary between 694,154 and something in the region of 380,000.
  • The n1 sample size is about 10% smaller than the table sample size – but that’s because I set 10% of the column to null.
  • The column group sample size matches the table sample size because column group hash values are never null, even if an underlying column is null.

So let’s check the execution plan for a very simple query:


set autotrace on explain
select id from t1 where n1 = 140 and n2 = 140;
set autotrace off


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   920 | 11960 |   918   (1)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   920 | 11960 |   918   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T1_I1 |   909 |       |     5   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N2"=140)
   2 - access("N1"=140)

The estimate for relevant index rowids is smaller than the estimate for the number of table rows! The difference is not as extreme as the case reported on Oracle-l, but I’m only trying to demonstrate a principle, not reproduce the exact results.

There are several ways in which contradictory results like this can appear – but in this case we can see the following:

  • For the table access table.num_rows/column.num_distinct = 786,500 / 855 = 919.88  (using the column group num_distinct)
  • For the index range scan: (table.num_rows – column.num_nulls) / column.num_distinct = (786500 – 85300) / 771 = 909.47 (using the n1 statistics)

So, as it becomes possible for the optimizer to take advantage of the column group, the index and table cardinalities are being calculated from incompatible sets of stats (in particular there’s that loss of information about NULLs). The question, then, is “how much is that likely to matter”, and the follow-up if it can matter is “in what circumstances could the effect be large enough to cause problems”. But that’s a topic for another day.
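You can reproduce the two calculations directly from the dictionary figures reported above (the column group name is the one from the earlier output):

select
        round(t.num_rows / cg.num_distinct)                     table_estimate, -- 786,500 / 855 = 920
        round((t.num_rows - c.num_nulls) / c.num_distinct)      index_estimate  -- (786,500 - 85,300) / 771 = 909
from
        user_tables     t,
        user_tab_cols   c,
        user_tab_cols   cg
where
        t.table_name   = 'T1'
and     c.table_name   = 'T1'
and     c.column_name  = 'N1'
and     cg.table_name  = 'T1'
and     cg.column_name = 'SYS_STUBZH0IHA7K$KEBJVXO5LOHAS'
/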

Update / Footnote

In the case of the Oracle-l example there was no column group, and in some cases the optimizer would produce a plan where the table estimate was much smaller than the index estimate, while in other cases (like the opening plan above) the table estimate was significantly greater than the index estimate. This was a side effect of adaptive statistics: the low table estimate was due to the basic “multiply the separate selectivities” arithmetic; but with adaptive statistics enabled Oracle started sampling the table to check the correlation between the two columns, produced a SQL Plan Directive to do so, and got to the higher (and correct) result.

 

 

August 14, 2019

gather_system_stats

Filed under: CBO,Exadata,Oracle,Statistics,System Stats — Jonathan Lewis @ 2:20 pm BST Aug 14,2019

What happens when you execute dbms_stats.gather_system_stats() with the ‘Exadata’ option?

Here’s what my system stats look like (12.2.0.1 test results) after doing so. (The code to generate the two different versions is at the end of the note).


System Stats
============
Status: COMPLETED
Timed: 13-Aug-2019 15:00:00 - 13-Aug-2019 15:00:00
--------------------------------------------------
CPUSPEED        :
CPUSPEEDNW      :          918
IOSEEKTIM       :           10
IOTFRSPEED      :      204,800
MAXTHR          :
MBRC            :          128
MREADTIM        :
SLAVETHR        :
SREADTIM        :

PL/SQL procedure successfully completed.

MBRC       :          128
MREADTIM   :
SREADTIM   :
CPUSPEED   :
CPUSPEEDNW :          918
IOSEEKTIM  :           10
IOTFRSPEED :      204,800
MAXTHR     :
SLAVETHR   :

PL/SQL procedure successfully completed.

All the code does is set the MBRC, IOSEEKTIM, and IOTFRSPEED to fixed values and the only real gather is the CPUSPEEDNW. The parameters showing blanks are deliberately set null by the procedure – before I called gather_system_stats() every parameter had a value. You could also check the SQL trace file (with bind capture enabled) to see the statements that deliberately set those parameters to null if you want more proof.

What are the consequences of this call (assuming you haven’t also done something with the calibrate_io() procedure)? Essentially Oracle now has information that says a single (8KB) block read request will take marginally over 10 milliseconds, and a multiblock read request of 1MB will take just over 15 milliseconds: in other words “tablescans are great, don’t use indexes unless they’re really precisely targeted”. To give you a quantitative feel for the numbers: given the choice between doing a tablescan of 1GB to pick 1,500 randomly scattered rows and using a perfect index, the optimizer would choose the index.

To explain the time calculations: Oracle has set an I/O seek time of 10 ms, and a transfer rate of 204,800 bytes per ms (200 MB/s), with the guideline that a “typical” multiblock read is going to achieve 128 blocks. So the optimizer believes a single block read will take 10 + 8192/204800 ms = 10.04ms, while a multiblock read request for 1MB will take 10 + 1048576/204800 ms = 15.12 ms.
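If you want to see that arithmetic as a query (block size and MBRC hard-coded from the values above):

select
        10 + 8192/204800                sreadtim_ms,    -- 10.04 ms
        10 + (128 * 8192)/204800        mreadtim_ms     -- 15.12 ms
from
        dual
/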

It’s also important to note that Oracle will use the 128 MBRC value in its calculation of the cost of the tablescan – even if you’ve set the db_file_multiblock_read_count parameter for the session or system to something smaller; and if you have set the db_file_multiblock_read_count that’s the maximum size of multiblock read that the run-time engine will use for both cached reads (“db file scattered read” waits) and direct path reads.

Code

Here are the two procedures I used to report the values above. You will only need the privilege to execute the dbms_stats package for the second one, but you’ll need the privilege to access the SYS table aux_stats$ to use the first. The benefit of the first one is that it can’t go out of date as versions change – and, of course, you could just run the SQL statement implied by the procedure (though wrapping the statement in a procedure means you could grant execute privileges on the procedure to non-sys users).


rem
rem     Script:         get_system_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2002
rem
rem     Last tested
rem             18.3.0.0
rem             12.2.0.1
rem             12.1.0.2
rem             11.2.0.4
rem

set linesize 180
set trimspool on
set pagesize 60
set serveroutput on

spool get_system_stats

--      -----------------------------------------------------------
--      This bit will work only for SYS (references sys.aux_stats$)
--      -----------------------------------------------------------

declare
        m_value         number;
        m_status        varchar2(64);
        m_start         date;
        m_stop          date;
begin
        for r1 in (
                select  rownum rn, pname
                from    sys.aux_stats$
                where   sname = 'SYSSTATS_MAIN'
        ) loop
                dbms_stats.get_system_stats(m_status, m_start, m_stop, r1.pname, m_value);
                if r1.rn = 1 then
                        dbms_output.put_line('System Stats');
                        dbms_output.put_line('============');
                        dbms_output.put_line('Status: ' || m_status);
                        dbms_output.put_line(
                                'Timed: ' ||
                                to_char(m_start,'dd-Mon-yyyy hh24:mi:ss') ||
                                ' - ' ||
                                to_char(m_stop ,'dd-Mon-yyyy hh24:mi:ss')
                        );
                        dbms_output.put_line('--------------------------------------------------');
                end if;
                dbms_output.put_line(rpad(r1.pname,15) ||  ' : ' || to_char(m_value,'999,999,999'));
        end loop;
end;
/

--      --------------------------------------------------------
--      This bit will work for anyone who can execute dbms_stats
--      --------------------------------------------------------

declare
        m_value         number;
        m_status        varchar2(64);
        m_start         date;
        m_stop          date;
begin
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MBRC', m_value);
        dbms_output.put_line('MBRC       : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MREADTIM', m_value);
        dbms_output.put_line('MREADTIM   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'SREADTIM', m_value);
        dbms_output.put_line('SREADTIM   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'CPUSPEED', m_value);
        dbms_output.put_line('CPUSPEED   : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'CPUSPEEDNW', m_value);
        dbms_output.put_line('CPUSPEEDNW : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'IOSEEKTIM', m_value);
        dbms_output.put_line('IOSEEKTIM  : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'IOTFRSPEED', m_value);
        dbms_output.put_line('IOTFRSPEED : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'MAXTHR', m_value);
        dbms_output.put_line('MAXTHR     : ' || to_char(m_value,'999,999,999'));
        dbms_stats.get_system_stats(m_status, m_start, m_stop, 'SLAVETHR', m_value);
        dbms_output.put_line('SLAVETHR   : ' || to_char(m_value,'999,999,999'));
end;
/

spool off

Update (Feb 2021)

Testing some mixed effects of gather_system_stats(‘EXADATA’) and set_system_stats({parameter},{value}) I’ve found that on 19c the call for Exadata stats usually sets the IOSEEKTIM to 10, but sometimes sets it to 0 or 1. (As mentioned above, 12.2.0.1 always seems to set it to 10.)
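If you do hit one of the anomalous values you can always set it back by hand; a one-line sketch:

begin
        dbms_stats.set_system_stats('IOSEEKTIM', 10);   -- restore the expected seek time (ms)
end;
/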

 
