Today’s little puzzle comes courtesy of *the Oracle-L mailing list*. A table has two columns (**c2** and **c3**) which contain only the values *‘Y’* and *‘N’*, with the following distribution:

```
select c2, c3, count(*) from t1 group by c2, c3;

C C   COUNT(*)
- - ----------
N Y       1994
Y N      71482

2 rows selected.
```

The puzzle is this – how do you get the optimizer to predict a cardinality of zero (or, using its best approximation, 1) if you execute a query where the predicate is:

```
where c2 = 'N' and c3 = 'N'
```

Here are 4 tests you might try:

- Create simple stats (no histograms) on c2 and c3
- Create frequency histograms on c2 and c3
- Create a column group (extended stats) on (c2,c3) but no histograms
- Create a column group (extended stats) on (c2,c3) with a histogram on (c2,c3)

If you do these tests you’ll find the estimated cardinalities are (from 12.1.0.2):

- 18,369 – derived as 73,476 / 4 … total rows over total possible combinations
- 1,940 – derived as 73,476 * (1,994/73,476) * (71,482/73,476) … total rows * fraction where c2 = ‘N’ * fraction where c3 = ‘N’
- 36,738 – derived as 73,476 / 2 … total rows / number of distinct combinations of (c2,c3)
- 997 – derived as 1,994 / 2 … half the frequency of the least frequently occurring value in the histogram
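As a cross-check, the arithmetic behind those four estimates can be reproduced in a few lines (Python here purely for the arithmetic; the row counts come from the distribution shown above):

```python
# Reproducing the optimizer's four cardinality estimates from the data:
# 1,994 ('N','Y') rows and 71,482 ('Y','N') rows.
n_ny  = 1994
n_yn  = 71482
total = n_ny + n_yn                      # 73,476 rows in all

# 1: simple stats - total rows / (num_distinct(c2) * num_distinct(c3))
est_simple = total / (2 * 2)

# 2: frequency histograms - independent selectivities multiplied together
est_freq = total * (n_ny / total) * (n_yn / total)

# 3: column group, no histogram - total rows / distinct combinations
est_group = total / 2

# 4: column group with histogram - half the least popular frequency
est_hist = n_ny / 2

print(round(est_simple), round(est_freq), round(est_group), round(est_hist))
```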

The last *algorithm appeared in 10.2.0.4*; prior to that a *“value not in frequency histogram”* would have been given an estimated cardinality of 1 (which is what the person on Oracle-L wanted to see).

In fact the optimizer’s behaviour can be reverted to the 10.2.0.3 mechanism by setting fix-control 5483301 to zero (or off), either with an *“alter session”* call or inside the /*+ opt_param() */ hint. There is, however, another option – if you get the column stats, then immediately set them (*dbms_stats.get_column_stats()*, *dbms_stats.set_column_stats()*) the optimizer defines the stats as *“user defined”* and (for reasons I don’t know – perhaps it’s an oversight) reverts to the 10.2.0.3 behaviour. Here’s some code to demonstrate the point; as the script header says, I’ve tested it on versions up to 18.1:

```
rem
rem     Script:         histogram_hack_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2018
rem
rem     Last tested
rem             18.1.0.0        via LiveSQL (with some edits)
rem             12.2.0.1
rem             12.1.0.2
rem

create table t1
as
select 'Y' c2, 'N' c3
from   all_objects
where  rownum <= 71482 -- > comment to avoid format issue
union all
select 'N' c2, 'Y' c3
from   all_objects
where  rownum <= 1994  -- > comment to avoid format issue
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 10 for columns (c2,c3) size 10');

column column_name format a128 new_value m_colname

select  column_name
from    user_tab_cols
where   table_name = 'T1'
and     column_name not in ('C2','C3')
;

set autotrace traceonly explain
select /* pre-hack */ * from t1 where c2 = 'N' and c3 = 'N';
set autotrace off

declare
        l_distcnt  number default null;
        l_density  number default null;
        l_nullcnt  number default null;
        l_srec     dbms_stats.statrec;
        l_avgclen  number default null;
begin

        dbms_stats.get_column_stats (
                ownname  => user,
                tabname  => 't1',
                colname  => '&m_colname',
                distcnt  => l_distcnt,
                density  => l_density,
                nullcnt  => l_nullcnt,
                srec     => l_srec,
                avgclen  => l_avgclen
        );

        dbms_stats.set_column_stats(
                ownname  => user,
                tabname  => 't1',
                colname  => '&m_colname',
                distcnt  => l_distcnt,
                density  => l_density,
                nullcnt  => l_nullcnt,
                srec     => l_srec,
                avgclen  => l_avgclen
        );

end;
/

set autotrace traceonly explain
select /* post-hack */ * from t1 where c2 = 'N' and c3 = 'N';
set autotrace off
```

I’ve created a simple table for the data and collected stats, including histograms, on the two columns and on the column group. I’ve taken a simple strategy to find the name of the column group (I could have used the function *dbms_stats.create_extended_stats()* to set an SQL variable to the name of the column group, of course), and then run a little bit of PL/SQL that literally does nothing more than copy the column group’s stats into memory then write them back to the data dictionary.
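For completeness, here’s a sketch of that alternative way of capturing the column-group name directly – *dbms_stats.create_extended_stats()* returns the name of the extension, so (assuming the extension already exists from the gather call above, which the function simply reports rather than recreating) the substitution variable could be set like this:

```sql
-- A sketch: capture the system-generated column-group name
-- (e.g. SYS_STU...) straight from the function's return value.
column extension_name new_value m_colname

select
        dbms_stats.create_extended_stats(user, 't1', '(c2,c3)') extension_name
from    dual
;
```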

Here are the “before” and “after” execution plans that we get from autotrace:

```
BEFORE
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   997 |  3988 |    23  (27)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   997 |  3988 |    23  (27)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("C2"='N' AND "C3"='N')

AFTER
--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |     4 |    23  (27)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |     1 |     4 |    23  (27)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("C2"='N' AND "C3"='N')
```

As required – the estimate for the (‘N’,’N’) rows drops down to (the optimizer’s best approximation to) zero.

### Footnote:

An alternative strategy (and, I’d say, a better strategic approach) would have been to create a “fake” frequency histogram that included the value (‘N’,’N’) with a frequency of 1 – a method *I’ve suggested in the past* – but with the little problem that you need to be able to work out the value to use in the array passed to *dbms_stats.set_column_stats()* to represent the (‘N’,’N’) combination – and I’ve written about that topic *in the past as well*.

You might wonder why the optimizer is programmed to use *“half the least popular”* for predicates referencing values not in the histogram. Prior to 12c it’s easy to make an argument for the algorithm. Frequency histograms used to be gathered with a very small sample size, so if you were unlucky a “slightly less popular” value could be missed completely in the sample; if you were requesting a value that didn’t appear in the histogram then (presumably) you knew it should exist in the data, so guessing a cardinality somewhat less than that of the least popular value must have seemed like a good idea.

In 12c, of course, you ought to be taking advantage of the *“approximate NDV” implementation*, which uses a 100% sample to generate frequency (and Top-N / Top-Frequency) histograms. If you’ve got a 12c frequency histogram then the absence of a value in the histogram means the data really wasn’t there, so a cardinality estimate of 1 makes more sense. (Of course, you might have allowed Oracle to gather the histogram at the wrong time – but that’s a different issue.) If you’ve got a Top-N histogram then the optimizer will behave as if a “missing” value is one of those nominally allowed for in the “low frequency” bucket and use neither the 1 nor the “half the least popular” estimate.

So, for 12c and columns with frequency histograms it seems perfectly reasonable to set the fix control to zero – after getting approval from Oracle support, of course.
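For reference, a sketch of the two forms of setting the fix control mentioned above (as ever with underscore parameters and fix controls, don’t do this on a production system without Oracle support’s blessing):

```sql
-- Session-level: revert to the 10.2.0.3 behaviour for the whole session
alter session set "_fix_control" = '5483301:0';

-- Statement-level: disable the fix for just one query via opt_param()
select  /*+ opt_param('_fix_control' '5483301:0') */
        *
from    t1
where   c2 = 'N' and c3 = 'N'
;
```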
