## May 28, 2009

### Frequency Histograms

Filed under: CBO,Histograms,Statistics,Tuning — Jonathan Lewis @ 7:34 pm BST May 28,2009

Update September 2018:
To create a frequency histogram in 12c the cumulative frequency has to be 2,050 or greater. The stats record also has a new array to hold “repeat counts”, but this will be allocated and set to zeros in the cases where you are creating a simple frequency histogram. The text of this note still refers to the original example, where the cumulative frequency total is 1,000.

Once I’d realised that this new limit was relevant in 12c it occurred to me to wonder if perhaps there’s an equivalent limit in 11g – and there is. If you try to create a fake frequency histogram with a cumulative frequency less than 256 you will get a height-balanced histogram.

In a recent article on changes to the calculation of cardinality with frequency histograms I commented on writing programs to create fake but realistic and complete frequency histograms, as a way of working around the dramatic changes in execution plan that can occur if you let Oracle create the histograms by gathering stats.

As a follow-up to that comment, here’s a little script to demonstrate the method (for a numeric column), and a few points to remember when using this method.

We start with a very simple table, and collect statistics – excluding histograms:

```
rem
rem     Script:         column_stats_min.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jun 2002
rem

create table t1 as
select
        rownum          n1
from    all_objects
where
        rownum <= 10000 -- > comment to avoid wordpress format issue
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

```

The following anonymous pl/sql block then changes the column stats to add a histogram to the column – adjusting the density at the same time.

```
declare

        m_distcnt               number;
        m_density               number;
        m_nullcnt               number;
        srec                    dbms_stats.statrec;
        m_avgclen               number;

        n_array                 dbms_stats.numarray;

begin

        m_distcnt       := 5;
        m_density       := 0.02;
        m_nullcnt       := 0;
        m_avgclen       := 3;

        n_array         := dbms_stats.numarray( 75,  81,   88,  91,    99) ;

        srec.bkvals     := dbms_stats.numarray(  2,  40,  200,   3,   755);

--      Cumulative total needs to reach 2,050 in 12c to get a frequency histogram.
--      srec.bkvals     := dbms_stats.numarray(  6, 120,  600,   9,  2265);

--      rpcnts is a new feature for 12c Hybrid histograms, shown as zeros here.
--      srec.rpcnts     := dbms_stats.numarray(  0,   0,    0,   0,     0);

        srec.epc        := 5;

        dbms_stats.prepare_column_values(srec, n_array);

        dbms_stats.set_column_stats(
                ownname         => user,
                tabname         => 't1',
                colname         => 'n1',
                distcnt         => m_distcnt,
                density         => m_density,
                nullcnt         => m_nullcnt,
                srec            => srec,
                avgclen         => m_avgclen
        );

end;
/

```

By setting the n_array, I have listed the “popular” values that I want to tell the optimizer about. Note that these values must be in sorted order, lowest first. (The dbms_stats package also has the option to specify chararray, datearray, rawarray, fltarray (float) and dblarray (double) types – the last two arriving in 10g.)

The list of frequencies in srec.bkvals (bucket values) totals 1,000, and indicates that for every 1,000 rows in the table 2 rows will have the value 75, 40 rows will have the value 81, 200 rows will have the value 88 and so on.

By setting the density to 0.02 I have then instructed the optimizer that any predicate of the type “column = value_not_in_histogram” is supposed to return 2% of the rows in the table.  (Note – for purposes of demonstration this is deliberately not consistent with the histogram I have created. Note also that this changes in 10.2.0.4 – see closing paragraphs)
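
In round numbers, the arithmetic this sets up can be sketched as follows – a simple Python model of the two rules just described, not the optimizer’s actual code:

```python
# Model of the 10.2.0.3 estimates described above: histogram values are
# scaled up from "rows per 1,000" to the table size; values missing from
# the histogram use density * num_rows.

num_rows  = 10000
cum_total = 1000                                 # sum of srec.bkvals
freq      = {75: 2, 81: 40, 88: 200, 91: 3, 99: 755}
density   = 0.02

def estimated_cardinality(value):
    if value in freq:
        return num_rows * freq[value] / cum_total
    return num_rows * density

print(estimated_cardinality(81))                 # 400.0 - value in the histogram
print(estimated_cardinality(85))                 # 200.0 - missing, density-based
```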

Having set the statistics I can now check to see what Oracle has recorded in the data dictionary:

```
select
        num_distinct,
        low_value,
        high_value,
        density,
        num_nulls,
        num_buckets,
        histogram
from
        user_tab_columns
where
        table_name = 'T1'
and     column_name = 'N1'
;

NUM_DISTINCT LOW_VALUE  HIGH_VALUE  DENSITY  NUM_NULLS NUM_BUCKETS HISTOGRAM
------------ ---------- ----------- ------- ---------- ----------- ---------
           5 C14C       C164            .02          0           5 FREQUENCY

select
        endpoint_value,
        endpoint_number,
        endpoint_number - nvl(prev_number,0)    frequency
from    (
        select
                endpoint_value,
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )       prev_number
        from
                user_tab_histograms
        where
                table_name = 'T1'
        and     column_name = 'N1'
        )
order by
        endpoint_value
;

ENDPOINT_VALUE ENDPOINT_NUMBER  FREQUENCY
-------------- --------------- ----------
            75               2          2
            81              42         40
            88             242        200
            91             245          3
            99            1000        755
```

A frequency histogram (originally called a “value-based” histogram by Oracle) is stored as a “cumulative frequency histogram” – but the original frequencies can be retrieved by the use of the lag() or lead() analytic functions.
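
The same decoding can be sketched outside the database – a small Python equivalent of the lag() query above, assuming the endpoint pairs have already been fetched:

```python
# Recover the per-value frequencies from the cumulative endpoint numbers,
# mirroring the lag() analytic query above.

endpoints = [(75, 2), (81, 42), (88, 242), (91, 245), (99, 1000)]

frequencies = {}
prev = 0
for value, endpoint_number in endpoints:      # already in endpoint order
    frequencies[value] = endpoint_number - prev
    prev = endpoint_number

print(frequencies)   # {75: 2, 81: 40, 88: 200, 91: 3, 99: 755}
```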

Having created the histogram, we can now test its effect – in this case using 10.2.0.3. We’ll use a simple ‘select where column = constant’, using the values 81 (which is in the histogram), 85 (which isn’t) and then 75  (which is in the histogram – but is an anomaly). Here, in order, are the three outputs from dbms_xplan.display() with CPU costing disabled:

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   400 |  1200 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   400 |  1200 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=81)
```

Since we have 10,000 rows in the table, and our frequency histogram said the value 81 would appear in 40 rows out of every 1,000 the estimated cardinality is 400.

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=85)
```

Since 85 does not appear in the histogram the optimizer multiplies the density (0.02) by the number of rows in the table to get 200 as the estimated cardinality.

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=75)
```

Since we have 10,000 rows in the table, and our frequency histogram said the value 75 would appear in 2 rows out of every 1,000, the estimated cardinality ought to be 20 – except there’s a special limiting case: the cardinality cannot be less than the “missing values” cardinality, which is why the plan reports 200.
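
The limiting case is easy to express – a one-line Python sketch of the rule (my reading of the behaviour, not a documented formula):

```python
num_rows  = 10000
cum_total = 1000                                 # sum of the bucket values
density   = 0.02

hist_estimate    = num_rows * 2 / cum_total      # 20 rows for the value 75
missing_estimate = num_rows * density            # 200 rows for a missing value
final_estimate   = max(hist_estimate, missing_estimate)
print(final_estimate)                            # 200.0 - matches the plan
```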

Of course, as I pointed out in the earlier article, things change in 11g and 10.2.0.4 – the optimizer seems to ignore the density. So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 85. Then the “linear decay” rule applies for values from 100 to 124 and from 74 down to 50. So when you upgrade, instead of setting the density to describe how to treat missing values, you may need to add an extra value to the histogram recording double the cardinality you want attributed to missing values – and the fake value may have to be chosen carefully to work around the effects of the linear decay algorithm.
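
As a rough picture of that 11g arithmetic, here’s an illustrative Python sketch – an approximation of the model rather than Oracle’s code, and the exact end points vary slightly with version:

```python
# Illustrative model of "half the minimum known" plus linear decay
# outside the known low/high range of the column.

low, high = 75, 99                   # known low/high values for the column
width     = high - low
half_min  = 10                       # half the lowest known cardinality (20 / 2)

def decayed_cardinality(value):
    if low <= value <= high:
        return half_min                              # missing value inside the range
    distance = (low - value) if value < low else (value - high)
    if distance >= width:
        return 0                                     # more than one range-width away
    return half_min * (1 - distance / width)         # straight line down to zero

print(decayed_cardinality(111))                      # 5.0 - halfway through the decay
```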

[Further reading on Histograms]

## 25 Comments »

1. Jonathan,

very nice demonstration. I wasn’t aware of the “missing values limiting case” in the pre-10.2.0.4 and 11.1.0.6 behaviour.

“So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 81”

Shouldn’t this be 85 for the “half the minimum known” case, since 81 is part of the histogram?

Regards,
Randolf

Comment by Randolf Geist — May 28, 2009 @ 9:48 pm BST May 28,2009

2. I don’t know exactly what the “linear decay” rule means.

Could you explain the “linear decay” rule in a little more detail?

Comment by adenkang — May 29, 2009 @ 3:21 pm BST May 29,2009

There is a brief reference describing the principle in the previous post, and a description (with diagram) in my book.

As an example: assume you have a value that ranges between 0 and 100 (with limits stored by Oracle as the low_value and high_value for the column). If you run a query referencing a value outside that range Oracle 10g has a model that assumes you expect to get some data from the query, but the volume you get decreases as you move further from the known range.

The model uses the existing range as the centre point, and assumes your data covers three times that range – so if you query for values less than -100, or values above 200, the optimizer assumes there is no data; but if you query for a value between -100 and 0, or between +100 and +200 the optimizer uses a straight-line graph from the basic selectivity down to zero to model those ranges.

Comment by Jonathan Lewis — May 30, 2009 @ 10:34 am BST May 30,2009

3. Hi Jonathan,

What about the case when a literal is chosen outside the range but with a not equal.

I have 10.2.0.4 database with Siebel
7 million rows and a column with ‘Y’ for every row (i checked)

Strange enough histograms are created by gather_stats_job.
Dba_tab_col_statistics reports num_distinct = 1 and density 5e-8!!!

The query which joins about 16 tables has
the following condition:
flag != ‘N’ and flag=:bind_variable (which contains value Y)

The optimizer chooses and index skip scan on an index which contains the flag column.

My guess is that because of the != ‘N’ which is outside the histograms the optimizer falls back on the density.

Is my guess right?
I browsed your book on histograms and not equal clauses but could not find it.

regards Hans-Peter

Comment by Hans-Peter — August 11, 2009 @ 6:04 am BST Aug 11,2009

• Hans-Peter,

I’m slightly surprised that I failed to mention “!=” and histograms in the book, but I guess I had to miss a few things out.

Given you’re looking at 10.2.0.4 (which has made some changes as the previous article pointed out) this might be an area that’s changed anyway.

I always worry that “one” is a very special value for all sorts of reasons – I think Wolfgang Breitling has found various oddities with histograms and one distinct value in the past.

I may have some notes somewhere – I’ll take a look when I have a little time.

Comment by Jonathan Lewis — August 11, 2009 @ 6:18 am BST Aug 11,2009

4. […] path randomly (and catastrophically) because of a histogram collection problem, and suggested a short script to create a fixed representative histogram that would stop that problem re-appearing. Comments […]

Pingback by Block size – again « Oracle Scratchpad — January 1, 2010 @ 3:01 pm GMT Jan 1,2010

5. Hi Jonathan

It seems that you can force the old (pre 10.2.0.4) behavior by just getting and setting the stats for the column, without changing anything. The only effect this appears to have is to set user_stats=’YES’ on dba_tab_columns. The 10053 trace does not give any clues as to why this works, it just seems to revert to using the column density again for missing histogram values, when user_stats=’YES’.

I don’t know whether this is a bug or the intended behavior, but it gives you a finer grain of control by allowing you to choose the columns for which you want the old or the new behavior, without having to “alter session set “_fix_control”=’5483301:off’”.

The following test case demonstrates:

```
------------------ Start of test case.
drop table x
/

create table x as
select 1 code,object_name from all_objects where rownum<=1000
union
select 2 code,object_name from all_objects where rownum<=1000
union
select 4 code,object_name from all_objects where rownum<=1000
;

exec dbms_stats.gather_table_stats(user,'X',method_opt => 'FOR ALL COLUMNS SIZE 254');

set autotrace traceonly explain

select * from x where code=2 -- value in histogram
/

select * from x where code=3 -- missing value
/

DECLARE
        l_distcnt NUMBER DEFAULT NULL;
        l_density NUMBER DEFAULT NULL;
        l_nullcnt NUMBER DEFAULT NULL;
        l_srec    DBMS_STATS.STATREC;
        l_avgclen NUMBER DEFAULT NULL;

BEGIN

        DBMS_STATS.GET_COLUMN_STATS (
                ownname => user,
                tabname => 'X',
                colname => 'CODE',
                distcnt => l_distcnt,
                density => l_density,
                nullcnt => l_nullcnt,
                srec    => l_srec,
                avgclen => l_avgclen
        );

        DBMS_STATS.SET_COLUMN_STATS(
                ownname => user,
                tabname => 'X',
                colname => 'CODE',
                distcnt => l_distcnt,
                density => l_density,
                nullcnt => l_nullcnt,
                srec    => l_srec,
                avgclen => l_avgclen
        );

END;
/

select * from x where code=3
/

------------------ End of test case.
```

Here’s the output:

```
SQL> create table x as
  2  select 1 code,object_name from all_objects where rownum<=1000
  3  union
  4  select 2 code,object_name from all_objects where rownum<=1000
  5  union
  6  select 4 code,object_name from all_objects where rownum<=1000
  7  ;

Table created.

SQL>
SQL> exec dbms_stats.gather_table_stats(user,'X',method_opt => 'FOR ALL COLUMNS SIZE 254');

PL/SQL procedure successfully completed.

SQL>
SQL> set autotrace traceonly explain
SQL>
SQL> select * from x where code=2   -- value in histogram
2  /

Execution Plan
----------------------------------------------------------
Plan hash value: 2941724873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   992 | 16864 |     5   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| X    |   992 | 16864 |     5   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("CODE"=2)
--
--  Value in histogram, cardinality = 992
--
SQL>
SQL> select * from x where code=3   -- missing value
2  /

Execution Plan
----------------------------------------------------------
Plan hash value: 2941724873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   496 |  8432 |     5   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| X    |   496 |  8432 |     5   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("CODE"=3)
--
--  Value not in histogram, cardinality = 496 i.e. half the least value, as expected
--  for 10.2.0.4.
--

SQL>
SQL> DECLARE
2       l_distcnt     NUMBER       DEFAULT NULL;
3       l_density     NUMBER       DEFAULT NULL;
4       l_nullcnt     NUMBER       DEFAULT NULL;
5       l_srec        DBMS_STATS.STATREC;
6       l_avgclen     NUMBER       DEFAULT NULL;
7
8    BEGIN
9
10    DBMS_STATS.GET_COLUMN_STATS (
11       ownname=> user,
12       tabname=> 'X',
13       colname=> 'CODE',
14       distcnt=> l_distcnt,
15       density=> l_density,
16       nullcnt=> l_nullcnt,
17       srec=> l_srec,
18       avgclen=> l_avgclen);
19
20    DBMS_STATS.SET_COLUMN_STATS(
21       ownname=> user,
22       tabname=> 'X',
23       colname=> 'CODE',
24       distcnt=> l_distcnt,
25       density=> l_density,
26       nullcnt=> l_nullcnt,
27       srec=> l_srec,
28       avgclen=> l_avgclen);
29
30    END;
31  /

PL/SQL procedure successfully completed.

SQL>
SQL> select * from x where code=3
2  /

Execution Plan
----------------------------------------------------------
Plan hash value: 2941724873

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     1 |    17 |     5   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| X    |     1 |    17 |     5   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("CODE"=3)

--
--  Cardinality = 1 i.e. the pre 10.2.0.4 behavior.
--
```

Regards,
Mike Barratt

Comment by Mike Barratt — January 7, 2010 @ 12:28 pm GMT Jan 7,2010

• Mike,

Very interesting. I wonder if it’s deliberate or an accident. Certainly worth checking in 11g. [Update: just ran your example in 11.1.0.6, and the effect does not appear there.]

This could be the answer to the concern I raised in the last paragraph about the increased complexity of imposing a selectivity for missing values – the old method will still work.

Comment by Jonathan Lewis — January 7, 2010 @ 9:55 pm GMT Jan 7,2010

• Mike, Jonathan

I do not have any 11gR1 to test. However it seems this unsuitable behaviour (at least to my mind) is back in 11gR2 :-/

Regards
Pavol Babel

Comment by Pavol Babel — April 24, 2014 @ 3:19 pm BST Apr 24,2014

• Pavol,

The posting is so old that I’d have to start reading from scratch to get up to speed on it and cross-check the stats.

I do recall seeing an article on MoS, though, saying something about anomalous behaviour to do with the density calculation and things that can make the optimizer switch between the old and new methods – this was some time ago, though, and I don’t remember the details. It’s the type of thing I remember and look up when I hit a specific problem.

Comment by Jonathan Lewis — April 25, 2014 @ 6:09 pm BST Apr 25,2014

6. […] posted an example of creating a frequency histogram a few months ago – and it doesn’t take much to turn this into an example for the […]

Pingback by Fake Histograms « Oracle Scratchpad — March 23, 2010 @ 8:49 pm GMT Mar 23,2010

7. […] Related reading: how to create an “artificial” frequency histogram. […]

Pingback by Frequency Histogram 3 « Oracle Scratchpad — September 24, 2010 @ 6:47 pm BST Sep 24,2010

8. […] solution, of course, is to write code to create a histogram that represents the situation as you want Oracle to see […]

Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010

9. […] to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of […]

Pingback by Local Indexes – 2 « Oracle Scratchpad — November 14, 2010 @ 5:42 pm GMT Nov 14,2010

10. […] huge number of closed orders; we’d also have to create a histogram on the column (possibly by writing a simple script) so that Oracle could recognise the skewed data […]

Pingback by FBI oddities « Oracle Scratchpad — December 16, 2010 @ 6:18 pm GMT Dec 16,2010

11. […] the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it […]

Pingback by Histograms for strongly skewed columns « Oracle Diagnostician — May 23, 2012 @ 4:35 pm BST May 23,2012

12. Hi Jonathan.

There is something strange with density with 11.2.0.2 ..

Table stats:
#Rows: 1000000 #Blks: 47414 AvgRowLen: 330.00 ChainCnt: 0.00

I have a column with only one distinct value.
AvgLen: 3 NDV: 1 Nulls: 0 Density: 1.000000

Now, after i’ve updated 1000 rows to null, the density dropped down .
AvgLen: 3 NDV: 1 Nulls: 1000 Density: 0.000001
Histogram: Freq #Bkts: 1 UncompBkts: 5452 EndPtVals: 1

Comment by Sagi Zvi — October 27, 2012 @ 5:28 pm BST Oct 27,2012

• Sagi,

First you need to work out what you did to get these results – then you can explain how I could reproduce the example.
As it is you haven’t told me how you collected the statistics, or what the table looks like, or how it’s indexed, or what query you ran. Moreover, you haven’t given me any assurance that the pieces of the trace file are equivalent, so you may have omitted lines from the first sample that you’ve shown in the second.

As far as “strange” is concerned – I think it’s possible that you collected stats the first time in a way that got 100% sample and no histogram, and in the second case generated a histogram off a small sample – although I would have expected 5e-7 for the second density. However, I don’t see the “NewDensity” figures, so I’m not sure you haven’t been running with optimizer_features_enable set to a lower value.

Comment by Jonathan Lewis — December 13, 2012 @ 3:01 pm GMT Dec 13,2012

13. Thanks for the article!
How to interpret endpoint_value for non-numeric columns?
I have a column with frequency histogram on varchar column and this puzzles me.
E.g. 4.4134523297546E35 has to be translated into a 1-2 character string somehow.
Regards

Comment by dautkhanov — October 2, 2013 @ 7:19 pm BST Oct 2,2013

14. […] the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it […]

Pingback by Histograms for strongly skewed columns | Oracle Diagnostician — February 3, 2015 @ 11:07 am GMT Feb 3,2015

15. […] a frequency – or Top-N frequency in 12c – histogram on capitn_prcs_ind, and may even “fake” it to ensure that an automatic call to gather table stats didn’t do something that made the […]

Pingback by Understanding SQL | Oracle Scratchpad — June 1, 2015 @ 4:54 pm BST Jun 1,2015

16. […] included the value (‘N’,’N’) giving it a frequency of 1 – a method I’ve suggested in the past  but with the little problem that you need to be able to work out the value to use in the array […]

Pingback by Extended Histograms | Oracle Scratchpad — July 31, 2018 @ 11:05 pm BST Jul 31,2018

17. […] Faking a frequency histogram    How to create frequency histograms (using a numeric column for the example) […]

Pingback by Faking Histograms | Oracle Scratchpad — October 15, 2018 @ 1:37 pm BST Oct 15,2018
