In a recent article on changes to the calculation of **cardinality with frequency histograms** I made a comment about writing programs to create fake but realistic and complete frequency histograms to work around the problems of dramatic changes in execution plan that can occur if you let Oracle create the histograms by gathering stats.

As a follow-up to that comment, here’s a little script to demonstrate the method (for a numeric column), and a few points to remember when using this method.

We start with a very simple table, and collect statistics – excluding histograms:

```sql
create table t1
as
select	rownum n1
from	all_objects
where	rownum <= 10000
;

begin
	dbms_stats.gather_table_stats(
		ownname          => user,
		tabname          => 'T1',
		estimate_percent => 100,
		method_opt       => 'for all columns size 1'
	);
end;
/
```

The following anonymous pl/sql block then changes the column stats to add a histogram to the column – adjusting the *density* at the same time.

```sql
declare
	m_distcnt	number;
	m_density	number;
	m_nullcnt	number;
	m_avgclen	number;
	srec		dbms_stats.statrec;
	n_array		dbms_stats.numarray;
begin
	m_distcnt := 100;
	m_density := 0.02;
	m_nullcnt := 0;
	m_avgclen := 3;

	n_array     := dbms_stats.numarray(75, 81, 88, 91, 99);
	srec.bkvals := dbms_stats.numarray( 2, 40, 200,  3, 755);
	srec.epc    := 5;

	dbms_stats.prepare_column_values(srec, n_array);

	dbms_stats.set_column_stats(
		ownname => user,
		tabname => 't1',
		colname => 'n1',
		distcnt => m_distcnt,
		density => m_density,
		nullcnt => m_nullcnt,
		srec    => srec,
		avgclen => m_avgclen
	);
end;
/
```

By setting the **n_array**, I have listed the *“popular”* values that I want to tell the optimizer about. Note that these values must be in sorted order, lowest first. (The **dbms_stats** package also has the option to specify chararray, datearray, rawarray, fltarray (float) and dblarray (double) types – the last two arriving in 10g.)

The list of frequencies in **srec.bkvals** (bucket values) totals 1,000, and indicates that for every 1,000 rows in the table 2 rows will have the value 75, 40 rows will have the value 81, 200 rows will have the value 88, and so on.
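As a sketch of what happens to those figures, here is a short Python illustration (not Oracle code) of the conversion that **dbms_stats.prepare_column_values** handles, among other things, inside the database – turning the per-value frequencies into the cumulative endpoint numbers that Oracle actually stores:

```python
from itertools import accumulate

# n_array: the "popular" values, sorted ascending
values = [75, 81, 88, 91, 99]

# srec.bkvals: rows per 1,000 for each value
frequencies = [2, 40, 200, 3, 755]

# Oracle stores a running total as the endpoint_number of each bucket
endpoint_numbers = list(accumulate(frequencies))

for v, e in zip(values, endpoint_numbers):
    print(v, e)
# endpoint_numbers == [2, 42, 242, 245, 1000]
```

The final endpoint number (1,000 here) is the notional total row count that the individual frequencies are measured against.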

By setting the **density** to 0.02 I have then instructed the optimizer that any predicate of the type *“column = value_not_in_histogram”* is supposed to return 2% of the rows in the table. (Note – for purposes of demonstration this is deliberately not consistent with the histogram I have created. Note also that this changes in 10.2.0.4 – see closing paragraphs.)

Having set the statistics I can now check to see what Oracle has recorded in the data dictionary:

```sql
select
	num_distinct, low_value, high_value,
	density, num_nulls, num_buckets, histogram
from
	user_tab_columns
where
	table_name  = 'T1'
and	column_name = 'N1'
;

NUM_DISTINCT LOW_VALUE  HIGH_VALUE  DENSITY  NUM_NULLS NUM_BUCKETS HISTOGRAM
------------ ---------- ----------- ------- ---------- ----------- ---------
         100 C14C       C164            .02          0           5 FREQUENCY

select
	endpoint_value, endpoint_number,
	endpoint_number - nvl(prev_number,0) frequency
from	(
	select
		endpoint_value, endpoint_number,
		lag(endpoint_number,1) over(
			order by endpoint_number
		) prev_number
	from
		user_tab_histograms
	where
		table_name  = 'T1'
	and	column_name = 'N1'
	)
order by
	endpoint_value
;

ENDPOINT_VALUE ENDPOINT_NUMBER  FREQUENCY
-------------- --------------- ----------
            75               2          2
            81              42         40
            88             242        200
            91             245          3
            99            1000        755
```
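The *lag()* arithmetic in that second query can be sketched in a few lines of Python – recovering the original per-value frequencies from the cumulative endpoint numbers:

```python
endpoint_values  = [75, 81, 88, 91, 99]
endpoint_numbers = [2, 42, 242, 245, 1000]

# lag(endpoint_number, 1) with nvl(..., 0) for the first row
prev = [0] + endpoint_numbers[:-1]

# frequency = endpoint_number - previous endpoint_number
frequencies = [e - p for e, p in zip(endpoint_numbers, prev)]
# frequencies == [2, 40, 200, 3, 755]
```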

A frequency histogram (originally called a *“value-based”* histogram by Oracle) is stored as a *“cumulative frequency histogram”* – but the original frequencies can be retrieved by the use of the *lag()* or *lead()* analytic functions.

Having created the histogram, we can now test its effect – in this case using 10.2.0.3. We’ll use a simple *‘select where column = constant’*, using the values 81 (which is in the histogram), 85 (which isn’t), and then 75 (which is in the histogram – but is an anomaly). Here, in order, are the three outputs from **dbms_xplan.display()** with CPU costing disabled:

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   400 |  1200 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   400 |  1200 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=81)
```

Since we have 10,000 rows in the table, and our frequency histogram said the value 81 would appear in 40 rows out of every 1,000 the estimated cardinality is 400.

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=85)
```

Since 85 does not appear in the histogram the optimizer multiplies the *density* (0.02) by the number of rows in the table to get 200 as the estimated cardinality.

```
----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=75)
```

Since we have 10,000 rows in the table, and our frequency histogram said the value 75 would appear in 2 rows out of every 1,000, the estimated cardinality ought to be 20 – except there’s a special limiting case: the cardinality cannot be less than the “missing values” cardinality, so the optimizer reports 200.
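Pulling the three estimates together, here is a small Python model of the 10.2.0.3 arithmetic shown above (my illustration of the rules, not Oracle’s actual code):

```python
num_rows  = 10000
density   = 0.02
histogram = {75: 2, 81: 40, 88: 200, 91: 3, 99: 755}  # rows per 1,000
total_buckets = sum(histogram.values())               # 1,000

def estimated_cardinality(value):
    missing = density * num_rows                      # 200 rows
    if value in histogram:
        est = num_rows * histogram[value] / total_buckets
        return max(est, missing)                      # the limiting case
    return missing

print(estimated_cardinality(81))   # 400.0 - straight from the histogram
print(estimated_cardinality(85))   # 200.0 - density * num_rows
print(estimated_cardinality(75))   # 200.0 - 20 raised to the limit
```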

Of course, **as I pointed out in the earlier article**, things change in 11g and 10.2.0.4 – the optimizer seems to ignore the **density**. So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 85. Then the *“linear decay”* rule applies for values from 100 to 124 and 74 down to 50. So when you upgrade, instead of setting the **density** to describe how to treat missing values, you may need to add an extra value in the histogram reporting double the cardinality you want attributed to missing values – and the fake value may have to be chosen carefully to work around the effects of the linear decay algorithm.
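As a rough Python model of that newer (11g / 10.2.0.4) behaviour for values inside the known range – again an approximation of the rules described above, not Oracle internals:

```python
num_rows  = 10000
histogram = {75: 2, 81: 40, 88: 200, 91: 3, 99: 755}  # rows per 1,000
total_buckets = sum(histogram.values())

def estimated_cardinality_11g(value):
    # the stored density is ignored
    if value in histogram:
        return num_rows * histogram[value] / total_buckets
    # a missing in-range value gets half the cardinality
    # of the least popular value in the histogram
    return num_rows * min(histogram.values()) / total_buckets / 2

print(estimated_cardinality_11g(75))   # 20.0
print(estimated_cardinality_11g(85))   # 10.0
```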
Jonathan,

very nice demonstration. I wasn’t aware of the “missing values limiting case” in the pre-10.2.0.4 and 11.1.0.6 behaviour.

“So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 81”

Shouldn’t this be 85 for the “half the minimum known” case, since 81 is part of the histogram?

Regards,

Randolf

Comment by Randolf Geist — May 28, 2009 @ 9:48 pm BST May 28,2009 |

Randolf,

Thanks for pointing that out; now fixed.

Comment by Jonathan Lewis — May 28, 2009 @ 10:16 pm BST May 28,2009 |

I don’t know exactly what the “linear decay” rule means.

Could you explain the “linear decay” rule in a little more detail?

Comment by adenkang — May 29, 2009 @ 3:21 pm BST May 29,2009 |

Adenkang,

There is a brief reference describing the principle in the previous post, and a description (with diagram) in my book.

As an example: assume you have a value that ranges between 0 and 100 (with limits stored by Oracle as the low_value and high_value for the column). If you run a query referencing a value outside that range Oracle 10g has a model that assumes you expect to get some data from the query, but the volume you get decreases as you move further from the known range.

The model uses the existing range as the centre point, and assumes your data covers three times that range – so if you query for values less than -100, or values above 200, the optimizer assumes there is no data; but if you query for a value between -100 and 0, or between +100 and +200 the optimizer uses a straight-line graph from the basic selectivity down to zero to model those ranges.
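As a rough sketch of that arithmetic (my approximation of the model described above, not Oracle’s actual code), with a hypothetical base selectivity of 0.01:

```python
def decayed_selectivity(value, low, high, base_selectivity):
    """Straight-line decay from base selectivity at the range edge
    down to zero one full range-width beyond the edge."""
    width = high - low
    if low <= value <= high:
        return base_selectivity
    distance = (low - value) if value < low else (value - high)
    return max(0.0, base_selectivity * (1 - distance / width))

# Column known to range from 0 to 100:
print(decayed_selectivity(50,  0, 100, 0.01))   # 0.01  (inside the range)
print(decayed_selectivity(150, 0, 100, 0.01))   # 0.005 (half way to the cutoff)
print(decayed_selectivity(250, 0, 100, 0.01))   # 0.0   (beyond the decay zone)
```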

Comment by Jonathan Lewis — May 30, 2009 @ 10:34 am BST May 30,2009 |

Hi Jonathan,

What about the case when a literal is chosen outside the range but with a not equal.

I have 10.2.0.4 database with Siebel

7 million rows and a column with ‘Y’ for every row (i checked)

Strange enough histograms are created by gather_stats_job.

Dba_tab_col_statistics reports num_distinct = 1 and density 5e-8!!!

The query which joins about 16 tables has

the following condition:

flag != ‘N’ and flag=:bind_variable (which contains value Y)

The optimizer chooses and index skip scan on an index which contains the flag column.

My guess is that because of the != ‘N’ which is outside the histograms the optimizer falls back on the density.

Is my guess right?

I browsed your book on histograms and not equal clauses but could not find it.

regards Hans-Peter

Comment by Hans-Peter — August 11, 2009 @ 6:04 am BST Aug 11,2009 |

Hans-Peter,

I’m slightly surprised that I failed to mention “!=” and histograms in the book, but I guess I had to miss a few things out.

Given you’re looking at 10.2.0.4 (which has made some changes, as the previous article pointed out) this might be an area that’s changed anyway.

I always worry that “one” is a very special value for all sorts of reasons – I think Wolfgang Breitling has found various oddities with histograms and one distinct value in the past.

I may have some notes somewhere – I’ll take a look when I have a little time.

Comment by Jonathan Lewis — August 11, 2009 @ 6:18 am BST Aug 11,2009 |

[...] path randomly (and catastrophically) because of a histogram collection problem, and suggested a short script to create a fixed representative histogram that would stop that problem re-appearing. Comments [...]

Pingback by Block size – again « Oracle Scratchpad — January 1, 2010 @ 3:01 pm BST Jan 1,2010 |

Hi Jonathan

It seems that you can force the old (pre 10.2.0.4) behavior by just getting and setting the stats for the column, without changing anything. The only effect this appears to have is to set user_stats=’YES’ on dba_tab_columns. The 10053 trace does not give any clues as to why this works, it just seems to revert to using the column density again for missing histogram values, when user_stats=’YES’.

I don’t know whether this is a bug or the intended behavior, but it gives you a finer grain of control by allowing you to choose the columns for which you want the old or the new behavior, without having to “alter session set “_fix_control”=’5483301:off’”.

The following test case demonstrates:

Here’s the output:

Regards,

Mike Barratt

Comment by Mike Barratt — January 7, 2010 @ 12:28 pm BST Jan 7,2010 |

Mike,

Very interesting. I wonder if it’s deliberate or an accident. Certainly worth checking in 11g. [Update: just ran your example in 11.1.0.6, and the effect does not appear there.]

This could be the answer to the concern I raised in the last paragraph about the increased complexity of imposing a selectivity for missing values – the old method will still work.

Comment by Jonathan Lewis — January 7, 2010 @ 9:55 pm BST Jan 7,2010 |

[...] posted an example of creating a frequency histogram a few months ago – and it doesn’t take much to turn this into an example for the [...]

Pingback by Fake Histograms « Oracle Scratchpad — March 23, 2010 @ 8:49 pm BST Mar 23,2010 |

[...] Related reading: how to create an “artificial” frequency histogram. [...]

Pingback by Frequency Histogram 3 « Oracle Scratchpad — September 24, 2010 @ 6:47 pm BST Sep 24,2010 |

[...] solution, of course, is to write code to create a histogram that represents the situation as you want Oracle to see [...]

Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 |

[...] to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of [...]

Pingback by Local Indexes – 2 « Oracle Scratchpad — November 14, 2010 @ 5:42 pm BST Nov 14,2010 |

[...] huge number of closed orders; we’d also have to create a histogram on the column (possibly by writing a simple script) so that Oracle could recognise the skewed data [...]

Pingback by FBI oddities « Oracle Scratchpad — December 16, 2010 @ 6:18 pm BST Dec 16,2010 |

[...] the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it [...]

Pingback by Histograms for strongly skewed columns « Oracle Diagnostician — May 23, 2012 @ 4:35 pm BST May 23,2012 |

Hi Jonathan.

There is something strange with density with 11.2.0.2 ..

Table stats:

#Rows: 1000000 #Blks: 47414 AvgRowLen: 330.00 ChainCnt: 0.00

I have a column with only one distinct value.

AvgLen: 3 NDV: 1 Nulls: 0 Density: 1.000000

Now, after i’ve updated 1000 rows to null, the density dropped down .

AvgLen: 3 NDV: 1 Nulls: 1000 Density: 0.000001

Histogram: Freq #Bkts: 1 UncompBkts: 5452 EndPtVals: 1

any thoughts about this?

Comment by Sagi Zvi — October 27, 2012 @ 5:28 pm BST Oct 27,2012 |

Sagi,

First you need to work out what you did to get these results – then you can explain how I could reproduce the example.

As it is you haven’t told me how you collected the statistics, or what the table looks like, or how it’s indexed, or what query you ran. Moreover, you haven’t given me any assurance that the pieces of the trace file are equivalent, so you may have omitted lines from the first sample that you’ve shown in the second.

As far as “strange” is concerned – I think it’s possible that you collected stats the first time in a way that got 100% sample and no histogram, and in the second case generated a histogram off a small sample – although I would have expected 5e-7 for the second density. However, I don’t see the “NewDensity” figures, so I’m not sure you haven’t been running with optimizer_features_enable set to a lower value.

Comment by Jonathan Lewis — December 13, 2012 @ 3:01 pm BST Dec 13,2012 |

Thanks for the article!

How to interpret endpoint_value for non-numeric columns?

I have a column with frequency histogram on varchar column and this puzzles me.

E.g. 4.4134523297546E35 has to be translated into a 1-2 character string somehow.

Regards

Comment by dautkhanov — October 2, 2013 @ 7:19 pm BST Oct 2,2013 |

http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/

Comment by Jonathan Lewis — October 2, 2013 @ 7:28 pm BST Oct 2,2013 |