I got an email a few days ago asking me how to create a *“height-balanced”* histogram using the *dbms_stats.set_column_stats()* procedure.

I posted *an example of creating a frequency histogram* a few months ago – and it doesn’t take much to turn this into an example for the height-balanced type; the key change appears in three critical lines.

In the original I had the following description for the histogram:

n_array := dbms_stats.numarray(75, 81, 88, 91, 99);
srec.bkvals := dbms_stats.numarray( 2, 40, 200, 3, 755);
srec.epc := 5;

This sets up the data for a frequency histogram with five distinct values (those shown in the *n_array*) with the (relative) frequencies shown in the *srec.bkvals* array.

Here’s the change I have to make to define a height-balanced histogram with 10 buckets (which means 11 “end-points”).

n_array := dbms_stats.numarray(20, 60, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 4000, 10000);
srec.bkvals := null;
srec.epc := 11;

These figures say that the column has a low value of 20 and a high value of 10,000; and with 10 buckets each bucket represents one tenth of the available data (*dba_tables.num_rows* – *dba_tab_columns.num_nulls*). Note that the value 3,000 appears six times in the *n_array*; this is an example of a popular value (the repetitions will be “compressed” out of the data dictionary when you query *dba_tab_histograms*).

Here’s a complete example so that you don’t have to cut-n-paste two articles into one. I haven’t changed the *distcnt*, *density* and *nullcnt* from the values I used for the frequency histogram – but they’re not the point of interest in this case. I’ve also added a couple of queries at the end of the example to show you what gets into the data dictionary, and a couple of queries against the table so you can see the impact of the histogram on the cardinality calculations.

create table t1
as
select
        rownum n1
from    all_objects
where   rownum <= 10000
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

declare
        m_distcnt       number;
        m_density       number;
        m_nullcnt       number;
        srec            dbms_stats.statrec;
        m_avgclen       number;
        n_array         dbms_stats.numarray;
begin
        m_distcnt := 100;
        m_density := 0.02;
        m_nullcnt := 0;
        m_avgclen := 3;

        n_array := dbms_stats.numarray(20, 60, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 4000, 10000);
        srec.bkvals := null;
        srec.epc := 11;

        dbms_stats.prepare_column_values(srec, n_array);

        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 't1',
                colname => 'n1',
                distcnt => m_distcnt,
                density => m_density,
                nullcnt => m_nullcnt,
                srec    => srec,
                avgclen => m_avgclen
        );
end;
/

select
        num_distinct, low_value, high_value, density,
        num_nulls, num_buckets, histogram
from
        user_tab_columns
where
        table_name = 'T1'
and     column_name = 'N1'
;

select
        endpoint_value, endpoint_number,
        lag(endpoint_number,1) over(order by endpoint_number) prev_number
from
        user_tab_histograms
where
        table_name = 'T1'
and     column_name = 'N1'
order by
        endpoint_value
;

set autotrace traceonly explain

select * from t1 where n1 = 3000;
select * from t1 where n1 between 7500 and 8000;

set autotrace off

And here are the results from the queries at the end of the example, when run against 10.2.0.3.

NUM_DISTINCT LOW_VALUE                HIGH_VALUE                  DENSITY  NUM_NULLS NUM_BUCKETS HISTOGRAM
------------ ------------------------ ------------------------ ---------- ---------- ----------- ---------------
         100 C115                     C302                            .02          0          10 HEIGHT BALANCED

1 row selected.

ENDPOINT_VALUE ENDPOINT_NUMBER PREV_NUMBER
-------------- --------------- -----------
            20               0
            60               1           0
          2999               2           1
          3000               8           2
          4000               9           8
         10000              10           9

6 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  6000 | 18000 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |  6000 | 18000 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=3000)

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    83 |   249 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |    83 |   249 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1">=7500 AND "N1"<=8000)

As you can see, all the statistics I supplied have been written into the database, and the histogram figures show the “storage compression” effect on the popular value 3,000. (Ignore the “prev_number” column in that output; it’s only relevant for frequency histograms.)
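For anyone who wants to decode that output programmatically: the number of buckets a value accounts for is simply the gap between its endpoint_number and the endpoint_number of the preceding row. Here’s a minimal sketch in Python, using the endpoint figures from the query output above:

```python
# Compressed histogram rows copied from the user_tab_histograms
# output above, as (endpoint_value, endpoint_number) pairs.
rows = [(20, 0), (60, 1), (2999, 2), (3000, 8), (4000, 9), (10000, 10)]

def bucket_counts(rows):
    """Undo the "storage compression": each endpoint value accounts
    for (its endpoint_number - previous endpoint_number) buckets."""
    counts = {}
    prev_ep = None
    for value, ep in rows:
        if prev_ep is not None:
            counts[value] = ep - prev_ep
        prev_ep = ep        # endpoint 0 is the low value, not a bucket
    return counts

print(bucket_counts(rows))
# {60: 1, 2999: 1, 3000: 6, 4000: 1, 10000: 1} - 3000 is the popular value
```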

The first query (n1 = 3000) shows that the optimizer has picked up the 6 buckets that end at 3,000 to get a cardinality of 6,000 of the 10,000 rows; and the range predicate (n1 between 7500 and 8000) covers a range of 500 within the last bucket, which spans a range of 6,000 (viz: 4,000 – 10,000), giving a cardinality of 1,000 * 500/6,000 = 83.3.
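Both estimates are easy to check with a few lines of arithmetic; here’s a quick Python sketch of the same calculations, with the numbers taken straight from the example:

```python
num_rows = 10000
num_buckets = 10
rows_per_bucket = num_rows / num_buckets          # 1,000 rows per bucket

# n1 = 3000: six buckets end at the popular value 3,000, and each
# of those buckets is assumed to be full of 3,000s.
card_popular = 6 * rows_per_bucket
print(card_popular)                               # 6000.0

# n1 between 7500 and 8000: a 500-wide slice of the last bucket,
# which spans 4,000 to 10,000 (a range of 6,000).
card_range = rows_per_bucket * (8000 - 7500) / (10000 - 4000)
print(round(card_range, 1))                       # 83.3
```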

Dear Sir,

How does the CBO calculate the cardinality of 6,000 for n1 = 3000?

Could you please explain?

Many thanks

Comment by Henish — March 24, 2010 @ 7:12 pm BST Mar 24,2010 |

Henish,

There are 6 buckets which end with 3,000 – and each bucket represents 1,000 rows.

It would appear that (in this version) Oracle is simply assuming that each bucket that references a popular value (i.e. a value that appears as an end-point more than once) is full of that value. 6 * 1,000 = 6,000.

Comment by Jonathan Lewis — March 26, 2010 @ 9:32 pm BST Mar 26,2010 |

This approach is also published by Wolfgang Breitling in the paper Active Statistics:

http://www.centrexcc.com/Active%20Statistics.ppt.pdf

Pages 39, 42-48.

Comment by 木匠Charlie — March 25, 2010 @ 10:19 pm BST Mar 25,2010 |

Charlie,

Thanks for the note – it made me realise that I didn’t have Wolfgang on my blogroll.

Comment by Jonathan Lewis — March 27, 2010 @ 8:23 am BST Mar 27,2010 |

[...] 11-How to create fake height balanced histograms? Jonathan Lewis-Fake Histograms [...]

Pingback by 19/03 /2010 – 26/03/2010 « Coskan’s Approach to Oracle — May 3, 2010 @ 2:40 am BST May 3,2010 |

Hi Jonathan,

I found another way, but I am not sure if it is supported. I took an export of the stats for the table in question and modified it by inserting a new set of records for the next month, updating the column key values and retaining the same frequencies. I know it would be difficult, but the issues are the same as far as what values I will be getting in the future.

Thanks

Nrsun

Comment by Nrsun — December 12, 2010 @ 1:36 pm BST Dec 12,2010 |

Obviously, I imported it back after the data inserts and modifications.

Comment by Nrsun — December 12, 2010 @ 2:21 pm BST Dec 12,2010 |

Jonathan,

Thanks for all this advanced stuff, but it seems I am stuck at step 1 itself … the basics:

i. Can you not tell Oracle directly (without hacking or setting the stats) that I’d like a FH or a HBH on a column? It looks like you can’t – but why not?

ii. And also, what (logic) makes Oracle decide whether to create a FH or an HBH when, say, I run this:

exec dbms_stats.gather_table_stats('','MY_TEST',estimate_percent=>100, method_opt=>'FOR ALL COLUMNS SIZE [AUTO|10|254]');

Thanks in advance. BTW, I am a little too late in ordering your book CBO Fundamentals – probably about 6 years late – I just ordered it last week :-) and even though I find some of the stuff a bit old now, it is still so interesting that I am just loving it – it’s great for building a solid base!

Comment by PD — May 13, 2011 @ 11:21 am BST May 13,2011 |

Jonathan, I got my answer, thanks – it was hidden in the documentation in a line which I was overlooking every time I read through it (in a rush!). Thanks anyway.

Comment by PD — May 31, 2011 @ 5:18 pm BST May 31,2011 |

PD,

Sorry about the delay in the reply – presumably the bit you missed was that you can’t tell Oracle whether to build a frequency histogram or a height-balanced one: it depends on the number of distinct values, the number of buckets you ask for, and the luck (good or bad) of the sample that you use.
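As a rough sketch of that rule of thumb (my simplification – it ignores the sampling luck, which can tip a borderline column either way):

```python
def histogram_type(num_distinct, requested_buckets):
    """Simplified sketch: if every distinct value seen by the gather
    fits in its own bucket you get a frequency histogram; otherwise
    Oracle builds a height-balanced one. Sampling effects ignored."""
    if num_distinct <= requested_buckets:
        return "FREQUENCY"
    return "HEIGHT BALANCED"

print(histogram_type(5, 254))     # FREQUENCY
print(histogram_type(100, 10))    # HEIGHT BALANCED
```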

Comment by Jonathan Lewis — June 2, 2011 @ 8:44 pm BST Jun 2,2011 |

In the prior article you added the postscript that, starting in 10.2.0.4, the CBO ignores the density if a frequency histogram is available. Is the same true for height-balanced histograms?

Comment by Jason Bucata — October 20, 2011 @ 10:34 pm BST Oct 20,2011 |

Jason,

Yes. Historically (and as described in Cost Based Oracle – Fundamentals) Oracle up to 10.2.0.3 (inclusive) did some complicated arithmetic to generate the histogram and the density.

In 10.2.0.4, the density is ignored even for height-balanced histograms (at least, for ‘column = {constant}’) and Oracle uses a formula of the form:

(total number of rows – number of rows for popular values) / (total number of distinct values – number of popular values).

There is a little oddity here, because the number of rows attributed to a popular value is the number of rows per bucket multiplied by the (notional) number of endpoints for the value.
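To make that concrete, here’s a small Python sketch that plugs the figures from the example at the top of this post into the formula – treating 3,000 as the one popular value with six (notional) endpoints, and using the faked distcnt of 100:

```python
num_rows = 10000          # rows in T1
num_distinct = 100        # the (faked) distcnt from the example
rows_per_bucket = 1000    # 10 buckets over 10,000 rows

# 3,000 is the only popular value; it owns 6 of the 11 endpoints.
popular_endpoints = {3000: 6}
rows_for_popular = sum(n * rows_per_bucket for n in popular_endpoints.values())

# (total rows - rows for popular values) /
# (distinct values - number of popular values)
card = (num_rows - rows_for_popular) / (num_distinct - len(popular_endpoints))
print(round(card, 2))     # 40.4 - estimate for a non-popular "column = constant"
```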

I’ll have to write up an example some day – but I think Alberto Dell’Era has a big write-up in his investigations (though he may call it the 11g New Density, the thing I’ve described applies to 10.2.0.5).

Update Dec 2011, in response to the comment below: the formula above calculates the cardinality (the number of rows returned by a query of the form “column = constant”), not the selectivity.

Comment by Jonathan Lewis — November 13, 2011 @ 3:04 pm BST Nov 13,2011 |

So does that mean, then, that if I’m faking up a histogram, either frequency or height-balanced, it’s OK to leave density NULL? That DBMS_STATS will either calculate a decent value, or even if it doesn’t it won’t matter since the CBO won’t use it?

Comment by Jason Bucata — November 14, 2011 @ 3:23 pm BST Nov 14,2011 |

Jason,

I wouldn’t be 100% sure of that. It’s possible, for example, that there are cases where the optimizer code hasn’t been made completely consistent. I would still prefer to set a density to “the figure I want Oracle to believe” even if I thought it should be ignored.

Comment by Jonathan Lewis — November 22, 2011 @ 8:21 am BST Nov 22,2011

Hi Jonathan

I am using 11.2, and from the formula (total number of rows – number of rows for popular values) / (total number of distinct values – number of popular values) I tried to calculate the selectivity for my query, but I am not getting the correct value as depicted in the 10053 trace. Below is my 10053 trace output:

After putting the values into the formula, the selectivity comes out to be (20297719 – (20297719/254)*80) / (20297719 – 1) = 0.6850394, though the selectivity depicted in the trace is

Please tell me where I am wrong in my calculation.

Thanks

Amit

Comment by Amit — December 30, 2011 @ 6:53 pm BST Dec 30,2011 |

Amit,

I have corrected the comment above that you are using as the source of the formula. When I wrote it, I was thinking about cardinality rather than selectivity (and the selectivity for ‘column = constant’ is the value that the optimizer reports as the NewDensity).

But even when you know what the formula is really supposed to give you, you still get a strange answer – and that’s because there is something very wrong with your statistics. Note that the table has 20,297,719 rows and the column has 20,297,719 distinct values. Despite this, the histogram claims that there are 80 buckets out of 254 that all represent the same value … which means the NDV ought to be roughly 2/3 of the size that it is.

I note, then, that your index is called UI_Col1 – which suggests you have a naming convention that prefixes unique indexes with the letters UI, and that means something even more bizarre is happening with the stats. Given that the path is a Range Scan, of course, the ix_sel in the Access path calculation doesn’t have to have anything to do with the NewDensity from the table stats. On the other hand, the value of 0.31 (combined with the 80 buckets of popular values) suggests that your query may be targeting a range that includes the value that Oracle thinks is the popular value (viz, a little less than one third of the rows … 80/254 = 0.3149).

At this point I think I’m going to guess that Col1 is a character-based column of more than 32 bytes and that a large number of rows have values starting with the same 32 bytes. I say this because it’s one way to reproduce your anomaly – the code to calculate the NDV can count the actual distinct values, but the code to generate the histogram limits itself (with some complications) to the first 32 bytes.

I’ll try to write up a short note to demonstrate the NewDensity calculation in a couple of days’ time.
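The NDV-versus-histogram mismatch is easy to mimic outside the database. A hypothetical sketch (the data is invented, but it reproduces the shape of the anomaly):

```python
# Hypothetical data: 1,000 distinct values sharing the same first
# 32 characters, plus a handful of short distinct values.
prefix = "X" * 32
values = ([prefix + str(i) for i in range(1000)] +
          ["short_%d" % i for i in range(10)])

ndv_full = len(set(values))               # what the NDV code can count
ndv_32   = len({v[:32] for v in values})  # what the histogram code sees

print(ndv_full)   # 1010 - the column really has this many distinct values
print(ndv_32)     # 11   - the histogram sees one huge "popular" value
```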

Comment by Jonathan Lewis — December 31, 2011 @ 3:34 pm BST Dec 31,2011

Thanks Jonathan for the response. My actual problem is the same as the one you pointed out. Because the first 32 bytes of the values are the same, my histograms are reporting skewed data and calculating incorrect cardinality. I have posted this problem in a separate forum; here I just wanted to understand the selectivity/cardinality calculation, so I did not mention the issue – but you spotted it :)

My query is doing a FTS because it is getting the cardinality as 1/3 of the records in the table, and an Index Range Scan for other values in non-popular buckets. But I am trying to understand:

1. Why is it reporting NewDensity: 0.000000?

2. The selectivity the optimizer is trying to calculate is (80 – 0.5)/254 = 0.312992, which gives a cardinality of 20297719 * 0.312992 = 6353026.22, but I am not sure why it is deducting 0.5 from the popular bucket count. I got the same results in other 10053 trace files for this query.

3. And the actual problem: should there be no histogram on this column, or how should one deal with this kind of histogram issue? The predicate used in the query (with the skewed histogram) is not on a unique column; the access path shown is for a Range Scan on the unique index UI_COL1, which the optimizer is trying to evaluate for other strings not having 32 chars.

Comment by Amit — January 2, 2012 @ 7:35 pm BST Jan 2,2012 |

Amit,

*1. Why is it reporting NewDensity: 0.000000?*

You’ve got more than 2M distinct values according to the stats; what’s 1/2,000,000 ?

*2. The selectivity the optimizer is trying to calculate is (80 – 0.5)/254 = 0.312992 … I got the same results in other 10053 trace files for this query.*

I made the mistake of assuming the column stats and the index access path information you gave me were somehow related – I suppose I should have guessed that an index named UI_Col1 didn’t contain a column named Col2, but I rather discounted that possibility as a typo because I didn’t think your columns were really called Col1 and Col2 anyway. Why would the number of popular buckets for one column have anything to do with the index selectivity of an index which doesn’t include that column?

*3. And the actual problem: should there be no histogram on this column, or how should one deal with this kind of histogram issue? …*

I’m going to guess that there are two predicates (at least), one on Col1 and one on Col2. But how am I supposed to work out an answer to the question when you haven’t given me any clues about what the predicates look like, or what the index looks like?

Comment by Jonathan Lewis — January 3, 2012 @ 8:01 pm BST Jan 3,2012 |

Hi Jonathan,

My apologies for providing incorrect and incomplete information. The naming convention of the index did not use this column name, and I did not notice that the index used in the query actually contains this column. You are right: it is a unique index, and it contains three columns (the leading column in the index is the one with the histogram issue). I verified the information from the index and columns, and the data in the column is actually not skewed, because the distinct_keys in the index and the num_distinct in the column are approximately the same as below; the other columns have only 2 and 0 distinct values.

As you mentioned, because the first 32 chars in the column values are the same, it was reporting incorrect histograms. I decided to delete the histogram on this column, and now the query is using the index scan in place of the FTS.

Thanks for your inputs.

Amit

Comment by Amit — January 20, 2012 @ 4:06 pm BST Jan 20,2012 |

[...] recent comment on a note I wrote some time ago about faking histograms asked about the calculations of selectivity in the latest versions of Oracle. As I read the [...]

Pingback by NewDensity « Oracle Scratchpad — January 3, 2012 @ 5:56 pm BST Jan 3,2012 |

[...] Yes – here’s one example. [...]

Pingback by Usage Stats « Oracle Scratchpad — January 24, 2013 @ 7:00 pm BST Jan 24,2013 |

[…] though, don’t forget that I also pointed out that sometimes you may still need to create “fake” histograms to get the best possible […]

Pingback by Webinar questions | Oracle Scratchpad — June 14, 2013 @ 4:41 pm BST Jun 14,2013 |

[…] is to use method_opt => ‘for all columns size 1′ (i.e. no histograms) and then run scripts to create the histograms I want. This means that after any stats collection I need to run code that […]

Pingback by Delete Histogram | Oracle Scratchpad — May 1, 2014 @ 1:27 pm BST May 1,2014 |