Oracle Scratchpad

March 23, 2010

Fake Histograms

Filed under: CBO, Histograms, Oracle, Statistics — Jonathan Lewis @ 8:49 pm GMT

I got an email a few days ago asking me how to create a “height-balanced” histogram using the dbms_stats.set_column_stats() procedure.

I posted an example of creating a frequency histogram a few months ago – and it doesn’t take much to turn this into an example for the height-balanced type; the key change appears in three critical lines.

In the original I had the following description for the histogram:

	n_array		:= dbms_stats.numarray(75, 81,  88, 91,   99);
	srec.bkvals	:= dbms_stats.numarray( 2, 40, 200,  3,  755);
	srec.epc	:= 5;

This sets up the data for a frequency histogram with five distinct values (those shown in the n_array) with the (relative) frequencies shown in the srec.bkvals array.

Here’s the change I have to make to define a height-balanced histogram with 10 buckets (which often means 11 end-points).

	n_array		:= dbms_stats.numarray(20, 60, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 4000,  10000);
	srec.bkvals	:= null;
	srec.epc	:= 11;

These figures say that the column has a low value of 20 and a high value of 10,000; and with 10 buckets each bucket represents one tenth of the available data (dba_tables.num_rows – dba_tab_columns.num_nulls). Note that the value 3,000 appears six times in the n_array; this is an example of a popular value (the repetitions will be “compressed” out of the data dictionary when you query dba_tab_histograms).

Here’s a complete example so that you don’t have to cut-n-paste two articles into one. I haven’t changed the distcnt, density and nullcnt from the values I used for the frequency histogram – but they’re not the point of interest in this case. I’ve also added a couple of queries at the end of the example to show you what gets into the data dictionary, and a couple of queries against the table so you can see the impact of the histogram on the cardinality calculations.


rem
rem     Script:         c_height_hist.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2010
rem

create table t1 as
select
        rownum  n1
from    all_objects
where
        rownum <= 10000  --> comment to avoid wordpress format issue
;

declare

        m_distcnt               number;
        m_density               number;
        m_nullcnt               number;
        srec                    dbms_stats.statrec;
        m_avgclen               number;

        n_array                 dbms_stats.numarray;

begin

        m_distcnt       := 100;
        m_density       := 0.02;
        m_nullcnt       := 0;
        m_avgclen       := 3;

        n_array         := dbms_stats.numarray(20, 60, 2000, 3000, 3000, 3000, 3000, 3000, 3000, 4000,  10000);
        srec.bkvals     := null;
        srec.epc        := 11;

        dbms_stats.prepare_column_values(srec, n_array);

        dbms_stats.set_column_stats(
                ownname         => user,
                tabname         => 't1',
                colname         => 'n1',
                distcnt         => m_distcnt,
                density         => m_density,
                nullcnt         => m_nullcnt,
                srec            => srec,
                avgclen         => m_avgclen
        ); 

end;
/

select
	num_distinct,
	low_value,
	high_value,
	density,
	num_nulls,
	num_buckets,
	histogram
from
	user_tab_columns
where
	table_name = 'T1'
and	column_name = 'N1'
;

select
	endpoint_value,
	endpoint_number,
	lag(endpoint_number,1) over(
		order by endpoint_number
	) prev_number
from
	user_tab_histograms
where
	table_name = 'T1'
and	column_name = 'N1'
order by
	endpoint_value
;

set autotrace traceonly explain

select * from t1 where n1 = 3000;
select * from t1 where n1 between 7500 and 8000;

set autotrace off

And here are the results from the queries at the end of the example, when run against 10.2.0.3.


NUM_DISTINCT LOW_VALUE                HIGH_VALUE                  DENSITY  NUM_NULLS NUM_BUCKETS HISTOGRAM
------------ ------------------------ ------------------------ ---------- ---------- ----------- ---------------
         100 C115                     C302                            .02          0          10 HEIGHT BALANCED

1 row selected.

ENDPOINT_VALUE ENDPOINT_NUMBER PREV_NUMBER
-------------- --------------- -----------
            20               0
            60               1           0
          2000               2           1
          3000               8           2
          4000               9           8
         10000              10           9

6 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |  6000 | 18000 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |  6000 | 18000 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=3000)

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    83 |   249 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |    83 |   249 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1">=7500 AND "N1"<=8000)

As you can see, all the statistics I supplied have been written into the database, and the histogram figures show the “storage compression” effect on the popular 3,000. (Ignore the “prev_number” column in that output; it’s only relevant for frequency histograms.)

The first query (n1 = 3000) shows that the optimizer has picked up the 6 buckets that end at 3,000 to get a cardinality of 6,000 of the 10,000 rows. The range predicate (n1 between 7500 and 8000) covers a range of 500 within the last bucket, which spans a range of 6,000 (viz: 4,000 – 10,000), so the optimizer gets a cardinality of 1,000 * 500/6,000 = 83.3.
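If you want to cross-check the bucket arithmetic, here’s a minimal sketch of a query (not part of the original script – the 10,000 rows and 10 buckets are hard-coded from this example) that works out how many buckets each endpoint value covers and scales up by the rows-per-bucket figure:

select
	endpoint_value,
	endpoint_number
		- lag(endpoint_number,1) over(order by endpoint_number)	buckets,
	(
		endpoint_number
		- lag(endpoint_number,1) over(order by endpoint_number)
	) * 10000/10							estimated_rows
from
	user_tab_histograms
where
	table_name = 'T1'
and	column_name = 'N1'
order by
	endpoint_number
;

The row for 3,000 should show 6 buckets and an estimated 6,000 rows; the first endpoint (endpoint_number 0, the low value) has no predecessor, so its figures are null.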

[Further reading on Histograms]

24 Comments

  1. Dear Sir,

    How does the CBO calculate the cardinality of 6,000 for n1 = 3000?

    Could you please explain?

    Many thanks

    Comment by Henish — March 24, 2010 @ 7:12 pm GMT

    • Henish,

      There are 6 buckets which end with 3,000 – and each bucket represents 1,000 rows.
      It would appear that (in this version) Oracle is simply assuming that each bucket that references a popular value (i.e. a value that appears as an end-point more than once) is full of that value. 6 * 1,000 = 6,000.

      Comment by Jonathan Lewis — March 26, 2010 @ 9:32 pm GMT

  2. This approach is also published by Wolfgang Breitling in the paper Active Statistics (pages 39 and 42-48):

    Active%20Statistics.ppt.pdf

    Comment by 木匠Charlie — March 25, 2010 @ 10:19 pm GMT

  3. […] 11-How to create fake height balanced histograms? Jonathan Lewis-Fake Histograms […]

    Pingback by 19/03 /2010 – 26/03/2010 « Coskan’s Approach to Oracle — May 3, 2010 @ 2:40 am BST

  4. Hi Jonathan,

    I found another way, but I am not sure if it is supported. I took an export of the stats for the table in question and modified it by inserting a new set of records for the next month, updating the column key values and retaining the same frequencies. I know it would be difficult, but the issues are the same as far as what values I will be getting in the future.

    Thanks
    Nrsun

    Comment by Nrsun — December 12, 2010 @ 1:36 pm GMT

    • Obviously, I imported it back after data inserts and modification.

      Comment by Nrsun — December 12, 2010 @ 2:21 pm GMT

  5. Jonathan,

    Thanks for all this advanced stuff, but it seems I am stuck at step 1 itself … the basics:

    i. Can you not tell Oracle directly (without hacking or setting the stats) that I’d like a FH or a HBH on a column? It looks like you can’t – so why can’t you?
    ii. And also, what (logic) makes Oracle decide whether to create a FH or an HBH when, let’s say, I run this:

    exec dbms_stats.gather_table_stats('','MY_TEST',estimate_percent=>100, method_opt=>'FOR ALL COLUMNS SIZE [AUTO|10|254]');

    Thanks in advance. BTW, I am a little too late in ordering your book CBO Fundamentals – probably about 6 years late – I just ordered it last week :-) and even though I find some of the stuff a bit old now, it is still so interesting that I am just loving it – it’s great for building a solid base!

    Comment by PD — May 13, 2011 @ 11:21 am BST

      Jonathan, I got my answer, thanks – it was hidden in the documentation in a line which I was overlooking every time I read through it (in a rush!). Thanks anyway.

      Comment by PD — May 31, 2011 @ 5:18 pm BST

      • PD,

        Sorry about the delay in the reply – presumably the bit you missed was that you can’t tell Oracle whether to create a frequency histogram or a height-balanced one; it depends on the number of distinct values, the number of buckets you ask for, and the luck (good or bad) of the sample that you use.
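        Here’s a quick sketch of the effect (the my_test table and its n1 column are invented for illustration, and I’m assuming a 100% sample): with 100 distinct values in the column, a request for 254 buckets can be met by a frequency histogram, but a request for 10 buckets cannot:

        create table my_test as
        select mod(rownum,100) n1 from all_objects where rownum <= 10000;

        exec dbms_stats.gather_table_stats(user,'MY_TEST',estimate_percent=>100,method_opt=>'for all columns size 254')

        select histogram, num_buckets from user_tab_columns
        where table_name = 'MY_TEST' and column_name = 'N1';    -- expect FREQUENCY with 100 buckets

        exec dbms_stats.gather_table_stats(user,'MY_TEST',estimate_percent=>100,method_opt=>'for all columns size 10')

        select histogram, num_buckets from user_tab_columns
        where table_name = 'MY_TEST' and column_name = 'N1';    -- expect HEIGHT BALANCED with 10 buckets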

        Comment by Jonathan Lewis — June 2, 2011 @ 8:44 pm BST

  6. In the prior article you added the postscript that, starting in 10.2.0.4, the CBO ignores the density if a frequency histogram is available. Is the same true for height-balanced histograms?

    Comment by Jason Bucata — October 20, 2011 @ 10:34 pm BST

    • Jason,

      Yes. Historically (and as described in Cost Based Oracle – Fundamentals) Oracle, up to and including 10.2.0.3, does some complicated arithmetic to generate the histogram and the density.
      In 10.2.0.4, the density is ignored even for height-balanced histograms (at least, for ‘column = {constant}’) and Oracle uses a formula of the form:

      (total number of rows – number of rows for popular values) / (total number of distinct values – number of popular values).

      There is a little oddity here, because the number of rows attributed to a popular value is the number of rows per bucket multiplied by the (notional) number of endpoints for the value.

      I’ll have to write up an example some day – but I think Alberto Dell’Era has a big write-up in his investigations (though he may call it the 11g New Density, the thing I’ve described applies to 10.2.0.5).

      Update Dec 2011 in response to comment below: the formula above calculates the cardinality of the number of rows returned by a query of the form “column = constant”, not the selectivity.
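      Using the figures from the example in the article above (where 3,000 is the only popular value, covering 6 of the 10 buckets), the arithmetic runs as follows:

      rows per bucket           = (10,000 rows - 0 nulls) / 10 buckets = 1,000
      rows for popular values   = 6 buckets * 1,000                    = 6,000
      cardinality (non-popular) = (10,000 - 6,000) / (100 - 1)         = 40.4 (approx.)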

      Comment by Jonathan Lewis — November 13, 2011 @ 3:04 pm GMT

      • So does that mean, then, that if I’m faking up a histogram, either frequency or height-balanced, it’s OK to leave density NULL? That DBMS_STATS will either calculate a decent value, or even if it doesn’t it won’t matter since the CBO won’t use it?

        Comment by Jason Bucata — November 14, 2011 @ 3:23 pm GMT

        • Jason,

          I wouldn’t be 100% sure of that. It’s possible, for example, that there are cases where the optimizer code hasn’t been made completely consistent. I would still prefer to set a density to “the figure I want Oracle to believe” even if I thought it should be ignored.

          Comment by Jonathan Lewis — November 22, 2011 @ 8:21 am GMT

      • Hi Jonathan

        I am using 11.2 and, using the formula (total number of rows – number of rows for popular values) / (total number of distinct values – number of popular values), I tried to calculate the selectivity for my query, but I am not getting the value depicted in the 10053 trace. Below is my 10053 trace output:

        Column (#2): 
            NewDensity:0.000000, OldDensity:0.000000 BktCnt:254, PopBktCnt:80, PopValCnt:1, NDV:20297719
          Column (#2): Col2 (
            AvgLen: 36 NDV: 20297719 Nulls: 0 Density: 0.000000
            Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 176
          Table: Tab1  Alias: UNIT1_
            Card: Original: 20297719.000000  Rounded: 6353026  Computed: 6353026.22  Non Adjusted: 6353026.22
        

        After putting the values into the formula, the selectivity comes out to be (20297719 - (20297719/254)*80)/(20297719-1) = 0.6850394, though the selectivity depicted in the trace is:

        Access Path: index (RangeScan)
            Index: UI_Col1
            resc_io: 5766117.00  resc_cpu: 43557791448
            ix_sel: 0.312992  ix_sel_with_filters: 0.312992 
            Cost: 5772791.97  Resp: 5772791.97  Degree: 1
        

        Please tell me where I am wrong in my calculation.

        Thanks
        Amit

        Comment by Amit — December 30, 2011 @ 6:53 pm GMT

        • Amit,

          I have corrected the comment above that you are using as the source of the formula. When I wrote it, I was thinking about cardinality rather than selectivity (and the selectivity for ‘column = constant’ is the value that the optimizer reports as the NewDensity).

          But even when you know what the formula is really supposed to give you, you still get a strange answer – and that’s because there is something very wrong with your statistics. Note that the table has 20,297,719 rows and the column has 20,297,719 distinct values. Despite this, the histogram claims that there are 80 buckets out of 254 that all represent the same value … which means the NDV ought to be roughly 2/3 of the size that it is.

          I note, then, that your index is called UI_Col1 – which suggests you have a naming convention that prefixes unique indexes with the letters UI, and that means something even more bizarre is happening with the stats. Given the path is a RangeScan, of course, the ix_sel in the Access path calculation doesn’t have to have anything to do with the NewDensity from the table stats. On the other hand the value of 0.31 (combined with the 80 buckets of popular values) suggests that your query may be targeting a range that includes the value that Oracle thinks is the popular value (viz, a little less than one third of the rows … 80/254 = 0.3149).

          At this point I think I’m going to guess that Col1 is a character based column of more than 32 bytes and a large number of rows have a value that starts with the same 32 bytes. I say this because it’s one way to reproduce your anomaly – the code to calculate the NDV counts the actual distinct values, while the code to generate the histogram limits itself (with some complications) to the first 32 bytes.
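          As a sketch of the kind of test that reproduces the anomaly (the t_trunc table and v1 column are invented for illustration, and I’m assuming a 100% sample):

          rem  Every value shares its first 32 characters, so num_distinct
          rem  can be 10,000 while the histogram code, working on the
          rem  first 32 bytes, sees what looks like a single value.

          create table t_trunc as
          select rpad('x',32,'x') || rownum v1
          from all_objects where rownum <= 10000;

          exec dbms_stats.gather_table_stats(user,'T_TRUNC',estimate_percent=>100,method_opt=>'for all columns size 254')

          select num_distinct, num_buckets, histogram
          from user_tab_columns
          where table_name = 'T_TRUNC' and column_name = 'V1';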

          I’ll try to write up a short note to demonstrate the NewDensity calculation in a couple of days’ time – but I think Alberto Dell’Era has a big write-up in his investigations.

          Comment by Jonathan Lewis — December 31, 2011 @ 3:34 pm GMT

  7. Thanks Jonathan for the response. My actual problem is the same one that you spotted. Because the first 32 characters are the same, my histograms are reporting skewed data and calculating incorrect cardinalities. I have posted this problem in a separate forum; here I wanted to understand the selectivity/cardinality calculation, so I did not mention the issue – but you spotted it :)

    My query is doing a FTS because it is getting the cardinality as 1/3 of the records in the table, and an index RangeScan for other values in non-popular buckets. But I am trying to understand:

    1. Why is it reporting NewDensity:0.000000?
    2. The selectivity the optimizer calculates is (80-0.5)/254 = 0.312992, which gives a cardinality of 20297719*0.312992 = 6353026.22, but I am not sure why it deducts 0.5 from the popular bucket count. I got the same results in other 10053 trace files for this query.
    3. And the actual problem: should there be no histogram on this column, or how should I deal with this kind of histogram issue? The predicate used in the query (with the skewed histogram) is not on a unique column; the access path shown is a RangeScan on the other unique index UI_COL1, which the optimizer is evaluating for other strings not having 32 chars.

    Comment by Amit — January 2, 2012 @ 7:35 pm GMT

    • Amit,
      1. Why is it reporting NewDensity:0.000000?

      You’ve got more than 20M distinct values according to the stats; what’s 1/20,000,000? Small enough to show up as 0.000000 at six decimal places.

      2. The selectivity the optimizer calculates is (80-0.5)/254 = 0.312992, which gives a cardinality of 20297719*0.312992 = 6353026.22, but I am not sure why it deducts 0.5 from the popular bucket count. I got the same results in other 10053 trace files for this query.

      I made the mistake of assuming the column stats and the index access path information you gave me were somehow related – I suppose I should have guessed that an index named UI_Col1 didn’t contain a column named Col2, but I rather discounted that possibility as a typo because I didn’t think your columns were really called Col1 and Col2 anyway. Why would the number of popular buckets for one column have anything to do with the selectivity of an index which doesn’t include that column?

      3. And the actual problem: should there be no histogram on this column, or how should I deal with this kind of histogram issue? The predicate used in the query (with the skewed histogram) is not on a unique column; the access path shown is a RangeScan on the other unique index UI_COL1, which the optimizer is evaluating for other strings not having 32 chars.

      I’m going to guess that there are two predicates (at least), one on Col1 and one on Col2. But how am I supposed to work out an answer to the question when you haven’t given me any clues about what the predicates look like, or what the index looks like?

      Comment by Jonathan Lewis — January 3, 2012 @ 8:01 pm GMT

      • Hi Jonathan,

        My apologies for providing incorrect and incomplete information. The naming convention of the index did not use this column name, and I did not notice that the index used in the query actually contains this column. You are right: this is a unique index, and it contains three columns (the leading column of the index is the one with the histogram issue). I verified the information from the index and the columns, and the data in the column is actually not skewed, because the distinct_keys in the index and the num_distinct in the column are approximately the same, as below; the other columns have only 2 and 0 distinct values.

        SQL> select column_name,num_distinct from dba_tab_cols where table_name='EVE' and column_name='IDENTITY';
        
        COLUMN_NAME                    NUM_DISTINCT
        ------------------------------ ------------
        IDENTITY                         	24749753
        
        SQL> select  index_name,num_rows,distinct_keys from dba_indexes where table_name='EVE' and index_name='UI_EVE';
        
        INDEX_NAME                       NUM_ROWS DISTINCT_KEYS
        ------------------------------ ---------- -------------
        UI_EVE              		25670860      25670860
        

        As you mentioned, because the first 32 chars of the column values are the same, the histogram was reporting the data as skewed. I decided to delete the histogram on this column, and now the query is using the index scan in place of the FTS.

        Thanks for your inputs.

        Amit

        Comment by Amit — January 20, 2012 @ 4:06 pm GMT

  8. […] recent comment on a note I wrote some time ago about faking histograms asked about the calculations of selectivity in the latest versions of Oracle. As I read the […]

    Pingback by NewDensity « Oracle Scratchpad — January 3, 2012 @ 5:56 pm GMT

  9. […] Yes – here’s one example. […]

    Pingback by Usage Stats « Oracle Scratchpad — January 24, 2013 @ 7:00 pm GMT

  10. […] though, don’t forget that I also pointed out that sometimes you may still need to create “fake” histograms to get the best possible […]

    Pingback by Webinar questions | Oracle Scratchpad — June 14, 2013 @ 4:41 pm BST

  11. […] is to use method_opt => ‘for all columns size 1′ (i.e. no histograms) and then run scripts to create the histograms I want. This means that after any stats collection I need to run code that […]

    Pingback by Delete Histogram | Oracle Scratchpad — May 1, 2014 @ 1:27 pm BST

  12. […] Faking a height-balanced histogram  How to create a height-balanced histogram (using a numeric column for the example). […]

    Pingback by Faking Histograms | Oracle Scratchpad — October 15, 2018 @ 1:37 pm BST

