Oracle Scratchpad

May 28, 2009

Frequency Histograms

Filed under: CBO,Histograms,Statistics,Tuning — Jonathan Lewis @ 7:34 pm GMT

In a recent article on changes to the calculation of cardinality with frequency histograms, I commented on writing programs to create fake but realistic and complete frequency histograms as a way of avoiding the dramatic changes in execution plan that can occur if you let Oracle create the histograms by gathering stats.

As a follow-up to that comment, here’s a little script to demonstrate the method (for a numeric column), and a few points to remember when using this method.

We start with a very simple table, and collect statistics – excluding histograms:

create table t1 as
select
	rownum		n1
from	all_objects
where	rownum <= 10000
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 1'
	);
end;
/

The following anonymous PL/SQL block then changes the column stats, adding a histogram to the column and adjusting the density at the same time.

declare

	m_distcnt		number;
	m_density		number;
	m_nullcnt		number;
	srec			dbms_stats.statrec;
	m_avgclen		number;

	n_array                 dbms_stats.numarray;

begin

	m_distcnt	:= 100;
	m_density	:= 0.02;
	m_nullcnt	:= 0;
	m_avgclen	:= 3;

	n_array		:= dbms_stats.numarray(75, 81,  88, 91,   99);
	srec.bkvals	:= dbms_stats.numarray( 2, 40, 200,  3,  755);
	srec.epc	:= 5;

	dbms_stats.prepare_column_values(srec, n_array);

	dbms_stats.set_column_stats(
		ownname		=> user,
		tabname		=> 't1',
		colname		=> 'n1',
		distcnt		=> m_distcnt,
		density		=> m_density,
		nullcnt		=> m_nullcnt,
		srec		=> srec,
		avgclen		=> m_avgclen
	);

end;
/

By setting the n_array, I have listed the “popular” values that I want to tell the optimizer about. Note that these values must be in sorted order, lowest first. (The dbms_stats package also has the option to specify chararray, datearray, rawarray, fltarray (float) and dblarray (double) types – the last two arriving in 10g.)
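
For a character column the pattern is the same. Here’s a minimal sketch – the table t2, column v1, values and frequencies are invented purely for illustration, not taken from the demonstration above:

declare
	srec		dbms_stats.statrec;
	c_array		dbms_stats.chararray;
begin
	c_array		:= dbms_stats.chararray('RED', 'WHITE');	-- sorted, lowest first
	srec.bkvals	:= dbms_stats.numarray(30, 70);			-- 30 and 70 rows per 100
	srec.epc	:= 2;

	dbms_stats.prepare_column_values(srec, c_array);

	dbms_stats.set_column_stats(
		ownname	=> user,
		tabname	=> 't2',
		colname	=> 'v1',
		distcnt	=> 2,
		density	=> 0.01,
		nullcnt	=> 0,
		srec	=> srec,
		avgclen	=> 4
	);
end;
/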

Back to the numeric example: the list of frequencies in srec.bkvals (bucket values) totals 1,000, and indicates that for every 1,000 rows in the table 2 rows will have the value 75, 40 rows will have the value 81, 200 rows will have the value 88, and so on.

By setting the density to 0.02 I have then instructed the optimizer that any predicate of the type “column = value_not_in_histogram” is supposed to return 2% of the rows in the table.  (Note – for purposes of demonstration this is deliberately not consistent with the histogram I have created. Note also that this changes in 10.2.0.4 – see closing paragraphs)

Having set the statistics I can now check to see what Oracle has recorded in the data dictionary:

select
	num_distinct,
	low_value,
	high_value,
	density,
	num_nulls,
	num_buckets,
	histogram
from
	user_tab_columns
where
	table_name = 'T1'
and	column_name = 'N1'
;

NUM_DISTINCT LOW_VALUE  HIGH_VALUE  DENSITY  NUM_NULLS NUM_BUCKETS HISTOGRAM
------------ ---------- ----------- ------- ---------- ----------- ---------
         100 C14C       C164            .02          0           5 FREQUENCY

select
	endpoint_value,
	endpoint_number,
	endpoint_number - nvl(prev_number,0) frequency
from	(
	select
		endpoint_value,
		endpoint_number,
		lag(endpoint_number,1) over(
			order by endpoint_number
		) prev_number
	from
		user_tab_histograms
	where
		table_name = 'T1'
	and	column_name = 'N1'
	)
order by
	endpoint_value
;

ENDPOINT_VALUE ENDPOINT_NUMBER  FREQUENCY
-------------- --------------- ----------
            75               2          2
            81              42         40
            88             242        200
            91             245          3
            99            1000        755

A frequency histogram (originally called a “value-based” histogram by Oracle) is stored as a “cumulative frequency histogram” – but the original frequencies can be retrieved by the use of the lag() or lead() analytic functions.
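
As a quick sanity check you can scale those frequencies up to the cardinalities the optimizer ought to report. Here’s a sketch that hard-codes the 10,000 rows of the demonstration table (you could pick the figure up from user_tables instead) – and note that the low figure it shows for the value 75 is where the “missing values” limiting case, described below, will intervene:

select
	endpoint_value,
	round(
		(endpoint_number -
		 nvl(lag(endpoint_number,1) over(order by endpoint_number),0)
		) * 10000 / max(endpoint_number) over()
	)	estimated_rows
from
	user_tab_histograms
where
	table_name = 'T1'
and	column_name = 'N1'
order by
	endpoint_value
;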

Having created the histogram, we can now test its effect – in this case using 10.2.0.3. We’ll use a simple ‘select where column = constant’, using the values 81 (which is in the histogram), 85 (which isn’t) and then 75 (which is in the histogram – but is an anomaly). Here, in order, are the three outputs from dbms_xplan.display() with CPU costing disabled:

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   400 |  1200 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   400 |  1200 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=81)

Since we have 10,000 rows in the table, and our frequency histogram said the value 81 would appear in 40 rows out of every 1,000, the estimated cardinality is 400 (10,000 * 40/1,000).

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=85)

Since 85 does not appear in the histogram the optimizer multiplies the density (0.02) by the number of rows in the table to get 200 as the estimated cardinality.

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   200 |   600 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   200 |   600 |     4 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("N1"=75)

Since we have 10,000 rows in the table, and our frequency histogram said the value 75 would appear in 2 rows out of every 1,000, the estimated cardinality would be 20 – except there’s a special limiting case: the cardinality cannot be less than the “missing values” cardinality, which here is the 200 rows given by the density (0.02 * 10,000), so the optimizer reports 200.

Of course, as I pointed out in the earlier article, things change in 11g and 10.2.0.4 – the optimizer seems to ignore the density. So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 85. Then the “linear decay” rule applies for values from 100 to 124 and 74 down to 50. So when you upgrade, instead of setting the density to describe how to treat missing values, you may need to add an extra value to the histogram reporting double the cardinality you want attributed to missing values – and the fake value may have to be chosen carefully to work around the effects of the linear decay algorithm.
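
By way of illustration only – the fake value (2,000) and the target figure of 5 rows are my own choices, and not something I’ve tested here – the arrays in the original block might be extended like this:

declare

	srec		dbms_stats.statrec;
	n_array		dbms_stats.numarray;

begin

	-- The fake value (2000) sits well above the real data. Its bucket
	-- count of 1 per 1,000 makes it the least frequent entry, so the
	-- "half the least frequent" rule should cost missing values at
	-- 0.5 rows per 1,000 - i.e. 5 rows in our 10,000 row table.

	n_array		:= dbms_stats.numarray(75, 81,  88, 91,  99, 2000);
	srec.bkvals	:= dbms_stats.numarray( 2, 40, 200,  3, 755,    1);
	srec.epc	:= 6;

	dbms_stats.prepare_column_values(srec, n_array);

	dbms_stats.set_column_stats(
		ownname	=> user,
		tabname	=> 't1',
		colname	=> 'n1',
		distcnt	=> 101,		-- one more distinct value than before
		density	=> 0.0005,	-- consistent with 5 rows out of 10,000
		nullcnt	=> 0,
		srec	=> srec,
		avgclen	=> 3
	);

	-- Caveat: 2000 is now the column high_value, so the linear decay
	-- zones move - which is why the fake value must be chosen with care.

end;
/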

[Further reading on Histograms]

21 Comments

  1. Jonathan,

    very nice demonstration. I wasn’t aware of the “missing values limiting case” in the pre-10.2.0.4 and 11.1.0.6 behaviour.

    “So 11g produces a cardinality of 20 for the value 75, and a cardinality of 10 (half the minimum known) for the value 81″

    Shouldn’t this be 85 for the “half the minimum known” case, since 81 is part of the histogram?

    Regards,
    Randolf

    Comment by Randolf Geist — May 28, 2009 @ 9:48 pm GMT

  2. I don’t know exactly what the “linear decay” rule means.

    Could you explain the “linear decay” rule in a little more detail?

    Comment by adenkang — May 29, 2009 @ 3:21 pm GMT

    • Adenkang,
      There is a brief reference describing the principle in the previous post, and a description (with diagram) in my book.

      As an example: assume you have a value that ranges between 0 and 100 (with limits stored by Oracle as the low_value and high_value for the column). If you run a query referencing a value outside that range Oracle 10g has a model that assumes you expect to get some data from the query, but the volume you get decreases as you move further from the known range.

      The model uses the existing range as the centre point, and assumes your data covers three times that range – so if you query for values less than -100, or values above 200, the optimizer assumes there is no data; but if you query for a value between -100 and 0, or between +100 and +200 the optimizer uses a straight-line graph from the basic selectivity down to zero to model those ranges.
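
      Putting numbers to that description (the figures are mine, purely to show the straight line): with a low/high range of 0 to 100 and a base estimate of 50 rows, a query for the value 150 is half way through the upper decay zone, so the model predicts half the base estimate:

      select
      	round( 50 * (1 - (150 - 100) / (100 - 0)) )	decayed_estimate	-- 25 rows
      from
      	dual
      ;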

      Comment by Jonathan Lewis — May 30, 2009 @ 10:34 am GMT

  3. Hi Jonathan,

    What about the case when a literal is chosen outside the range but with a not equal.

    I have a 10.2.0.4 database with Siebel:
    7 million rows, and a column with ‘Y’ for every row (I checked).

    Strangely enough, histograms are created by gather_stats_job.
    Dba_tab_col_statistics reports num_distinct = 1 and density 5e-8!!!

    The query, which joins about 16 tables, has
    the following condition:
    flag != ‘N’ and flag = :bind_variable (which contains the value Y)

    The optimizer chooses an index skip scan on an index which contains the flag column.

    My guess is that, because the != ‘N’ predicate uses a value outside the histogram, the optimizer falls back on the density.

    Is my guess right?
    I browsed your book for histograms and “not equal” clauses but could not find it covered.

    regards Hans-Peter

    Comment by Hans-Peter — August 11, 2009 @ 6:04 am GMT

    • Hans-Peter,

      I’m slightly surprised that I failed to mention “!=” and histograms in the book, but I guess I had to miss a few things out.

      Given you’re looking at 10.2.0.4 (which has made some changes as the previous article pointed out) this might be an area that’s changed anyway.

      I always worry that “one” is a very special value for all sorts of reasons – I think Wolfgang Breitling has found various oddities with histograms and one distinct value in the past.

      I may have some notes somewhere – I’ll take a look when I have a little time.

      Comment by Jonathan Lewis — August 11, 2009 @ 6:18 am GMT

  4. […] path randomly (and catastrophically) because of a histogram collection problem, and suggested a short script to create a fixed representative histogram that would stop that problem re-appearing. Comments […]

    Pingback by Block size – again « Oracle Scratchpad — January 1, 2010 @ 3:01 pm GMT

  5. Hi Jonathan

    It seems that you can force the old (pre 10.2.0.4) behavior by just getting and setting the stats for the column, without changing anything. The only effect this appears to have is to set user_stats=’YES’ on dba_tab_columns. The 10053 trace does not give any clues as to why this works, it just seems to revert to using the column density again for missing histogram values, when user_stats=’YES’.

    I don’t know whether this is a bug or the intended behavior, but it gives you a finer grain of control by allowing you to choose the columns for which you want the old or the new behavior, without having to “alter session set “_fix_control”=’5483301:off’”.

    The following test case demonstrates:

    ------------------ Start of test case.
    drop table x
    /
    
    create table x as
    select 1 code,object_name from all_objects where rownum<=1000
    union
    select 2 code,object_name from all_objects where rownum<=1000
    union
    select 4 code,object_name from all_objects where rownum<=1000
    ;
    
    exec dbms_stats.gather_table_stats(user,'X',method_opt => 'FOR ALL COLUMNS SIZE 254');
    
    set autotrace traceonly explain
    
    select * from x where code=2 -- value in histogram
    /
    
    select * from x where code=3 -- missing value
    /
    
    DECLARE
    	l_distcnt NUMBER DEFAULT NULL;
    	l_density NUMBER DEFAULT NULL;
    	l_nullcnt NUMBER DEFAULT NULL;
    	l_srec    DBMS_STATS.STATREC;
    	l_avgclen NUMBER DEFAULT NULL;
    
    BEGIN
    
    	DBMS_STATS.GET_COLUMN_STATS (
    		ownname	=>user,
    		tabname	=>'X',
    		colname	=>'CODE',
    		distcnt	=>l_distcnt,
    		density	=>l_density,
    		nullcnt	=>l_nullcnt,
    		srec	=>l_srec,
    		avgclen	=>l_avgclen
    	);
    
    	DBMS_STATS.SET_COLUMN_STATS(
    		ownname =>user,
    		tabname =>'X',
    		colname =>'CODE',
    		distcnt =>l_distcnt,
    		density =>l_density,
    		nullcnt	=>l_nullcnt,
    		srec	=>l_srec,
    		avgclen	=>l_avgclen
    	);
    
    END;
    /
    
    select * from x where code=3
    /
    
    ------------------ End of test case.
    

    Here’s the output:

    SQL> create table x as
      2  select 1 code,object_name from all_objects where rownum<=1000
      3  union
      4  select 2 code,object_name from all_objects where rownum<=1000
      5  union
      6  select 4 code,object_name from all_objects where rownum<=1000
      7  ;
    
    Table created.
    
    SQL>
    SQL> exec dbms_stats.gather_table_stats(user,'X',method_opt => 'FOR ALL COLUMNS SIZE 254');
    
    PL/SQL procedure successfully completed.
    
    SQL>
    SQL> set autotrace traceonly explain
    SQL>
    SQL> select * from x where code=2   -- value in histogram
      2  /
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 2941724873
    
    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |   992 | 16864 |     5   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| X    |   992 | 16864 |     5   (0)| 00:00:01 |
    --------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - filter("CODE"=2)
    --
    --  Value in histogram, cardinality = 992
    --
    SQL>
    SQL> select * from x where code=3   -- missing value
      2  /
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 2941724873
    
    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |   496 |  8432 |     5   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| X    |   496 |  8432 |     5   (0)| 00:00:01 |
    --------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - filter("CODE"=3)
    --
    --  Value not in histogram, cardinality = 496 i.e. half the least value, as expected 
    --  for 10.2.0.4.
    --
    
    SQL>
    SQL> DECLARE
      2       l_distcnt     NUMBER       DEFAULT NULL;
      3       l_density     NUMBER       DEFAULT NULL;
      4       l_nullcnt     NUMBER       DEFAULT NULL;
      5       l_srec        DBMS_STATS.STATREC;
      6       l_avgclen     NUMBER       DEFAULT NULL;
      7
      8    BEGIN
      9
     10    DBMS_STATS.GET_COLUMN_STATS (
     11       ownname=> user,
     12       tabname=> 'X',
     13       colname=> 'CODE',
     14       distcnt=> l_distcnt,
     15       density=> l_density,
     16       nullcnt=> l_nullcnt,
     17       srec=> l_srec,
     18       avgclen=> l_avgclen);
     19
     20    DBMS_STATS.SET_COLUMN_STATS(
     21       ownname=> user,
     22       tabname=> 'X',
     23       colname=> 'CODE',
     24       distcnt=> l_distcnt,
     25       density=> l_density,
     26       nullcnt=> l_nullcnt,
     27       srec=> l_srec,
     28       avgclen=> l_avgclen);
     29
     30    END;
     31  /
    
    PL/SQL procedure successfully completed.
    
    SQL>
    SQL> select * from x where code=3
      2  /
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 2941724873
    
    --------------------------------------------------------------------------
    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |      |     1 |    17 |     5   (0)| 00:00:01 |
    |*  1 |  TABLE ACCESS FULL| X    |     1 |    17 |     5   (0)| 00:00:01 |
    --------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - filter("CODE"=3)
    
    --
    --  Cardinality = 1 i.e. the pre 10.2.0.4 behavior.
    --
    

    Regards,
    Mike Barratt

    Comment by Mike Barratt — January 7, 2010 @ 12:28 pm GMT

    • Mike,

      Very interesting. I wonder if it’s deliberate or an accident. Certainly worth checking in 11g. [Update: just ran your example in 11.1.0.6, and the effect does not appear there.]

      This could be the answer to the concern I raised in the last paragraph about the increased complexity of imposing a selectivity for missing values – the old method will still work.

      Comment by Jonathan Lewis — January 7, 2010 @ 9:55 pm GMT

      • Mike, Jonathan

        I do not have any 11gR1 to test. However, it seems this unsuitable behaviour (at least to my mind) is back in 11gR2 :-/

        Regards
        Pavol Babel

        Comment by Pavol Babel — April 24, 2014 @ 3:19 pm GMT

        • Pavol,

          The posting is so old that I’d have to start reading from scratch to get up to speed on it and cross-check the stats.

          I do recall seeing an article on MoS, though, saying something about anomalous behaviour to do with the density calculation and things that can make the optimizer switch between the old and new methods – but that was some time ago and I don’t remember the details. It’s the type of thing I remember and look up when I hit a specific problem.

          Comment by Jonathan Lewis — April 25, 2014 @ 6:09 pm GMT

  6. […] posted an example of creating a frequency histogram a few months ago – and it doesn’t take much to turn this into an example for the […]

    Pingback by Fake Histograms « Oracle Scratchpad — March 23, 2010 @ 8:49 pm GMT

  7. […] Related reading: how to create an “artificial” frequency histogram. […]

    Pingback by Frequency Histogram 3 « Oracle Scratchpad — September 24, 2010 @ 6:47 pm GMT

  8. […] solution, of course, is to write code to create a histogram that represents the situation as you want Oracle to see […]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am GMT

  9. […] to help the optimizer in the most truthful way possible. List partitions (in a way similar to frequency histograms) are an obvious target for this type of […]

    Pingback by Local Indexes – 2 « Oracle Scratchpad — November 14, 2010 @ 5:42 pm GMT

  10. […] huge number of closed orders; we’d also have to create a histogram on the column (possibly by writing a simple script) so that Oracle could recognise the skewed data […]

    Pingback by FBI oddities « Oracle Scratchpad — December 16, 2010 @ 6:18 pm GMT

  11. […] the query to make use of the virtual column, so in some cases it may be a better solution to use another trick suggested by J. Lewis: use “hand-made” histograms that provide a more realistic distribution. Of course, it […]

    Pingback by Histograms for strongly skewed columns « Oracle Diagnostician — May 23, 2012 @ 4:35 pm GMT

  12. Hi Jonathan.

    There is something strange with density in 11.2.0.2.

    Table stats:
    #Rows: 1000000 #Blks: 47414 AvgRowLen: 330.00 ChainCnt: 0.00

    I have a column with only one distinct value.
    AvgLen: 3 NDV: 1 Nulls: 0 Density: 1.000000

    Now, after I’ve updated 1000 rows to null, the density dropped.
    AvgLen: 3 NDV: 1 Nulls: 1000 Density: 0.000001
    Histogram: Freq #Bkts: 1 UncompBkts: 5452 EndPtVals: 1

    any thoughts about this?

    Comment by Sagi Zvi — October 27, 2012 @ 5:28 pm GMT

    • Sagi,

      First you need to work out what you did to get these results – then you can explain how I could reproduce the example.
      As it is you haven’t told me how you collected the statistics, or what the table looks like, or how it’s indexed, or what query you ran. Moreover, you haven’t given me any assurance that the pieces of the trace file are equivalent, so you may have omitted lines from the first sample that you’ve shown in the second.

      As far as “strange” is concerned – I think it’s possible that you collected stats the first time in a way that got 100% sample and no histogram, and in the second case generated a histogram off a small sample – although I would have expected 5e-7 for the second density. However, I don’t see the “NewDensity” figures, so I’m not sure you haven’t been running with optimizer_features_enable set to a lower value.

      Comment by Jonathan Lewis — December 13, 2012 @ 3:01 pm GMT

  13. Thanks for the article!
    How do you interpret endpoint_value for non-numeric columns?
    I have a frequency histogram on a varchar column and this puzzles me.
    E.g. 4.4134523297546E35 has to be translated into a 1-2 character string somehow.
    Regards

    Comment by dautkhanov — October 2, 2013 @ 7:19 pm GMT

