Oracle Scratchpad

October 9, 2013

12c Histograms pt.3

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:13 pm BST Oct 9,2013

It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.
In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c
In part 2 I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms

In part 3 of this mini-series I’ll be describing the implementation of the “hybrid” histogram that Oracle produces if the “approximate NDV” mechanism has been enabled and you’ve left the estimate_percent at its default of auto_sample_size. There is little difference between the work needed to create a hybrid histogram and the work needed to generate the old “height-balanced” histogram, but the amount of information captured by the hybrid is much greater than that of the height-balanced.
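
For reference, here’s a minimal sketch of the kind of call that should produce a hybrid histogram in 12c – assuming a table t1 with a number column called value, which is the (assumed) test table I’ll refer to throughout:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns value size 20',
                estimate_percent => dbms_stats.auto_sample_size -- the default; a fixed percentage disables the new mechanism
        );
end;
/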

Starting data set

Here’s a list of 100 numbers, holding 37 distinct values, which I’m going to use to show how Oracle generates the two types of histogram.


  8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
 21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
 28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
 32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
 37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
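
(If you want to play along, here’s one way of loading that list into a table – again I’m assuming a table called t1 with a single number column called value:)

create table t1 (value number);

insert into t1
select  column_value
from    table(sys.odcinumberlist(
                 8,12,12,13,13,13,15,16,16,17,18,18,19,19,19,20,20,20,20,20,
                21,22,22,22,23,23,24,24,25,26,26,26,27,27,27,27,27,27,28,28,
                28,28,28,28,29,29,29,29,29,29,30,30,30,31,31,31,31,31,32,32,
                32,33,33,33,33,33,33,33,33,34,34,34,35,35,35,35,35,35,35,36,
                37,38,38,38,38,38,39,39,40,41,42,42,43,43,43,44,45,46,50,59
        ))
;

commit;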

You’ll note that I’ve sorted the data into order – this is (in effect) the first step of generating either type of histogram, and a reason why Oracle typically selects a small sample of the raw data as the basis for the histogram: sorting is expensive. Imagine, then, that I’ve asked Oracle to create a histogram with 20 buckets. Given that there are 37 distinct values I don’t have enough buckets for a frequency histogram, so pre-12c Oracle would have to create a height-balanced histogram.

Since we want 20 buckets, and have 100 rows, Oracle would calculate the bucket size as 5, then walk through the sorted sample extracting every 5th value, numbering the resulting endpoints from 1 to 20; since the first value in the sample (8) doesn’t match the first bucket “endpoint value” (13), Oracle would also select that and number it as 0. So our selection now looks like this:


 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20    (endpoint_number)
 8 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59    (endpoint_value)
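
If you want to see that selection for yourself, the ntile() function (which is what Oracle’s own SQL uses – see the trace extracts at the end of the article) will do the job; here’s a sketch against the assumed t1:

select
        bkt, max(value) endpoint_value
from    (
        select  value, ntile(20) over (order by value) bkt
        from    t1
        where   value is not null
        )
group by
        bkt
order by
        bkt
;

The max() of each of the 20 tiles gives you the endpoint values numbered 1 to 20; the min() of the first tile would give you the extra “endpoint 0” (the 8).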

You’ll notice, however, that the value 29 appears twice in our list – this is significant in two ways: any value that appears more than once in the selection is recognised as a “popular” value, and Oracle doesn’t store the duplicates; it stores only the highest endpoint_number for that value. To take advantage of the histogram Oracle “counts buckets” when producing cardinality estimates for popular values, and “factors out” the popular values when producing cardinality estimates for non-popular values.
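
In other words, you can spot the popular values in a height-balanced histogram by checking how many buckets each stored endpoint accounts for; here’s a sketch (again assuming the histogram is on t1.value):

select
        endpoint_value,
        endpoint_number
      - lag(endpoint_number, 1, 0) over (order by endpoint_number) buckets
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'VALUE'
order by
        endpoint_number
;

Any endpoint_value whose buckets figure exceeds 1 is popular – for this data set only the 29 (which covers buckets 9 and 10) would qualify.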

I won’t go into details of cardinality calculations here – the only point I want to make in this article is about the quality of the histogram. Oracle has identified 29, with 6 appearances, as the ONLY popular value, but when we look at the raw data we can see that 27 and 28 also appear 6 times, 35 appears 7 times, and 33 appears 8 times. Being spotted as “popular” has an enormous element of luck attached. From day to day a small change in the data, or a small change in the sample, could result in a significant change in plans because different values are recognised as the popular ones, despite a relatively insignificant change in the data.

Hybrid Histograms

The new 12c histogram is labelled as a hybrid because it contains elements of a frequency histogram as well as the “bucket” approach of the height-balanced. We start with a sort operation, and we do count buckets (and again the bucket size will be “number of rows” / “number of buckets”), but we allow the buckets to be of variable size. It would be nice to have a mechanism for giving a dynamic demonstration – which is what I have when I do a presentation on this topic – but that’s not possible in this blog format, so I’ll just have to talk my way through it. Here’s the sorted list of numbers again:


  8 12 12 13 (13) [13] 15 16 16 17 (18) [18] 19 19 19 20 (20) 20 20 [20]
 21 22 22 22  23   23  24 24 25 26  26   26  27 27 27 27  27  27 28  28
 28 28 28 28  29   29  29 29 29 29  30   30  30 31 31 31  31  31 32  32
 32 33 33 33  33   33  33 33 33 34  34   34  35 35 35 35  35  35 35  36
 37 38 38 38  38   38  39 39 40 41  42   42  43 43 43 44  45  46 50  59

We’re counting in fives, so the first value we pick is the 13 in parentheses; but the next value along is also a 13, so we move the endpoint along by one, to the 13 in brackets. Next we count another five places, to reach the 18 in parentheses, and again find that we need to move the endpoint a little further, to the 18 in brackets. Again we count on five places, to reach the 20 in parentheses, and this time keep going to the end of the 20s (the 20 in brackets).

So far we’ve picked up an extra 13 in the first bucket, an extra 18 in the second bucket, and three extra 20s in the third bucket – so our buckets are allowed to be different sizes (with a minimum of 5, so far), and you can appreciate that by the time we get to the end of the sample something a bit funny is going to happen, because we’re going to have very little data left to spread between the last few buckets.
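
If it helps to see that walk written down as an algorithm, here’s a minimal PL/SQL sketch of the endpoint selection – it assumes the data is in t1.value, and it ignores the special treatment of the low and high values and the capture of the repeat counts:

declare
        type num_tab is table of number;
        v       num_tab;
        bktsz   pls_integer;
        i       pls_integer;
begin
        -- the sorted sample (here: the full 100-row data set)
        select value bulk collect into v from t1 order by value;

        bktsz := ceil(v.count / 20);    -- 100 rows / 20 buckets = 5
        i := bktsz;

        while i <= v.count loop
                -- push the endpoint along past any further duplicates of its value
                while i < v.count and v(i + 1) = v(i) loop
                        i := i + 1;
                end loop;
                dbms_output.put_line(
                        'endpoint_number: ' || i ||
                        '  endpoint_value: ' || v(i)
                );
                i := i + bktsz;         -- count on another five places
        end loop;
end;
/

Run against my 100 rows this prints the pairs 6/13, 12/18, 20/20, 26/23 and so on – exactly the endpoint_number / endpoint_value columns in the output below.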

There are two things to note. First, we did something special with the highest value in each bucket, so it would be nice to capture some information about that value. Secondly, we have variable-sized buckets, so we need to know something about the sizes of those buckets – we can’t just label them bucket 1, bucket 2 and so on. Here are the first few rows of the information that Oracle keeps for this histogram:

select
        endpoint_number,
        endpoint_value,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
order by
        endpoint_number
;

       EPN       EPV            REP
 --------- --------- --------------
         1         8              1
         6        13              3
        12        18              2
        20        20              5    ***
        26        23              2
        32        26              3
        38        27              6
        44        28              6
        50        29              6
        58        31              5
        69        33              8
        79        35              7
        86        38              5
...

You’ll notice that the endpoint_numbers are actually just counting their way through the sample (notice that, as before, we’ve also captured the first item (8) in the sample, to show that there’s more to the first bucket than just the 13s). Oracle has also added the endpoint_repeat_count to the histogram information. Combining these two pieces of information we can look at the 4th line of the output (flagged ***), compare it with the 3rd line, and say:
The bucket size is 8 … (20 – 12)
The high value in the bucket is 20 (the fact that the endpoint number and endpoint value were the same is a coincidence)
The number of 20s in the bucket is 5
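
The same lag() trick I used earlier to spot popular values turns the stored endpoint_numbers into bucket sizes; here’s a sketch:

select
        endpoint_value,
        endpoint_number
      - lag(endpoint_number, 1, 0) over (order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'VALUE'
order by
        endpoint_number
;

The row for endpoint_value 20 would show bucket_size 8 and endpoint_repeat_count 5 – the figures we just derived by hand.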

Using this strategy – which demands roughly the same level of resource usage as the height-balanced strategy, since it’s mostly about selecting and sorting a sample – we get far more information about our sample. Notice in particular how many “popular” values are obvious in my output: 27, 28, 29, 33, and 35 ALL show up as having more than a single bucket’s worth of data; and we even capture 20 and 38 with exactly one bucket’s worth of data each. Our ability to produce a larger number of reasonable cardinality estimates is vastly improved.

Not only that: because we have variable bucket sizes and a count of the rows holding the high value of each bucket, we also have a better estimate of the data distribution of “non-popular” values within buckets. Again I’m not going to go into all the details of the arithmetic the optimizer can do, but I hope you can appreciate the increase in information that the hybrid histogram has made available.

SQL Bits

You can, of course, enable SQL tracing while collecting stats to see what SQL the code produces – I’ve done this just to back up my claim that the basic costs of the height-balanced and hybrid histograms are very similar.
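
A sketch of the approach (the trace level is up to you – level 4 just adds bind values):

alter session set events '10046 trace name context forever, level 4';

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns value size 20'
        );
end;
/

alter session set events '10046 trace name context off';

Here are a couple of samples of the SQL that appeared in my trace files, height-balanced first: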

select
        min(minbkt), maxbkt,
        substrb(dump(min(val),16,0,32),1,120) minval,
        substrb(dump(max(val),16,0,32),1,120) maxval,
        sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
        sum(case when rep=1 then 1 else 0 end) unqrep
from    (
        select
                val, min(bkt) minbkt, max(bkt) maxbkt,
                count(val) rep, count(val)*count(val) repsq
        from    (
                select /*+ lots of hints  */
                        "LN100" val, ntile(200) over (order by "LN100") bkt
                from    sys.ora_temp_1_ds_616 t
                where   "LN100" is not null
                )
        group by val
        )
group by maxbkt
order by maxbkt

select
        substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
        (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
        substrb(dump(max(val) over(),16,0,64),1,20) maxval,
        substrb(dump(min(val) over(),16,0,64),1,20) minval
from    (
        select
              val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
              (case when freq > ((sum(freq) over())/15)  then 1  else 0 end) pop
        from  (
              select  /*+ lots of hints */
                      "VALUE"  val, count("VALUE") freq
              from
                      "TEST_USER"."T1" t
              where
                      "VALUE" is not null
              group by
                      "VALUE"
              )
        )
order by val
/

As you can see, the code is very different, but you can probably appreciate that the two workloads largely revolve around sorting and aggregating the sample in different ways. (These examples didn’t come from the data set shown in the rest of the article, by the way, and the first example comes from a case where the dbms_stats procedures had copied a sample of the original table into a global temporary table – the sys.ora_temp_1_ds_616 you can see in the first statement – before doing the complicated bit.)

Comments »

  3. Thanks for your very clear description of hybrid histograms!
It seems obvious that hybrid histograms will deliver better information to the optimizer than traditional height-balanced. BUT: I have to set “ESTIMATE_PERCENT” to “AUTO_SAMPLE_SIZE”. In 11g, Oracle uses a sample of around 5500 rows while creating the histogram. This can lead to “wrong” histograms and result in bad execution plans.
    Has this been improved in 12c?
    Would you recommend using “AUTO_SAMPLE_SIZE” to get the benefit of hybrid histograms?

    Comment by Reiner — January 7, 2014 @ 11:11 am BST Jan 7,2014 | Reply

