Oracle Scratchpad

October 9, 2013

12c Histograms pt.3

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:13 pm BST Oct 9,2013

It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.
In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c
In part 2 I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms

In part 3 of this mini-series I’ll be describing how the implementation of the “hybrid” histogram that Oracle produces if the “approximate NDV” mechanism has been enabled and you’ve left the estimate_percent to auto_sample_size. There is little difference between the work needed to create a hybrid histogram and the work needed to generate the old “height-balanced” histogram, but the degree of information captured by the hybrid is much greater than that of the height-balanced.

Starting data set

Here’s a list of 100 numbers, holding 37 distinct values, which I’m going to use to show how Oracle generates the two types of histogram.


  8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
 21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
 28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
 32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
 37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

You’ll note that I’ve sorted the data into order – this is (in effect) the first step of generating either type of histogram, and a reason why Oracle typically selects a small sample of the raw data as the basis for the histogram: sorting is expensive. Inagine, then, that I’ve asked Oracle to create a histogram with 20 buckets. Given that there are 37 distinct values I don’t have enough buckets for a frequency histogram so pre-12c Oracle would have to create a height-balanced histogram.

Since we want 20 buckets, and have 100 rows, Oracle would calculate the bucket size as 5, and walk through the sorted sample extracting every 5th bucket, numbering them from 1 to 20; since the first value in the sample (8) doesn’t match the first bucket “endpoint value” (13) Oracle would also select that and number it as 0. So our selection now looks like this:


 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20    (endpoint_number)
 8 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59    (endpoint_value)

You’ll notice, however, that the value 29 appears twice in our list – this is significant in two way: any value that appears more than once in the selection is recognised as a “popular” value, and Oracle doesn’t store the duplications it stores only the highest endpoint_number for that value. To take advantage of the histogram Oracle “counts buckets” when producing cardinality estimates for popular values, and “factors out” the popular values when producing cardinality estimates for non-popular values.

I won’t go into details of cardinality calculations here – the only point I want to make in this article is about the quality of the histogram. Oracle has identified 29, with 6 appearances, as the ONLY popular value, but when we look at the raw data we can see that 27 and 28 also appear 6 time, 35 appears 7 times, and 33 appears 8 times. Being spotted as “popular” has an enormous element of luck attached. From day to day a small change in the data, or a small change in the sample, could result in a significant change in plans because different values are recognised as the popular ones despite a relatively insignificant change in the data.

Hybrid Histograms

The new 12c histogram is labelled as a hybrid because it contains elements of a frequency histogram as well as the “bucket” approach of the height-balanced. We start with a sort operation, and we do count buckets (and again the bucket size will be “number of rows” / “number of buckets” – but we allow the buckets to be variable size. It would be nice to have a mechanism for giving a dynamic demonstration – which is what I have when I do a presentation on this topic – but it’s not possible on this blog format, so I’ll just have to talk my way through it. Here’s the sorted list of numbers again:


  8 12 12 13 (13) [13] 15 16 16 17 (18) [18] 19 19 19 20 (20) 20 20 [20]
 21 22 22 22  23   23  24 24 25 26  26   26  27 27 27 27  27  27 28  28
 28 28 28 28  29   29  29 29 29 29  30   30  30 31 31 31  31  31 32  32
 32 33 33 33  33   33  33 33 33 34  34   34  35 35 35 35  35  35 35  36
 37 38 38 38  38   38  39 39 40 41  42   42  43 43 43 44  45  46 50  59

We’re counting in fives, so the first value we pick is the (parenthesised) 13. but the next value along is also a 13, so we move the endpoint along by one (to the [bracketed] 13). Next we count another five places, to reach the 18 in (brackets), and again find that we need to move the endpoint a bit further to the 18 in [brackets]. Again we count on five places to reach the 20 in brackets, and keep going to the end of the 20’s.

So far we’ve picked up an extra 13 in the first bucket, an extra 18 in the second bucket, and three extra 20’s in the third bucket – so our buckets are allowed to be different sizes (with a minimum of 5, so far) and you can appreciate that by the time we get to the end of the sample something a bit funny is going to happen because we’re going to have very little data left to spread between the last few buckets.

There are two things to note – we did something special about the highest value in each bucket – so it would be nice to capture some information about that value – we have variable sized buckets – so we need to know something about the sizes of those buckets, we can’t just label them bucket 1, bucket 2 and so on. Here are the first few rows of the information that Oracle keeps for this histogram:

select
        endpoint_number,
        endpoint_value,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
order by
        endpoint_number
;

       EPN       EPV            REP
 --------- --------- --------------
         1         8              1
         6        13              3
        12        18              2
        20        20              5    ***
        26        23              2
        32        26              3
        38        27              6
        44        28              6
        50        29              6
        58        31              5
        69        33              8
        79        35              7
        86        38              5
...

You’ll notice that the endpoint_numbers are actually just counting their way through the sample (notice that, as before, we’ve also captured the first item (8) in the sample to show that there’s more to the first bucket than just the 13’s). Oracle has also added the endpoint_repeat_count to the histogram information. Combining these two pieces of information we can look at the 4th line of the output, comparing it with the 3rd line, and say:
The bucket size is 8 … (20 – 12)
The high value in the bucket is 20 (the fact that the endpoint number and endpoint value were the same is a coincidence)
The number of 20’s in the bucket is 5

Using this strategy – which basically is about the same level of resource usage as the height-balanced strategy, it’s mostly about selecting and sorting a sample – we get far more information about our sample. Notice in particular how many “popular” values are obvious in my output: 27, 28, 29, 33, and 35 ALL show up as having more than a single bucket’s worth of data; and we even capture 20 and 38 with exactly one bucket’s worth of data. Our ability to produce a larger number of reasonable cardinality estimates is vastly improved.

Not only that, because we have variable bucket sizes and the number of rows of the high value for each bucket, we also have a better estimate of the data distribution of “non-popular” values within buckets. Again I’m not going to go into all the details of the arithmetic the optimizer can do, but I hope you can appreciate the increase in information that the hybrid histogram has made available.

SQL Bits

You can, of course, enable SQL tracing while collecting stats to see what SQL the code produces. I’ve done this just to back up my claim that the basic cost of the height-balanced and hybrid histograms are very similar. Here are a couple of samples of the SQL used, height-balanced first:

select
		min(minbkt),maxbkt,
		substrb(dump(min(val),16,0,32),1,120) minval,
		substrb(dump(max(val),16,0,32),1,120) maxval,
		sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
		sum(case when rep=1 then 1 else 0 end) unqrep
from	(
		select
			val, min(bkt) minbkt, max(bkt) maxbkt,
			count(val) rep, count(val)*count(val) repsq
		from	(
			select /*+ lots of hints  */
				"LN100" val, ntile(200) over (order by "LN100") bkt
			from	sys.ora_temp_1_ds_616 t
			where 	"LN100" is not null
			)
		group by val
		)
group by maxbkt
order by maxbkt

select
        substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
        (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
        substrb(dump(max(val) over(),16,0,64),1,20) maxval,
        substrb(dump(min(val) over(),16,0,64),1,20) minval
from    (
        select
              val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
              (case when freq > ((sum(freq) over())/15)  then 1  else 0 end) pop
        from  (
              select  /*+ lots of hints */
                      "VALUE"  val, count("VALUE") freq
              from
                      "TEST_USER"."T1" t
              where
                      "VALUE" is not null
              group by
                      "VALUE"
              )
        )
order by val
/

As you can see, the code is very different, but you can probably appreciate that the two workloads largely revolve around sorting and aggregate the sample in different ways. (These examples didn’t come from the data set shown in the rest of the article by the way, and the first example comes from a case where the dbms_stats procedures had copied a sample of the original table into a global temporary table called before doing the complicated bit.)

8 Comments »

  1. […] Part 3 – Hybrid Histograms […]

    Pingback by Histograms | Oracle Scratchpad — October 16, 2013 @ 5:22 pm BST Oct 16,2013 | Reply

  2. […] and “height balanced” ones [posts by Jonathan Lewis: part 1, part 2, part 3 (with a very clear example of hybrid […]

    Pingback by DB Oriented — January 4, 2014 @ 2:38 pm BST Jan 4,2014 | Reply

  3. Thanks for your very clear description of hybrid histograms!
    It seems obvious that hybrid histograms will deliver better information to the optimzer than traditional height-balanced. BUT: I have to set “ESTIMATE_PERCENT” to “AUTO_SAMPLE_SIZE”. In 11g, Oracle uses a sample of around 5500 rows while creating the histogram. This can lead “wrong” histograms and result in bad execution plans.
    Has this been improved in 12c?
    Would you recommend using “AUTO_SAMPLE_SIZE” to get the benefit of hybrid histograms?

    Comment by Reiner — January 7, 2014 @ 11:11 am BST Jan 7,2014 | Reply

  4. […] for height-balanced histograms,  and it’s not appropriate for 12c which has introduced hybrid histograms that will require me to modify my “histogram faking” code a […]

    Pingback by Adjusting Histograms | Oracle Scratchpad — July 4, 2014 @ 8:32 pm BST Jul 4,2014 | Reply

  5. Hello Jonathan,
    just a little question about Hybrid histogram. You wrote that the minimum number of values per bucket should be 5 (the bucket size) or bigger because of the variable size
    I have a simple case where Oracle chose to create a smaller bucket:

    5 5 7 7 7 7 10 12 15 15 15 20

    In this case, I got the following histogram using 3 buckets (4 buckets is eligible to Top-Frequency):

    ENDPOINT_NUMBER                ENDPOINT_VALUE              ENDPOINT_REPEAT_COUNT
    ------------------------------ --------------------------- -----------------------------------------
                                 2                           5                                         2
                                 6                           7                                         4
                                10                          20                                         1
    

    Bucket size should be 4 (12 / 3) however, the first bucket seems to contain only the two “5”.
    If we look further the second one contains four “7” and the last bucket contains one “20” with three other values (10 – 6 = 4)
    So the histogram contains only 10 values instead of 12.

    Do you have an idea how it decided to create a bucket with 2 values instead of one with 6 values (the “5” and “7”)?

    Comment by Nicolas Jardot — August 18, 2014 @ 5:01 pm BST Aug 18,2014 | Reply

    • Nicolas,

      I always expect to see a few oddities at the boundaries, and playing around with very small data sets and bucket counts is asking for oddities.

      One detail I don’t seem to have in the article is that Oracle does want to keep track of the low and high values – and I think that that’s why the bucket of 2 rows has appeared. After that I can’t explain the counting errors. I have been able to produce a couple more anomalies by adding more rows (with values between 6 and 29) to your data set and then asking for histograms with fewer buckets than distinct values. I suspect something odd can happen as Oracle decides which bucket to eliminate to allow it to introduce a bucket for the low value.

      Comment by Jonathan Lewis — August 18, 2014 @ 5:55 pm BST Aug 18,2014 | Reply

      • Small data sets make the algorithm easier to understand. In that case, it seems there is a rule I was not aware of (track the low and high values allows smaller buckets).
        And because of this rule Oracle looses information: only 10 rows in the histogram and the number of distinct values becomes wrong.
        I was just playing with a random data set: when something is wrong in real case it might be due to such anomalies even with a bigger number of rows and buckets (at the boundary between the Top-Frequency and Hybrid)

        Comment by Nicolas Jardot — August 19, 2014 @ 10:16 am BST Aug 19,2014 | Reply

        • Hi,

          We can see that ‘detail’ when tracing column statistics gathering (as exposed by http://www.pythian.com/blog/options-for-tracing-oracle-dbms_stats/) with the following:

          SQL> exec dbms_stats.set_global_prefs('TRACE',to_char(1+16));
          SQL> exec dbms_stats.gather_table_stats(ownname=>'',tabname=>'HISTOGRAM',method_opt=>'FOR COLUMNS C2 SIZE 3');
          

          it shows that the last bucket (value 15) has been removed to be replaced by the max:

          DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15
          DBMS_STATS: removal_count: 1 total_nonnull_rows: 12 mnb:  3
          DBMS_STATS:  adjusted coverage: .667
          DBMS_STATS: hist_type in exec_get_topn: 2048 ndv:6 mnb:3
          DBMS_STATS: Evaluating frequency histogram for col: "C2"
          DBMS_STATS:  number of values = 4, max # of buckects = 3, pct = 100, ssize = 12
          DBMS_STATS:  Trying to convert frequency histogram to hybrid
          

          The ‘adjusted coverage’ may suggest that dbms_stats verifies that there is still a minimum coverage of top frequencies. For example if we calculate stats with only two buckets we get:

          DBMS_STATS: remove first bucket: Typ=2 Len=2: c1,8 add: Typ=2 Len=2: c1,6
          DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15
          DBMS_STATS: removal_count: 2 total_nonnull_rows: 12 mnb:  2
          DBMS_STATS: Abort top-n histogram, as the addition of min/max does not preserve the minimum coverage: .166667 vs. .5
          

          Regards,
          Franck.

          Comment by @FranckPachot — August 19, 2014 @ 1:32 pm BST Aug 19,2014


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 4,013 other followers