Oracle Scratchpad

October 9, 2013

12c Histograms pt.3

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:13 pm BST Oct 9,2013

It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.
In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c
In part 2 I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms

In part 3 of this mini-series I’ll be describing the implementation of the “hybrid” histogram that Oracle produces if the “approximate NDV” mechanism has been enabled and you’ve left the estimate_percent at its default of auto_sample_size. There is little difference between the work needed to create a hybrid histogram and the work needed to generate the old “height-balanced” histogram, but the amount of information captured by the hybrid is much greater than that of the height-balanced.
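
For reference, here’s a minimal sketch of the kind of call that should produce a hybrid histogram in 12c – assuming a table t1 with a number column called value, which is the (assumed) test table I’ll refer to throughout:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns value size 20',
                estimate_percent => dbms_stats.auto_sample_size -- the default; a fixed percentage disables the new mechanism
        );
end;
/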

Starting data set

Here’s a list of 100 numbers, holding 37 distinct values, which I’m going to use to show how Oracle generates the two types of histogram.


  8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
 21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
 28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
 32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
 37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59
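
(If you want to play along, here’s one way of loading that list into a table – again I’m assuming a table called t1 with a single number column called value:)

create table t1 (value number);

insert into t1
select  column_value
from    table(sys.odcinumberlist(
                 8,12,12,13,13,13,15,16,16,17,18,18,19,19,19,20,20,20,20,20,
                21,22,22,22,23,23,24,24,25,26,26,26,27,27,27,27,27,27,28,28,
                28,28,28,28,29,29,29,29,29,29,30,30,30,31,31,31,31,31,32,32,
                32,33,33,33,33,33,33,33,33,34,34,34,35,35,35,35,35,35,35,36,
                37,38,38,38,38,38,39,39,40,41,42,42,43,43,43,44,45,46,50,59
        ))
;

commit;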

You’ll note that I’ve sorted the data into order – this is (in effect) the first step of generating either type of histogram, and a reason why Oracle typically selects a small sample of the raw data as the basis for the histogram: sorting is expensive. Imagine, then, that I’ve asked Oracle to create a histogram with 20 buckets. Given that there are 37 distinct values I don’t have enough buckets for a frequency histogram, so pre-12c Oracle would have to create a height-balanced histogram.

Since we want 20 buckets, and have 100 rows, Oracle would calculate the bucket size as 5, then walk through the sorted sample extracting every 5th value, numbering the resulting endpoints from 1 to 20; since the first value in the sample (8) doesn’t match the first bucket “endpoint value” (13), Oracle would also select that and number it as 0. So our selection now looks like this:


 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20    (endpoint_number)
 8 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59    (endpoint_value)
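
If you want to see that selection for yourself, the ntile() function (which is what Oracle’s own SQL uses – see the trace extracts at the end of the article) will do the job; here’s a sketch against the assumed t1:

select
        bkt, max(value) endpoint_value
from    (
        select  value, ntile(20) over (order by value) bkt
        from    t1
        where   value is not null
        )
group by
        bkt
order by
        bkt
;

The max() of each of the 20 tiles gives you the endpoint values numbered 1 to 20; the min() of the first tile would give you the extra “endpoint 0” (the 8).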

You’ll notice, however, that the value 29 appears twice in our list – this is significant in two ways: any value that appears more than once in the selection is recognised as a “popular” value, and Oracle doesn’t store the duplicates; it stores only the highest endpoint_number for that value. To take advantage of the histogram Oracle “counts buckets” when producing cardinality estimates for popular values, and “factors out” the popular values when producing cardinality estimates for non-popular values.
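
In other words, you can spot the popular values in a height-balanced histogram by checking how many buckets each stored endpoint accounts for; here’s a sketch (again assuming the histogram is on t1.value):

select
        endpoint_value,
        endpoint_number
      - lag(endpoint_number, 1, 0) over (order by endpoint_number) buckets
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'VALUE'
order by
        endpoint_number
;

Any endpoint_value whose buckets figure exceeds 1 is popular – for this data set only the 29 (which covers buckets 9 and 10) would qualify.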

I won’t go into details of cardinality calculations here – the only point I want to make in this article is about the quality of the histogram. Oracle has identified 29, with 6 appearances, as the ONLY popular value, but when we look at the raw data we can see that 27 and 28 also appear 6 times, 35 appears 7 times, and 33 appears 8 times. Being spotted as “popular” has an enormous element of luck attached. From day to day a small change in the data, or a small change in the sample, could result in a significant change in plans because different values are recognised as the popular ones, despite a relatively insignificant change in the data.

Hybrid Histograms

The new 12c histogram is labelled as a hybrid because it contains elements of a frequency histogram as well as the “bucket” approach of the height-balanced. We start with a sort operation, and we do count buckets (and again the bucket size will be “number of rows” / “number of buckets”), but we allow the buckets to be of variable size. It would be nice to have a mechanism for giving a dynamic demonstration – which is what I have when I do a presentation on this topic – but that’s not possible in this blog format, so I’ll just have to talk my way through it. Here’s the sorted list of numbers again:


  8 12 12 13 (13) [13] 15 16 16 17 (18) [18] 19 19 19 20 (20) 20 20 [20]
 21 22 22 22  23   23  24 24 25 26  26   26  27 27 27 27  27  27 28  28
 28 28 28 28  29   29  29 29 29 29  30   30  30 31 31 31  31  31 32  32
 32 33 33 33  33   33  33 33 33 34  34   34  35 35 35 35  35  35 35  36
 37 38 38 38  38   38  39 39 40 41  42   42  43 43 43 44  45  46 50  59

We’re counting in fives, so the first value we pick is the 13 in parentheses; but the next value along is also a 13, so we move the endpoint along by one, to the 13 in brackets. Next we count another five places, to reach the 18 in parentheses, and again find that we need to move the endpoint a little further, to the 18 in brackets. Again we count on five places, to reach the 20 in parentheses, and this time keep going to the end of the 20s (the 20 in brackets).

So far we’ve picked up an extra 13 in the first bucket, an extra 18 in the second bucket, and three extra 20s in the third bucket – so our buckets are allowed to be different sizes (with a minimum of 5, so far), and you can appreciate that by the time we get to the end of the sample something a bit funny is going to happen, because we’re going to have very little data left to spread between the last few buckets.
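
If it helps to see that walk written down as an algorithm, here’s a minimal PL/SQL sketch of the endpoint selection – it assumes the data is in t1.value, and it ignores the special treatment of the low and high values and the capture of the repeat counts:

declare
        type num_tab is table of number;
        v       num_tab;
        bktsz   pls_integer;
        i       pls_integer;
begin
        -- the sorted sample (here: the full 100-row data set)
        select value bulk collect into v from t1 order by value;

        bktsz := ceil(v.count / 20);    -- 100 rows / 20 buckets = 5
        i := bktsz;

        while i <= v.count loop
                -- push the endpoint along past any further duplicates of its value
                while i < v.count and v(i + 1) = v(i) loop
                        i := i + 1;
                end loop;
                dbms_output.put_line(
                        'endpoint_number: ' || i ||
                        '  endpoint_value: ' || v(i)
                );
                i := i + bktsz;         -- count on another five places
        end loop;
end;
/

Run against my 100 rows this prints the pairs 6/13, 12/18, 20/20, 26/23 and so on – exactly the endpoint_number / endpoint_value columns in the output below.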

There are two things to note. First, we did something special with the highest value in each bucket, so it would be nice to capture some information about that value. Secondly, we have variable-sized buckets, so we need to know something about the sizes of those buckets – we can’t just label them bucket 1, bucket 2 and so on. Here are the first few rows of the information that Oracle keeps for this histogram:

select
        endpoint_number,
        endpoint_value,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
order by
        endpoint_number
;

       EPN       EPV            REP
 --------- --------- --------------
         1         8              1
         6        13              3
        12        18              2
        20        20              5    ***
        26        23              2
        32        26              3
        38        27              6
        44        28              6
        50        29              6
        58        31              5
        69        33              8
        79        35              7
        86        38              5
...

You’ll notice that the endpoint_numbers are actually just counting their way through the sample (notice that, as before, we’ve also captured the first item (8) in the sample, to show that there’s more to the first bucket than just the 13s). Oracle has also added the endpoint_repeat_count to the histogram information. Combining these two pieces of information we can look at the 4th line of the output (flagged ***), compare it with the 3rd line, and say:
The bucket size is 8 … (20 – 12)
The high value in the bucket is 20 (the fact that the endpoint number and endpoint value were the same is a coincidence)
The number of 20s in the bucket is 5
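
The same lag() trick I used earlier to spot popular values turns the stored endpoint_numbers into bucket sizes; here’s a sketch:

select
        endpoint_value,
        endpoint_number
      - lag(endpoint_number, 1, 0) over (order by endpoint_number) bucket_size,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'VALUE'
order by
        endpoint_number
;

The row for endpoint_value 20 would show bucket_size 8 and endpoint_repeat_count 5 – the figures we just derived by hand.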

Using this strategy – which demands roughly the same level of resource usage as the height-balanced strategy, since it’s mostly about selecting and sorting a sample – we get far more information about our sample. Notice in particular how many “popular” values are obvious in my output: 27, 28, 29, 33, and 35 ALL show up as having more than a single bucket’s worth of data; and we even capture 20 and 38 with exactly one bucket’s worth of data each. Our ability to produce a larger number of reasonable cardinality estimates is vastly improved.

Not only that: because we have variable bucket sizes and a count of the rows holding the high value of each bucket, we also have a better estimate of the data distribution of “non-popular” values within buckets. Again I’m not going to go into all the details of the arithmetic the optimizer can do, but I hope you can appreciate the increase in information that the hybrid histogram has made available.

SQL Bits

You can, of course, enable SQL tracing while collecting stats to see what SQL the code produces – I’ve done this just to back up my claim that the basic costs of the height-balanced and hybrid histograms are very similar.
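
A sketch of the approach (the trace level is up to you – level 4 just adds bind values):

alter session set events '10046 trace name context forever, level 4';

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns value size 20'
        );
end;
/

alter session set events '10046 trace name context off';

Here are a couple of samples of the SQL that appeared in my trace files, height-balanced first: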

select
        min(minbkt), maxbkt,
        substrb(dump(min(val),16,0,32),1,120) minval,
        substrb(dump(max(val),16,0,32),1,120) maxval,
        sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
        sum(case when rep=1 then 1 else 0 end) unqrep
from    (
        select
                val, min(bkt) minbkt, max(bkt) maxbkt,
                count(val) rep, count(val)*count(val) repsq
        from    (
                select /*+ lots of hints  */
                        "LN100" val, ntile(200) over (order by "LN100") bkt
                from    sys.ora_temp_1_ds_616 t
                where   "LN100" is not null
                )
        group by val
        )
group by maxbkt
order by maxbkt

select
        substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
        (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
        substrb(dump(max(val) over(),16,0,64),1,20) maxval,
        substrb(dump(min(val) over(),16,0,64),1,20) minval
from    (
        select
              val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
              (case when freq > ((sum(freq) over())/15)  then 1  else 0 end) pop
        from  (
              select  /*+ lots of hints */
                      "VALUE"  val, count("VALUE") freq
              from
                      "TEST_USER"."T1" t
              where
                      "VALUE" is not null
              group by
                      "VALUE"
              )
        )
order by val
/

As you can see, the code is very different, but you can probably appreciate that the two workloads largely revolve around sorting and aggregating the sample in different ways. (These examples didn’t come from the data set shown in the rest of the article, by the way, and the first example comes from a case where the dbms_stats procedures had copied a sample of the original table into a global temporary table – the sys.ora_temp_1_ds_616 you can see in the first statement – before doing the complicated bit.)

Comments »

  3. Thanks for your very clear description of hybrid histograms!
It seems obvious that hybrid histograms will deliver better information to the optimizer than traditional height-balanced. BUT: I have to set “ESTIMATE_PERCENT” to “AUTO_SAMPLE_SIZE”. In 11g, Oracle uses a sample of around 5500 rows while creating the histogram. This can lead to “wrong” histograms and result in bad execution plans.
    Has this been improved in 12c?
    Would you recommend using “AUTO_SAMPLE_SIZE” to get the benefit of hybrid histograms?

    Comment by Reiner — January 7, 2014 @ 11:11 am BST Jan 7,2014 | Reply

