It has taken much longer than I anticipated to get around to writing part 3 of this mini-series on what Oracle has done about histograms in 12c.

In * part 1* I gave a thumbnail sketch of the three types of histogram available in 12c

In

*I described in some detail the improvements in performance and accuracy for the frequency and top-frequency histograms*

**part 2**In part 3 of this mini-series I’ll be describing how the implementation of the “hybrid” histogram that Oracle produces if the “approximate NDV” mechanism has been enabled and you’ve left the ** estimate_percent **to

**. There is little difference between the work needed to create a hybrid histogram and the work needed to generate the old “height-balanced” histogram, but the degree of information captured by the hybrid is much greater than that of the height-balanced.**

*auto_sample_size*### Starting data set

Here’s a list of 100 numbers, holding 37 distinct values, which I’m going to use to show how Oracle generates the two types of histogram.

8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20 21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32 32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36 37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

You’ll note that I’ve sorted the data into order – this is (in effect) the first step of generating either type of histogram, and a reason why Oracle typically selects a small sample of the raw data as the basis for the histogram: sorting is expensive. imagine, then, that I’ve asked Oracle to create a histogram with 20 buckets. Given that there are 37 distinct values I don’t have enough buckets for a frequency histogram so pre-12c Oracle would have to create a height-balanced histogram.

Since we want 20 buckets, and have 100 rows, Oracle would calculate the bucket size as 5, and walk through the sorted sample extracting every 5th bucket, numbering them from 1 to 20; since the first value in the sample (8) doesn’t match the first bucket “endpoint value” (13) Oracle would also select that and number it as 0. So our selection now looks like this:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (endpoint_number) 8 13 17 19 20 23 26 27 28 29 29 31 32 33 34 35 36 38 41 43 59 (endpoint_value)

You’ll notice, however, that the value 29 appears twice in our list – this is significant in two ways: any value that appears more than once in the selection is recognised as a “popular” value, and Oracle doesn’t store the duplications it stores only the highest endpoint_number for that value. To take advantage of the histogram Oracle “counts buckets” when producing cardinality estimates for popular values, and “factors out” the popular values when producing cardinality estimates for non-popular values.

I won’t go into details of cardinality calculations here – the only point I want to make in this article is about the quality of the histogram. Oracle has identified 29, with 6 appearances, as the ONLY popular value, but when we look at the raw data we can see that 27 and 28 also appear 6 time, 35 appears 7 times, and 33 appears 8 times. Being spotted as “popular” has an enormous element of luck attached. From day to day a small change in the data, or a small change in the sample, could result in a significant change in plans because different values are recognised as the popular ones despite a relatively insignificant change in the data.

### Hybrid Histograms

The new 12c histogram is labelled as a hybrid because it contains elements of a frequency histogram as well as the “bucket” approach of the height-balanced. We start with a sort operation, and we do count buckets (and again the bucket size will be “number of rows” / “number of buckets” – but we allow the buckets to be variable size. It would be nice to have a mechanism for giving a dynamic demonstration – which is what I have when I do a presentation on this topic – but it’s not possible on this blog format, so I’ll just have to talk my way through it. Here’s the sorted list of numbers again:

8 12 12 13 (13) [13] 15 16 16 17 (18) [18] 19 19 19 20 (20) 20 20 [20] 21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32 32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36 37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

We’re counting in fives, so the first value we pick is the (parenthesised) 13. but the next value along is also a 13, so we move the endpoint along by one (to the [bracketed] 13). Next we count another five places, to reach the 18 in (brackets), and again find that we need to move the endpoint a bit further to the 18 in [brackets]. Again we count on five places to reach the 20 in brackets, and keep going to the end of the 20’s.

So far we’ve picked up an extra 13 in the first bucket, an extra 18 in the second bucket, and three extra 20’s in the third bucket – so our buckets are allowed to be different sizes (with a minimum of 5, so far) and you can appreciate that by the time we get to the end of the sample something a bit funny is going to happen because we’re going to have very little data left to spread between the last few buckets.

There are two things to note – we did something special about the highest value in each bucket – so it would be nice to capture some information about that value – we have variable sized buckets – so we need to know something about the sizes of those buckets, we can’t just label them bucket 1, bucket 2 and so on. Here are the first few rows of the information that Oracle keeps for this histogram:

select endpoint_number, endpoint_value, endpoint_repeat_count from user_tab_histograms where table_name = 'T1' order by endpoint_number ; EPN EPV REP --------- --------- -------------- 1 8 1 6 13 3 12 18 2 20 20 5 *** 26 23 2 32 26 3 38 27 6 44 28 6 50 29 6 58 31 5 69 33 8 79 35 7 86 38 5 ...

You’ll notice that the endpoint_numbers are actually just counting their way through the sample (notice that, as before, we’ve also captured the first item (8) in the sample to show that there’s more to the first bucket than just the 13’s). Oracle has also added the endpoint_repeat_count to the histogram information. Combining these two pieces of information we can look at the 4th line of the output, comparing it with the 3rd line, and say:

The bucket size is 8 … (20 – 12)

The high value in the bucket is 20 (the fact that the endpoint number and endpoint value were the same is a coincidence)

The number of 20’s in the bucket is 5

Using this strategy – which runs at about the same level of resource usage as the height-balanced strategy since it’s mostly about selecting and sorting a sample – we get far more information about our sample. Notice in particular how many “popular” values are obvious in my output: 27, 28, 29, 33, and 35 ALL show up as having more than a single bucket’s worth of data; and we even capture 20 and 38 with exactly one bucket’s worth of data. Our ability to produce a larger number of reasonable cardinality estimates is vastly improved.

Not only that, because we have variable bucket sizes and the number of rows of the high value for each bucket, we also have a better estimate of the data distribution of “non-popular” values within buckets. Again I’m not going to go into all the details of the arithmetic the optimizer can do, but I hope you can appreciate the increase in information that the hybrid histogram has made available.

### SQL Bits

You can, of course, enable SQL tracing while collecting stats to see what SQL the code produces. I’ve done this just to back up my claim that the basic cost of the height-balanced and hybrid histograms are very similar. Here are a couple of samples of the SQL used, height-balanced first:

select min(minbkt),maxbkt, substrb(dump(min(val),16,0,32),1,120) minval, substrb(dump(max(val),16,0,32),1,120) maxval, sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv, sum(case when rep=1 then 1 else 0 end) unqrep from ( select val, min(bkt) minbkt, max(bkt) maxbkt, count(val) rep, count(val)*count(val) repsq from ( select /*+ lots of hints */ "LN100" val, ntile(200) over (order by "LN100") bkt from sys.ora_temp_1_ds_616 t where "LN100" is not null ) group by val ) group by maxbkt order by maxbkt select substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv, (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq, substrb(dump(max(val) over(),16,0,64),1,20) maxval, substrb(dump(min(val) over(),16,0,64),1,20) minval from ( select val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv, (case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop from ( select /*+ lots of hints */ "VALUE" val, count("VALUE") freq from "TEST_USER"."T1" t where "VALUE" is not null group by "VALUE" ) ) order by val /

As you can see, the code is very different, but you can probably appreciate that the two workloads largely revolve around sorting and aggregate the sample in different ways. (These examples didn’t come from the data set shown in the rest of the article by the way, and the first example comes from a case where the dbms_stats procedures had copied a sample of the original table into a global temporary table called before doing the complicated bit.)

### Update Jan 2018

There is a problem with the way in which Oracle handles the high-value when creating a hybrid (or Top-N) histogram – it replaces the top bucket with a bucket holding the high-value for the column. This can have a huge impact on execution plans if the a very popular value was the highest value captured in the sample. There is a separate blog note on this problem – the problem will be fixed in 12.2 with a probable backport to 12.1.