In part 2 of this mini-series I’ll be describing the new mechanism for the simple frequency histogram and the logic of the Top-N frequency histogram. In part 3 I’ll be looking at the new hybrid histogram.
You need to know about the approximate NDV before you start examining the 12c implementation of the frequency and top-frequency histograms – but there’s a thumbnail sketch at the end of the posting if you need a quick reminder.
Simple Frequency Histograms
To allow for collection of simple frequency histogram – record the first rowid for each hash value generated and count the number of times the hash value is generated. If, by the end of the table you have no more than the requested (default 254, max 2,000 [correction: 2,048]) distinct hash values you can look up the actual values with a query by rowid.
Outstanding question – what if two values hash to the same result ? I can’t tell until I get two values that give the same hash; but I assume the code must check for collisions, so perhaps Oracle keeps a linked list with each hash value showing the original value (with its first rowid) so that in the extremely unlikely event of a collision it can keep a sample rowid for the second (and subsequent) values, and keep separate counts.
Note that the work done to generate a simple histogram requires extra memory and a little extra CPU than the basic approximate NDV – but the marginal cost is very small compared to the cost of the tablescan that is required to generate the approximate NDV.
Top-N Frequency Histograms
If the number of hash values exceeds the number of requested buckets Oracle would traditionally have to fall back to a “height-balanced” histogram; but in 12c Oracle can recognise the special case where MOST of the data falls into a relatively small number of buckets leaving a small amount of data that spans a large number of distinct values. If we call the small amount of data the “non-popular” data, then “small” means the unpopular data will fit into a single bucket.
To give a concrete example – assume you have 100,000 rows in a table, and in one column you have 5,000 distinct values, where 95 of those values account for more than 99,000 of the rows; if you ask for a histogram of 100 buckets, Oracle will check to see if the top 100 [corrected from 99 – see comment] popular values account for more than 99/100ths of the data; if you ask for a histogram of 200 buckets Oracle will check to see if the top 200 [corrected from 199] popular values account for more than 199/200ths of the data. If the non-popular data accounts for less than one “average” bucket then Oracle will create a frequency histogram that holds the low value, the high value, and N-2 other values, which will be the most popular values.
The low and high values are needed for “out of range” checks for predicates, and if they are non-popular values then Oracle does a little bit of fudging around the edges to include them. (They are available from the standard min/max parts of the stats gathering query – but Oracle seems to record them in the histogram with a count of 1 if they aren’t popular values).
If the total number of distinct values is less than the 16,384 that Oracle uses as its limit for the approximate NDV then including the count for each hash value allows it to sort the frequencies in order and decide whether a small enough number of hash values covers a large enough percentage of the data to allow for the production of a Top-N histogram.
Outstanding Question – If the total number of distinct values is greater than the 16,384 that Oracle uses as its limit, then the hash table will have split (at least once). But Oracle can still check the sorted counts of the hash values it has got to see if a small enough number of hash values covers a large enough percentage of the WHOLE data set, because the basic stats gathering query includes a count of the rows in the table. However, I think that something more sophisticated than this is going on because I ran a number of tests that should (probably) have discarded half the popular data – and yet Oracle managed to produce a Top-N histogram for every test where it was appropriate.
Again the time to produce the histogram is similar to the time needed to do tablescan that generates the min, max, and count values for the column under the basic approximate NDV strategy. The important thing about this type of histogram is that in many cases it’s likely to be much more accurate and stable than the height-balanced histogram you would have got with earlier versions of Oracle.
Interestingly, I’ve been saying for many years that if you want stability when your data has a large number of distinct values then the best strategy is to code up(with dbms_stats.set_index_stats et. al.) a frequency histogram that captures the most popular 252 values, the low value, the high value, and a fake density (or a fake for the lowest frequency you create that allows Oracle to model the non-popular data correctly).
Although simple frequency histograms can be more accurate because they allow up to 2,000 buckets (though you probably don’t need that many in most cases), and although Top-N frequency histograms are likely to be much better than height-balanced histograms – don’t forget that your choice of WHEN to create the histogram can be the most important part of histogram collection. There’s no point in collecting a histogram at 10:00 pm that tells you about a status column if the “interesting” data in that column appears between 9:00 am and 5:00 pm and has all disappeared by 6:00 pm – your histogram would be an accurate picture of the wrong data and could therefore lead to unsuitable execution plans during the day.
Approximate NDV for a column in a nutshell (for a full description see Amit Poddar’s paper): for every single row in the table (we do not use a sample for this), hash the stored value to something between 0 and 2^64. When the number of different hash values recorded reaches 16,384 discard half the hash values based on the setting of bottom bit. Continue, but keeping only half the hash values, until you again have recorded 16,384 different values, discard half the hash values based on the setting of the bottom 2 bits. Repeat until you get to the end of table. Count the number of different hash values recorded and double once for each bottom bit you’ve discarded.