In part 2 of this mini-series I’ll be describing the new mechanism for the simple frequency histogram and the logic of the Top-N frequency histogram. In part 3 I’ll be looking at the new hybrid histogram.

You need to know about the * approximate NDV* before you start examining the 12c implementation of the frequency and top-frequency histograms – but there’s a thumbnail sketch at the end of the posting if you need a quick reminder.

### Simple Frequency Histograms

To allow for collection of simple frequency histogram – record the first rowid for each hash value generated and count the number of times the hash value is generated. If, by the end of the table you have no more than the requested (default 254, max 2,000) distinct hash values you can look up the actual values with a query by rowid.

**Outstanding question** – what if two values hash to the same result ? I can’t tell until I get two values that give the same hash; but I assume the code must check for collisions, so perhaps Oracle keeps a linked list with each hash value showing the original value (with its first rowid) so that in the extremely unlikely event of a collision it can keep a sample rowid for the second (and subsequent) values, and keep separate counts.

Note that the work done to generate a simple histogram requires extra memory and a little extra CPU than the basic approximate NDV – but the marginal cost is very small compared to the cost of the tablescan that is required to generate the approximate NDV.

### Top-N Frequency Histograms

If the number of hash values exceeds the number of requested buckets Oracle would traditionally have to fall back to a “height-balanced” histogram; but in 12c Oracle can recognise the special case where MOST of the data falls into a relatively small number of buckets leaving a small amount of data that spans a large number of distinct values. If we call the small amount of data the “non-popular” data, then “small” means the unpopular data will fit into a single bucket.

To give a concrete example – assume you have 100,000 rows in a table, and in one column you have 5,000 distinct values, where 95 of those values account for more than 99,000 of the rows; if you ask for a histogram of 100 buckets, Oracle will check to see if the top 100 * [corrected from 99 - see comment]* popular values account for more than 99/100ths of the data; if you ask for a histogram of 200 buckets Oracle will check to see if the top 200

*popular values account for more than 199/200ths of the data. If the non-popular data accounts for less than one “average” bucket then Oracle will create a frequency histogram that holds the low value, the high value, and N-2 other values, which will be the most popular values.*

**[corrected from 199]**The low and high values are needed for “out of range” checks for predicates, and if they are non-popular values then Oracle does a little bit of fudging around the edges to include them. (They are available from the standard min/max parts of the stats gathering query – but Oracle seems to record them in the histogram with a count of 1 if they aren’t popular values).

If the total number of distinct values is less than the 16,384 that Oracle uses as its limit for the approximate NDV then including the count for each hash value allows it to sort the frequencies in order and decide whether a small enough number of hash values covers a large enough percentage of the data to allow for the production of a Top-N histogram.

**Outstanding Question** – If the total number of distinct values is greater than the 16,384 that Oracle uses as its limit, then the hash table will have split (at least once). But Oracle can still check the sorted counts of the hash values it has got to see if a small enough number of hash values covers a large enough percentage of the WHOLE data set, because the basic stats gathering query includes a count of the rows in the table. However, I think that something more sophisticated than this is going on because I ran a number of tests that should (probably) have discarded half the popular data – and yet Oracle managed to produce a Top-N histogram for every test where it was appropriate.

Again the time to produce the histogram is similar to the time needed to do tablescan that generates the min, max, and count values for the column under the basic approximate NDV strategy. The important thing about this type of histogram is that in many cases it’s likely to be much more **accurate **and **stable **than the height-balanced histogram you would have got with earlier versions of Oracle.

Interestingly, I’ve been saying for many years that if you want stability when your data has a large number of distinct values then the best strategy is to code up(with dbms_stats.set_index_stats et. al.) a frequency histogram that captures the most popular 252 values, the low value, the high value, and a fake density (or a fake for the lowest frequency you create that allows Oracle to model the non-popular data correctly).

### Warning:

Although simple frequency histograms can be more accurate because they allow up to 2,000 buckets (though you probably don’t need that many in most cases), and although Top-N frequency histograms are likely to be much better than height-balanced histograms – don’t forget that your choice of WHEN to create the histogram can be the most important part of histogram collection. There’s no point in collecting a histogram at 10:00 pm that tells you about a * status* column if the “interesting” data in that column appears between 9:00 am and 5:00 pm and has all disappeared by 6:00 pm – your histogram would be an accurate picture of the wrong data and could therefore lead to unsuitable execution plans during the day.

**Footnote:**

Approximate NDV for a column in a nutshell (for a full description see * Amit Poddar’s paper*): for

*in the table (we do not use a sample for this), hash the stored value to something between 0 and 2^64. When the number of different hash values recorded reaches 16,384 discard half the hash values based on the setting of bottom bit. Continue, but keeping only half the hash values, until you again have recorded 16,384 different values, discard half the hash values based on the setting of the bottom 2 bits. Repeat until you get to the end of table. Count the number of different hash values recorded and double once for each bottom bit you’ve discarded.*

**every single row**
“I think that something more sophisticated than this is going on because I ran a number of tests that should (probably) have discarded half the popular data – and yet Oracle managed to produce a Top-N histogram for every test where it was appropriate.”

One way to do it efficiently would be to keep the top n (number of buckets requested) rowids and frequencies always in memory during every split (this can be implemented efficiently with correct data structures and it should not take much memory since the maximum number of buckets is 2000) and use this data structure to build top-N frequency histogram and use the hash table to calculate NDV.

Also looks like they changed the APPROXIMATE NDV in the STATS step to “OPTIMIZER STATISTICS GATHERING” this signifies that this area is under active development, maybe next we will see something like this for height based/hybrid histograms.

Comment by poddar007 — July 31, 2013 @ 6:11 pm BST Jul 31,2013 |

Amit,

Excellent analysis. The key, as you say, is that the worst case is an extra 2,000 (hash) values above the 16,384 that could be held to represent the rest of the table. (I’m a little embarrassed that I didn’t think of that myself).

Comment by Jonathan Lewis — July 31, 2013 @ 6:54 pm BST Jul 31,2013 |

There are some interesting ramifications for storing synopses, though, and handling raw values that collide on their hash values.

If you find time to write another paper I’m sure everyone would love to see it.

Had you noticed, by the way, that you can now generate synopses for a simple table so that you can exchange the table with a partition of a partitioned table and exchange the two sets of synopses at the same time.

Comment by Jonathan Lewis — July 31, 2013 @ 7:12 pm BST Jul 31,2013 |

Had you noticed, by the way, that you can now generate synopses for a simple table so that you can exchange the table with a partition of a partitioned table and exchange the two sets of synopses at the same time.

Yes I saw that in the documentation. If you read the paper by gibbons(2002) which all this is based on there is a section of maintaining the synopses incrementally in batches i.e. collect the DML changes and apply it to the synopsis in batches. This raises an interesting option of maintaining the synopsis (NDV) incrementally without gathering statistics every night. This way one can maintain all the statistics including frequency histogram incrementally. The only reason to scan the whole table was that one cannot maintain NDV incrementally.

Comment by poddar007 — July 31, 2013 @ 7:25 pm BST Jul 31,2013

I just did a quick 10832 trace, it looks like oracle is doing what I suggested. It seems to maintain a table of top-n( n = number of buckets requested ) rowid and frequencies. During the read it maintains this table (it refers to this table as the rowid table allocated under top-n structures). Maintaining at most 2000 top-N values can be easily maintained efficiently. At the end of table it calculates whether frequency or top-n histogram is feasible. If yes then it returns topn-count and top-n values(rowid, frequency). If no then top n related data is not returned.

Comment by poddar007 — July 31, 2013 @ 7:20 pm BST Jul 31,2013 |

Amit,

It crossed my mind to wonder whether Oracle does a concealed “as at SCN” when it reads back the values using those rowids, or taking some other action to make the scan and the row recall read-consistent. If not then it leaves a window of opportunity for someone to update the value (or delete the row) that some of those rowids are pointing to.

Comment by Jonathan Lewis — August 3, 2013 @ 11:06 am BST Aug 3,2013

Jonathan,

It does not look like there is any hidden flash back query involved in reading values back from rowid. Only generating the TOP-N results is done in C code rest of the processing is done in PL/SQL (dbms_stats). I don’t see any special hidden flashback query mechanism involved. That said gathering statistics was never built to be 100% accurate even before older version. For example for gathering equi-depth (height balanced) histogram the process generally is

a) Collect sample of data in temporary table

b) Run query against the temp table

c) Put records in optimizer history

d) Put statistics in stats table.

Here the data can be changed after step a) in a different session.

This is more evident with oracle moving towards probabilistic algorithms for gathering statistics. Probabilistic algorithm only puts a upper bound on the error.

Comment by poddar007 — August 5, 2013 @ 3:16 am BST Aug 5,2013

Jonthan,

I said

“One way to do it efficiently would be to keep the top n (number of buckets requested) rowids and frequencies always in memory during every split”

This is incorrect, it will not work since the value with highest frequency during the split may not be the value with highest frequency over all. Moreover values with low frequencies at the time of split may end up having high frequencies over all.

Consider following data

a,a,a,b,b,b,c,a,a,b,a,b,c,c,c,c,c,c,c,c,c,c,c,c,c,c

Lets assume that we are trying to calculate top two frequencies and we will split on 3rd distinct value. On seeing the first “c” we will need to split at that point we will discard c since its frequency is 1 but if you see in the data c has maximum frequency. Sorry I jumped the gun.

This is also a well studied problem in academia. The two best algorithm out there are Count-Min sketch algorithm and Count sketch algorithm. Of this the best option seems to be Count sketch algorithm.

Please see the paper

http://dl.acm.org/citation.cfm?id=684566

10832 trace does not detail the topn calculation, so there is no way to say for sure oracle is using this, but there is a sentence in the trace file “Computing

approximatetopn from native topn structures…”. This means that oracle is using some kind of probablistic algorithm for this. The only ones I found in academic papers is the Count-Min sketch and Count Sketch algorithm.Amit

Comment by poddar007 — August 12, 2013 @ 7:56 pm BST Aug 12,2013 |

It looks like in future optimizer statistics collection will be completely done by probabilistic algorithms. There are papers that describe creating equi-depth histograms for data streams (google for bar-splitting). Its also evident since Oracle has renamed the plan step from APPROXIMATE NDV to OPTIMIZER STATISTICS COLLECTION. There are also stream algorithm to calculate F2(Second frequency moment) which is used to calculate density in case of histograms.

Comment by poddar007 — August 12, 2013 @ 9:57 pm BST Aug 12,2013

Hi Jonathan,

“Note that the work done to generate a simple histogram requires extra memory and a little extra CPU than the basic approximate NDV – but the marginal cost is very small compared to the cost of the tablescan that is required to generate the approximate NDV”

It seems like there is also another improvement by creating simple frequency based histograms in Oracle 12c. The separate histogram queries are not based on the sample anymore – Oracle keeps track of the different rowids (for each column value) and it seems like it can be grouped and counted on-the-fly without a separate SQL clause. Unfortunately i have not much spare time to research and figure out all of these things yet, but maybe you already know the details and can share them with us.

Here is the tiny test case on 10g, 11g and 12c. In all three cases a simple frequency based histogram is created for all three columns (a,b,c).

The difference between 10g R2 and 11g R2 is logical and already explained very well by NDV with no explicit count(distinct) and sum(sys_op_opnsize) operation. However in both cases the histograms are still gathered on a sample (17.7419354839) of table HTEST and a separate query is executed for every column that needs a histogram. (in my simple case with no additional sample bump up cycle of course)

In 12c no sample is used for histograms anymore. The base table is accessed by rowid (with every single column value) by each separate query for every column that needs a histogram and the count(*) and group operations also gone. Another interesting part is, that there is a comment section like “TOPN,NIL,NIL,…, U5″ by gathering basic column statistics with help of NDV, but no TOP-N frequency histogram is collected later on (it is just a simple frequency histogram).

Best Regards

Stefan

Comment by Stefan Koehler — July 31, 2013 @ 6:45 pm BST Jul 31,2013 |

Stefan.

“The separate histogram queries are not based on the sample anymore – Oracle keeps track of the different rowids (for each column value) and it seems like it can be grouped and counted on-the-fly without a separate SQL clause”That’s basically what I said – although I didn’t explain what Oracle wasn’t doing any more. (I’ve added a note to the Approximate NDV thumbnail to remind people that it uses the whole dataset, though, rather than sampling.)

Assumption: the rationale for the TOPN in the “hints” for the stats collection when you don’t get a TOP-N histogram is simply that Oracle doesn’t know whether or not it’s going to get a frequency histogram or a TOP-N histogram in advance, so the basic code path is the same for both. In your case you had 3 values, and a request for 5 buckets (visible in the U5 part of the hint) – so the code discovers that the number of distinct values is less than (a) 2,000 and (b) the number of requested buckets, and doesn’t do the extra little bits needed for a complete TOP-N.

Comment by Jonathan Lewis — July 31, 2013 @ 7:10 pm BST Jul 31,2013 |

Hi Jonathan,

“That’s basically what I said – although I didn’t explain what Oracle wasn’t doing any more.”

Oh ok, i am really sorry. I misinterpreted your statement, because i have not seen it in context to the first part of “Simple Frequency Histograms”. Sometimes i shall suppress the “Outstanding question” part first and just read through the whole sub topic. Your assumption for the “TOPN,NIL,NIL,…, U5″ part sounds also reasonable.

Once again – sorry :-))

Regards

Stefan

Comment by Stefan Koehler — August 1, 2013 @ 10:04 am BST Aug 1,2013 |

Stefan,

No need to apologise – mentioning the (heavy-duty) stuff that no longer happens is, in this case, just as valuable as describing what does happen. The absence of all those ‘create temporary table’, insert as select, aggregate steps is a significant contributor to the benefit of the new mechanisms.

Comment by Jonathan Lewis — August 3, 2013 @ 10:44 am BST Aug 3,2013

I just did a quick 10832 trace, it looks like oracle is doing what I suggested. It seems to maintain a table of top-n( n = number of buckets requested ) rowid and frequencies. During the read it maintains this table (it refers to this table as the rowid table allocated under top-n structures). Maintaining at most 2000 top-N values can be easily maintained efficiently. At the end of table it calculates whether frequency or top-n histogram is feasible. If yes then it returns topn-count and top-n values(rowid, frequency). If no then top n related data is not returned.

Comment by poddar007 — July 31, 2013 @ 7:18 pm BST Jul 31,2013 |

Jonathan,

“if you ask for a histogram of 100 values, Oracle will check to see if the top 99 popular values account for more than 99/100ths of the data; if you ask for a histogram of 200 buckets Oracle will check to see if the top 199 popular values account for more than 199/200ths of the data”

I think the above is incorrect. From the documentation and my tracing it looks like the formula is that number of rows occupied by n (number of buckets requested) popular values should at least be equal to number of rows that will fit in (n-1) buckets. i.e. (rows occupied by popular values) >= (num_rows)(1 – 1/n) (where n is number of buckets requested)

Comment by poddar007 — July 31, 2013 @ 9:41 pm BST Jul 31,2013 |

for example

Lets assume

NDV: 91848

Buckets: 254

Total Rows: 3367553

Total frequency for top 254 popular values: 3278132

If we apply the logic you mentioned we get

3278132 accounts for (254/91848)*3367553 = 9313 (So we should get top-n histogram)

But oracle does not create top-n histogram since 3278132 < 3367553 (1 – 1/254)

Comment by poddar007 — July 31, 2013 @ 9:49 pm BST Jul 31,2013 |

Amit,

If we apply what I said (including the “bucket count -1″ error) the only scale factor involved would be 253/254, I don’t see where I’ve said anything that would imply the need for a scale factor of 254/91848.

Comment by Jonathan Lewis — August 3, 2013 @ 10:41 am BST Aug 3,2013 |

Amit,

You’re right, I did make an error (which I’ve now corrected in the text), where I say “top 99 popular value” and “top 199 popular values”. That should be top 100 and top 200 respectively.

The final formula you’ve given equates to:

(rows occupied by non-popular values) < (num_rows)/n, which is the formal statement my informal description:“small” means the unpopular data will fit into a single bucketComment by Jonathan Lewis — August 3, 2013 @ 10:37 am BST Aug 3,2013 |

Jonathan,

I have been able to finish up a piece of java code which takes a query as input and runs through approximate NDV and count sketch algorithms to compute the number of distinct values and TOP-N (N is a input parameter) frequency count. It also reports the error percent for each estimate.

This code has nothing to do with Oracle, it can be run against any database for which there is a JDBC driver available. I will be putting it on github. If you think it will be helpful to others I can put a github link here.

I am also in process of writing a paper detailing the TOP-N algorithms which I am presenting at hotsos 2014.

Amit.

Comment by Amit Poddar — December 24, 2013 @ 4:08 pm BST Dec 24,2013 |

Amit,

Thanks for the information – I’d be interested, and I’m sure a number of readers would like to learn more.

I won’t be at Hotsos this (2014) year – but I’ll bet you get a good audience for the paper.

Comment by Jonathan Lewis — December 31, 2013 @ 12:50 pm BST Dec 31,2013

https://github.com/amitpoddar/Sketch

Jonathan,

Above is the github link. I have uploaded java implementation of Approximate NDV algorithm and top-n frequency estimation. There is one main java file where you can change the query and jdbc details to gather statistics and top-n frequency histogram. It outputs the NDV estimates and top-n frequencies along with their rowids.

I am working on porting it to C or C++ to see if I can make it more efficient. Currently it takes about 8-9 minutes for about 70-80 million records which is approximately 3-4 times slower than dbms_stats. But oracle’s implementation just injects a row source and is implemented much closer to the data so I don’t think I can match their performance but I am trying to improve it still.

If we can replace ROWID with another key like cluster indexed key in sql server, this can be used with any database that has a jdbc driver.

Amit

Comment by poddar007 — January 16, 2014 @ 2:03 am BST Jan 16,2014

Amit,

Thanks for the link.

I wonder if any of the readers will be inspired to try re-implementing the algorithm in pl/sql – perhaps native compilation of pl/sql would give a little edge to the performance.

Comment by Jonathan Lewis — January 16, 2014 @ 8:41 am BST Jan 16,2014

Hi,

https://skydrive.live.com/redir?resid=C2278BDC0EF5D792!311&authkey=!APGj3vMVNLDQWbo&ithint=file%2c.docx

Above is the link to a first draft paper I have written on histograms. I am presenting this hotsos 2014. This paper deals with top-n histograms and hybrid histograms. I go into details of implementation and algorithms used to compute these histograms.

Amit

Comment by amitpoddar — February 9, 2014 @ 4:29 am BST Feb 9,2014

Amit,

Thanks for the link. I can’t help wondering how you found the time to do so much testing and writing. It’s an enormous piece of work.

Comment by Jonathan Lewis — February 9, 2014 @ 6:48 pm BST Feb 9,2014

“It’s an enormous piece of work”

Then I stopped at the correct point. I had the intention of going into end biased (compressed) histograms used by DB2 and V-optimal (maxdiff) histograms used by MSSQL server which I have mentioned in passing in the prose.

I will leave them to the references section where readers can find further details of modern histogram research since I still have to finish my power point presentation.

After hotsos, I will update the paper with details on these histograms also.

Comment by amitpoddar — February 9, 2014 @ 8:26 pm BST Feb 9,2014

Reblogged this on Thoughts from James H. Lui.

Comment by jhlui1 — September 24, 2013 @ 7:00 pm BST Sep 24,2013 |

[…] in 12c. In part 1 I gave a thumbnail sketch of the three types of histogram available in 12c In part 2 I described in some detail the improvements in performance and accuracy for the frequency and […]

Pingback by 12c Histograms pt.3 | Oracle Scratchpad — October 9, 2013 @ 8:14 pm BST Oct 9,2013 |

[…] Part 2 – Frequency histograms and Top-N histograms […]

Pingback by Histograms | Oracle Scratchpad — October 16, 2013 @ 5:22 pm BST Oct 16,2013 |

[…] “frequency” and “height balanced” ones [posts by Jonathan Lewis: part 1, part 2, part 3 (with a very clear example of hybrid […]

Pingback by DB Oriented — January 4, 2014 @ 2:23 pm BST Jan 4,2014 |