Then I stopped at the correct point. I had intended to go into the end-biased (compressed) histograms used by DB2 and the V-optimal (maxdiff) histograms used by MS SQL Server, which I mention only in passing in the prose.

I will leave them to the references section, where readers can find further details of modern histogram research, since I still have to finish my PowerPoint presentation.

After Hotsos, I will also update the paper with details on these histograms.

]]>Thanks for the link. I can’t help wondering how you found the time to do so much testing and writing. It’s an enormous piece of work.

]]>Above is the link to a first draft of a paper I have written on histograms. I am presenting it at Hotsos 2014. The paper deals with top-n histograms and hybrid histograms, and goes into the details of the implementation and the algorithms used to compute them.

Amit

]]>Thanks for the link.

I wonder if any of the readers will be inspired to try re-implementing the algorithm in PL/SQL; perhaps native compilation of PL/SQL would give a little edge to the performance.

Jonathan,

Above is the GitHub link. I have uploaded a Java implementation of the Approximate NDV algorithm and top-n frequency estimation. There is one main Java file where you can change the query and JDBC details to gather statistics and a top-n frequency histogram. It outputs the NDV estimates and the top-n frequencies along with their rowids.

I am working on porting it to C or C++ to see if I can make it more efficient. Currently it takes about 8-9 minutes for about 70-80 million records, which is approximately 3-4 times slower than dbms_stats. Oracle's implementation just injects a row source and runs much closer to the data, so I don't think I can match its performance, but I am still trying to improve mine.

If we can replace ROWID with another key, such as a clustered index key in SQL Server, this can be used with any database that has a JDBC driver.
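For readers who have not seen the idea behind approximate NDV, here is a minimal sketch of one common hash-and-halve approach. It is not the code from the GitHub repository; the class name `ApproxNdv`, the `capacity` parameter and the choice of mixing function are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical hash-based approximate NDV counter (illustrative, not the author's code). */
class ApproxNdv {
    private final int capacity;                      // max hashes kept before the sample is halved
    private final Set<Long> sample = new HashSet<>();
    private int level = 0;                           // number of halvings so far

    ApproxNdv(int capacity) {
        this.capacity = capacity;
    }

    // 64-bit mixer (the SplitMix64 finalizer) so that the low-order bits are well spread.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(Object value) {
        long h = mix(value.hashCode());
        // Only hashes whose lowest `level` bits are all zero belong to the current sample.
        if ((h & ((1L << level) - 1)) != 0) {
            return;
        }
        sample.add(h);
        // When the sample overflows, raise the level and drop roughly half of the hashes.
        while (sample.size() > capacity) {
            level++;
            final long mask = (1L << level) - 1;
            sample.removeIf(x -> (x & mask) != 0);
        }
    }

    // Each surviving hash stands for about 2^level distinct values.
    long estimate() {
        return (long) sample.size() << level;
    }
}
```

In a JDBC loop one would call `add(rs.getObject(col))` per row and read `estimate()` at the end. While the number of distinct values stays below `capacity`, the count is exact for values with distinct hash codes; beyond that it is an estimate whose accuracy depends on `capacity`.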

Amit

]]>Thanks for the information – I’d be interested, and I’m sure a number of readers would like to learn more.

I won’t be at Hotsos this (2014) year – but I’ll bet you get a good audience for the paper.

I have been able to finish a piece of Java code which takes a query as input and runs it through the approximate NDV and count sketch algorithms to compute the number of distinct values and the TOP-N (N is an input parameter) frequency counts. It also reports the error percentage for each estimate.

This code has nothing to do with Oracle; it can be run against any database for which a JDBC driver is available. I will be putting it on GitHub. If you think it will be helpful to others, I can put a GitHub link here.
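As a rough illustration of the count sketch idea mentioned above, the following is a minimal sketch of the textbook structure, not the code described in this comment. The class name `CountSketch`, the `depth` and `width` parameters and the seeding scheme are assumptions for the example.

```java
import java.util.Arrays;

/** Hypothetical count-sketch frequency estimator (illustrative only). */
class CountSketch {
    private final int depth;            // number of independent rows
    private final int width;            // counters per row
    private final long[][] counters;
    private final long[] seeds;         // one seed per row (arbitrary constants)

    CountSketch(int depth, int width) {
        this.depth = depth;
        this.width = width;
        this.counters = new long[depth][width];
        this.seeds = new long[depth];
        for (int i = 0; i < depth; i++) {
            seeds[i] = 0x9E3779B97F4A7C15L * (i + 1);
        }
    }

    // 64-bit mixer (the SplitMix64 finalizer).
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    private long rowHash(Object value, int row) {
        return mix(value.hashCode() ^ seeds[row]);
    }

    // Bit 0 of the row hash picks the sign; the remaining bits pick the bucket.
    void add(Object value) {
        for (int i = 0; i < depth; i++) {
            long h = rowHash(value, i);
            int bucket = (int) ((h >>> 1) % width);
            counters[i][bucket] += (h & 1L) == 0 ? 1 : -1;
        }
    }

    // Median of the signed per-row counters; choose an odd depth for a true median.
    long estimate(Object value) {
        long[] perRow = new long[depth];
        for (int i = 0; i < depth; i++) {
            long h = rowHash(value, i);
            int bucket = (int) ((h >>> 1) % width);
            perRow[i] = ((h & 1L) == 0 ? 1 : -1) * counters[i][bucket];
        }
        Arrays.sort(perRow);
        return perRow[depth / 2];
    }
}
```

A TOP-N list can then be maintained by keeping the N values with the largest `estimate()` seen so far, for example in a small heap. Larger `width` reduces collision noise and larger `depth` makes the median more robust, at the cost of memory and time per row.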

I am also in the process of writing a paper detailing the TOP-N algorithms, which I am presenting at Hotsos 2014.

Amit.

]]>