Then I stopped at the correct point. I had intended to go into the end-biased (compressed) histograms used by DB2 and the V-optimal (maxdiff) histograms used by SQL Server, both of which I mention in passing in the prose.

I will leave them to the references section, where readers can find further details of modern histogram research, since I still have to finish my PowerPoint presentation.

After Hotsos, I will update the paper with details on these histograms as well.

]]>Thanks for the link. I can’t help wondering how you found the time to do so much testing and writing. It’s an enormous piece of work.

]]>Above is the link to a first draft of a paper I have written on histograms. I am presenting it at Hotsos 2014. The paper deals with top-n histograms and hybrid histograms, and I go into the details of the implementation and the algorithms used to compute them.

Amit

]]>Thanks for the link.

I wonder if any of the readers will be inspired to try re-implementing the algorithm in PL/SQL – perhaps native compilation of PL/SQL would give a little edge to the performance.

Jonathan,

Above is the GitHub link. I have uploaded a Java implementation of the Approximate NDV algorithm and top-n frequency estimation. There is one main Java file where you can change the query and JDBC details to gather statistics and a top-n frequency histogram. It outputs the NDV estimates and the top-n frequencies along with their rowids.
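For readers who have not met the idea before, here is a minimal sketch of a synopsis-splitting distinct-count estimate of the kind usually described under the name "approximate NDV". It is not the code from the repository above and not Oracle's implementation. The class name `ApproxNdvSketch`, the `capacity` parameter and the `hash64` helper are all hypothetical, chosen only for illustration. Each value is hashed to 64 bits and the hash is stored in a bounded set. When the set overflows, half of the hash domain is discarded and the surviving count is scaled by a power of two.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names and structure are assumptions, not taken from any product.
final class ApproxNdvSketch {
    private final int capacity;                 // maximum number of hashes kept in the synopsis
    private final Set<Long> synopsis = new HashSet<>();
    private int level = 0;                      // number of times the hash domain has been halved

    ApproxNdvSketch(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // 64-bit FNV-1a over the UTF-8 bytes, followed by a bit-mixing finalizer.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // A hash is still tracked only if its lowest `level` bits are all zero.
    private boolean inDomain(long h) {
        return (h & ((1L << level) - 1)) == 0;
    }

    void add(String value) {
        long h = hash64(value);
        if (!inDomain(h)) {
            return;                             // falls in a discarded part of the domain
        }
        synopsis.add(h);
        while (synopsis.size() > capacity) {    // overflow: halve the domain and prune
            level++;
            synopsis.removeIf(x -> !inDomain(x));
        }
    }

    // Each surviving hash stands for 2^level distinct values.
    long estimate() {
        return (long) synopsis.size() << level;
    }

    public static void main(String[] args) {
        ApproxNdvSketch sketch = new ApproxNdvSketch(1024);
        for (int i = 0; i < 1_000_000; i++) {
            sketch.add("key-" + (i % 50_000));  // 50,000 distinct values, each repeated 20 times
        }
        System.out.println("estimated NDV: " + sketch.estimate());
    }
}
```

While the number of distinct values stays at or below `capacity`, the estimate is simply the exact count of distinct hashes. Beyond that it is a statistical estimate whose accuracy improves as `capacity` grows, at the cost of memory.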

I am working on porting it to C or C++ to see if I can make it more efficient. Currently it takes about 8-9 minutes for 70-80 million records, which is roughly 3-4 times slower than dbms_stats. Oracle’s implementation simply injects a row source and runs much closer to the data, so I don’t think I can match its performance, but I am still trying to improve mine.

If ROWID can be replaced with another key, such as the clustered index key in SQL Server, this can be used with any database that has a JDBC driver.

Amit

]]>