I was going to drop a comment here but have turned it into a blog post

https://hourim.wordpress.com/2016/01/20/natural-and-adjusted-hybrid-histogram/

I hope this answers your question

Best regards

We can see that ‘detail’ when tracing column statistics gathering (as exposed by http://www.pythian.com/blog/options-for-tracing-oracle-dbms_stats/) with the following:

SQL> exec dbms_stats.set_global_prefs('TRACE',to_char(1+16));
SQL> exec dbms_stats.gather_table_stats(ownname=>'',tabname=>'HISTOGRAM',method_opt=>'FOR COLUMNS C2 SIZE 3');

It shows that the last bucket (value 15) has been removed and replaced by the maximum value:

DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15
DBMS_STATS: removal_count: 1 total_nonnull_rows: 12 mnb: 3
DBMS_STATS: adjusted coverage: .667
DBMS_STATS: hist_type in exec_get_topn: 2048 ndv:6 mnb:3
DBMS_STATS: Evaluating frequency histogram for col: "C2"
DBMS_STATS: number of values = 4, max # of buckects = 3, pct = 100, ssize = 12
DBMS_STATS: Trying to convert frequency histogram to hybrid
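
The "Typ=2 Len=2" items in the trace are dumps of Oracle's internal NUMBER bytes, so "c1,10" is not the value 10. As a minimal sketch, assuming the usual layout for positive numbers (first byte is 0xC1 plus a base-100 exponent, each following byte is a base-100 digit plus one), the bytes can be decoded as below. The helper name decode_positive_int is hypothetical and only handles positive integers.

```python
def decode_positive_int(dump: str) -> int:
    """Decode the hex bytes of an Oracle NUMBER dump such as 'c1,10'.

    Assumption: positive integers only, which is all the trace needs.
    """
    raw = [int(tok, 16) for tok in dump.split(",")]
    exponent = raw[0] - 0xC1            # base-100 exponent of the first digit
    digits = [b - 1 for b in raw[1:]]   # each mantissa byte stores digit + 1
    if exponent < 0 or len(digits) > exponent + 1:
        raise ValueError("not a positive integer dump: " + dump)
    return sum(d * 100 ** (exponent - i) for i, d in enumerate(digits))


# The four dumps that appear in the traces of this thread
for dump in ("c1,6", "c1,8", "c1,10", "c1,15"):
    print(dump, "->", decode_positive_int(dump))
```

Under that assumption c1,6, c1,8, c1,10 and c1,15 decode to 5, 7, 15 and 20, which agrees with the statement that the bucket for 15 was removed and the maximum (20) added.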

The ‘adjusted coverage’ line may suggest that dbms_stats checks that the top frequencies still reach a minimum coverage after the adjustment. For example, if we gather stats with only two buckets we get:

DBMS_STATS: remove first bucket: Typ=2 Len=2: c1,8 add: Typ=2 Len=2: c1,6
DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,10 add: Typ=2 Len=2: c1,15
DBMS_STATS: removal_count: 2 total_nonnull_rows: 12 mnb: 2
DBMS_STATS: Abort top-n histogram, as the addition of min/max does not preserve the minimum coverage: .166667 vs. .5
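
As a rough check of the coverage idea, the sketch below computes the share of rows held by the n most frequent values and compares it with a threshold of 1 - 1/n. Two assumptions are made here: that column C2 holds the 12 values quoted in the original question, and that the minimum coverage is 1 - 1/n (which is consistent with the ".5" shown for two buckets). It does not try to reproduce the "adjusted coverage" figures in the trace, since how dbms_stats derives them after swapping in the min/max buckets is not shown.

```python
from collections import Counter

# Assumed content of column C2 (the 12 values quoted in the question)
rows = [5, 5, 7, 7, 7, 7, 10, 12, 15, 15, 15, 20]


def top_n_coverage(values, n):
    """Fraction of rows covered by the n most frequent distinct values."""
    counts = Counter(values)
    return sum(count for _, count in counts.most_common(n)) / len(values)


def threshold(n):
    """Assumed minimum coverage for a top-frequency histogram with n buckets."""
    return 1 - 1 / n


for n in (2, 3, 4):
    cov = top_n_coverage(rows, n)
    print(f"buckets={n} coverage={cov:.3f} threshold={threshold(n):.3f}")
```

Before any min/max adjustment, the top values cover 7, 9 and 10 of the 12 rows for 2, 3 and 4 buckets, so the unadjusted data passes the assumed threshold each time; it is only after the low and high values are forced in that the coverage can fall short.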

Regards,

Franck.

And because of this rule Oracle loses information: only 10 rows in the histogram and the number of distinct values becomes wrong.

I was just playing with a random data set: when something is wrong in a real case it might be due to such anomalies, even with a bigger number of rows and buckets (at the boundary between Top-Frequency and Hybrid).

I always expect to see a few oddities at the boundaries, and playing around with very small data sets and bucket counts is asking for oddities.

One detail I don’t seem to have in the article is that Oracle does want to keep track of the low and high values – and I think that that’s why the bucket of 2 rows has appeared. After that I can’t explain the counting errors. I have been able to produce a couple more anomalies by adding more rows (with values between 6 and 29) to your data set and then asking for histograms with fewer buckets than distinct values. I suspect something odd can happen as Oracle decides which bucket to eliminate to allow it to introduce a bucket for the low value.

Just a little question about Hybrid histograms. You wrote that the minimum number of values per bucket should be 5 (the bucket size) or bigger because of the variable size.

I have a simple case where Oracle chose to create a smaller bucket:

5 5 7 7 7 7 10 12 15 15 15 20

In this case, I got the following histogram using 3 buckets (with 4 buckets the column is eligible for a Top-Frequency histogram):

ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
              2              5                     2
              6              7                     4
             10             20                     1

The bucket size should be 4 (12 / 3); however, the first bucket seems to contain only the two “5” values.

If we look further, the second bucket contains the four “7” values and the last bucket contains one “20” plus three other values (10 - 6 = 4).

So the histogram contains only 10 values instead of 12.
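
The arithmetic can be checked with a small sketch: ENDPOINT_NUMBER is cumulative, so each bucket's row count is the difference between consecutive endpoint numbers. The tuples below are copied from the histogram output; the function name bucket_sizes is only illustrative.

```python
# (ENDPOINT_NUMBER, ENDPOINT_VALUE, ENDPOINT_REPEAT_COUNT) as reported
histogram = [(2, 5, 2), (6, 7, 4), (10, 20, 1)]
rows = [5, 5, 7, 7, 7, 7, 10, 12, 15, 15, 15, 20]


def bucket_sizes(hist):
    """Rows per bucket, derived from the cumulative ENDPOINT_NUMBER column."""
    sizes = []
    previous = 0
    for endpoint_number, _value, _repeat in hist:
        sizes.append(endpoint_number - previous)
        previous = endpoint_number
    return sizes


sizes = bucket_sizes(histogram)
print("bucket sizes:", sizes)
print("rows in histogram:", sum(sizes), "rows in table:", len(rows))
```

This gives bucket sizes of 2, 4 and 4, i.e. 10 rows accounted for against the 12 rows actually in the column.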

Do you have any idea why it decided to create a bucket with 2 values instead of one with 6 values (the “5” and “7”)?
