Regarding your explanation,

” … and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values … ”

“… the stats collection would report that there were only two distinct values in the table …” –

what puzzles me here was that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked “on purpose” to me…).

Right now I’ve had the possibility to get the “real counts” at the same time as performing the statistics gathering on a clone of the production database, and picking a column with a substantial difference between num_buckets and num_distinct I get e.g.:

sqlplus> select mycol, count(1) from myschema.mytable group by mycol order by mycol;

mycol COUNT(1)

——————— ———-

1 1072

2 4

3 15334

6 7536

7 3315

10 473

11 61

12 124

20 900

42 1

50 979

55 1

62 2

71 25619

82 5141

83 8708

84 1224

85 1429

99 3518

116 1807

118 8

119 28

126 2142

128 1

129 3324

139 431

69223

sqlplus> select endpoint_value, endpoint_number from dba_tab_histograms where owner=’myschema’ and column_name=’mycol’ order by endpoint_value;

ENDPOINT_VALUE ENDPOINT_NUMBER

————– —————

1 38

3 626

6 900

7 996

10 1006

11 1007

12 1009

20 1032

50 1068

71 1979

82 2172

83 2503

84 2549

85 2603

99 2729

116 2779

126 2865

129 2983

139 3005

sqlplus> select column_name, histogram, num_nulls, num_buckets, num_distinct, abs(num_distinct – num_buckets) diff, sample_size, last_analyzed from dba_tab_col_statistics where histogram in (‘FREQUENCY’,

‘HEIGHT BALANCED’) and owner=’myschema’ and table_name=’mytable’ order by histogram, diff desc, column_name; 2

COLUMN_NAME HISTOGRAM NUM_NULLS NUM_BUCKETS NUM_DISTINCT DIFF SAMPLE_SIZE LAST_ANALYZED

—————————— ————— ———- ———– ———— ———- ———– ——————–

…

mycol FREQUENCY 69223 19 26 7 3005 09-NOV-2010 10:39:48

sqlplus> select count(*) from myschema.mytable where mycol is not null;

COUNT(*)

———-

83182

I unfortunately don’t have the time to pursue this further with the other columns now, but I wonder might there be some algorithm like “if the count of a value is less than e.g. total_count / sample_size (or some proportion of this), don’t build a bucket for it” – here for example, the “most frequent” value that got no bucket has count 28, and the ratio total_count / sample_size is 27.68… (of course this is not really much data for a guess yet :-;)

Thanks again,

Sigrid

It’s a good question, as it’s in the right technical area for this blog, suitably general in content, and one that’s commonly asked and addressed on the web.

You didn’t give me the details of a specific example – but I’d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.

A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values. From time to time the stats collection would report that there were only two distinct values in the table – and it rarely managed to report every single value. It’s cases like this that you might want to write a program to create and fix some representative stats.

]]>I don’t know if it’s okay to ask a question that’s rather “peripherally related” here, but this being one of the more “basic” ones among your posts related to frequency histograms (and as I don’t find any related documentation on the net) I’ll just try :-;

Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, in the literature (also in Oracle’s Performing Tuning Guide) it always seems to say that the distinctive feature of a frequency histogram is that there’s one bucket per distinct value…

Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps being too infrequent?)

Thanks a lot in advance

Sigrid

Ignore my previous comment – it was just too early in the morning – you’re right, we need to floor() or trunc() the value.

]]>select to_char(floor(2454894.89011574), ‘FM999999999′) val from dual;

VAL

———-

2454894

select to_char(2454894.89011574,’FM99999999′) char_val from dual;

CHAR_VAL

———

2454895

According to the documentation:

“All number format models cause the number to be rounded to the specified number of significant digits.”

I wonder what settings effect this rounding or is it an outright bug. I checked Metalink and could find nothing relevant. For what it’s worth, when I do a “show” of my NLS parameters, the only ones set are:

nls_language AMERICAN

nls_length_semantics BYTE

nls_nchar_conv_excp FALSE

nls_territory AMERICA

Strange – are you sure you’re not seeing the effects of the default format for dates on your system ? Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.

]]>Had a problem with the expression above. For example, if you had the number 2454894.89011574 there appears to be some rounding to 2454895. I used the following and got the right values:

to_date(floor(endpoint_value) || ‘.’ ||

to_char(86400 * MOD(endpoint_value, 1), ‘FM999999999′)

,’J.sssss’)

Thanks for your work on this subject.

]]>This was my error (and a common error in the Oracle world) – I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated ‘date-only’ values, but the column (and histogram) could hold ‘date and time’ values.

In the case of ‘date and time’, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75

This is why your format model ran into error 1481 – it didn’t match the input value. The following expression should work:

to_date( to_char(endpoint_value,'FM99999999') || '.' || to_char(86400 * mod(endpoint_value,1),'FM99999'), 'J.sssss' ) ep_value

This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a ‘.’ in the middle, and then uses the ‘Julian’ and ‘seconds’ conversion format.

]]>