I find it convenient occasionally to “translate” a frequency histogram into a report of the underlying data (sometimes to see how closely the histogram matches the real data). To demonstrate the type of query I use I’ve created a data set with a small number of distinct values and generated a frequency histogram on the data set. This is what the data and histogram look like:
SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------
 N1                      NOT NULL NUMBER

SQL> select n1, count(*) from t1 group by n1 order by n1;

        N1   COUNT(*)
---------- ----------
        10          4
        20          3
        30          6
        40         38
        50          4
        60         13
        70          6
        80         41
        90          2
       100          3

SQL> select
  2     endpoint_number, endpoint_value
  3  from
  4     user_tab_histograms
  5  where
  6     table_name = 'T1'
  7  and column_name = 'N1'
  8  order
  9  by endpoint_number
 10  ;

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              4             10
              7             20
             13             30
             51             40
             55             50
             68             60
             74             70
            115             80
            117             90
            120            100
If you look at the histogram data you’ll see that the “endpoint_value” column holds a list of the values that appear in column n1, and the “endpoint_number” is the ‘cumulative frequency’ of the appearances of the endpoint_value – the number 10 appears 4 times, the number 20 appears 7 - 4 = 3 times, the value 30 appears 13 - 7 = 6 times, and so on.
Once we’ve recognised this pattern we can write some SQL to convert the histogram back to the list of values it represents by using the analytic lag() function. I’ll demonstrate this in two steps – first generating an intermediate result to show the effect of the lag() function.
select
        endpoint_number,
        lag(endpoint_number,1) over(
                order by endpoint_number
        )       delta,
        endpoint_value
from
        user_tab_histograms
where
        table_name = 'T1'
and     column_name = 'N1'
order by
        endpoint_number
;

ENDPOINT_NUMBER      DELTA ENDPOINT_VALUE
--------------- ---------- --------------
              4                        10
              7          4             20
             13          7             30
             51         13             40
             55         51             50
             68         55             60
             74         68             70
            115         74             80
            117        115             90
            120        117            100
The lag(column, N) function requires an order by clause inside its over() clause and has the effect of reporting a value N rows later than the row it originally appeared in (relative to that ordering).
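If you haven’t met lag() before, here’s a minimal sketch of its effect, run against a generated row source rather than the histogram views:

select
        rownum                                  n,
        lag(rownum,1) over(order by rownum)     prev_n
from
        all_objects
where
        rownum <= 5
;

         N     PREV_N
---------- ----------
         1
         2          1
         3          2
         4          3
         5          4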
As soon as we’ve got two consecutive endpoint numbers being reported as two columns in the same row we can find the difference – which gives us the answer we need:
select
        endpoint_value                          column_value,
        endpoint_number - nvl(prev_endpoint,0)  frequency
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )       prev_endpoint,
                endpoint_value
        from
                user_tab_histograms
        where
                table_name = 'T1'
        and     column_name = 'N1'
        )
order by
        endpoint_number
;

COLUMN_VALUE  FREQUENCY
------------ ----------
          10          4
          20          3
          30          6
          40         38
          50          4
          60         13
          70          6
          80         41
          90          2
         100          3
This is the code for numeric columns; for date columns you simply need to select to_date(to_char(endpoint_value),'J') rather than just endpoint_value. (Edit: this works for date-only values, but if you stored date and time then see the correction in comments 3 and 4 below.) Character strings, though, are more of a problem in the general case – and I may talk about those in another blog.
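For a date column the full query would look something like this (a sketch, assuming a hypothetical date-only column d1 on the same table with its own frequency histogram):

select
        to_date(to_char(endpoint_value),'J')    column_value,
        endpoint_number - nvl(prev_endpoint,0)  frequency
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )       prev_endpoint,
                endpoint_value
        from
                user_tab_histograms
        where
                table_name = 'T1'
        and     column_name = 'D1'
        )
order by
        endpoint_number
;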
Note, by the way, that the mechanism Oracle uses to generate the histogram data means that the results appear in the same order whether you sort by endpoint_value or endpoint_number. (Unless – as can happen with character columns, a point I initially forgot – the same endpoint_value appears in two rows: so I’ve edited the original code, where I had used endpoint_value, to use endpoint_number.)
Mr. John
So, at the end of the day, are you saying that if the comparison above between the real data and the histogram shows a large difference, then we should collect histograms after collecting the regular stats?
I hope I have understood what you have conveyed.
Regards
V.Hari
Comment by V.Hari — September 21, 2010 @ 1:35 pm BST
V.Hari,
I can’t make out what you are saying, but you should at least write the name correctly – his name is Jonathan, not John.
Comment by CJ — September 28, 2010 @ 1:45 pm BST
V.Hari,
Rather more importantly – if this code allows you to see that the histogram doesn’t represent real life then you have a clue that you may need to write a program to generate an “artificial” histogram and stop Oracle from collecting one.
Comment by Jonathan Lewis — September 29, 2010 @ 1:57 pm BST
[…] an earlier blog I wrote about a query that turned the figures from a frequency histogram into a list of column […]
Pingback by Frequency Histogram 3 « Oracle Scratchpad — September 24, 2010 @ 7:13 pm BST
Jonathan sir,
I tried using to_date(to_char(endpoint_value,'J')) for converting the endpoint_value of date fields:
SQL> select endpoint_number, to_date(to_char(endpoint_value,'J')) ep_value
  2  from dba_tab_histograms
  3  where table_name = 'TBL_BETS_FILLED'
  4  and owner = 'ODS' and column_name = 'PLACED_DATE'
  5  and rownum = 1;
select endpoint_number, to_date(to_char(endpoint_value,'J')) ep_value
                                *
ERROR at line 1:
ORA-01481: invalid number format model
Comment by CJ — September 28, 2010 @ 1:38 pm BST
I posted an incomplete entry above, but what I wanted to say was that I received the ‘ORA-01481: invalid number format model’ error. Any ideas?
Thanks.
Comment by CJ — September 28, 2010 @ 1:40 pm BST
CJ,
This was my error (and a common error in the Oracle world) – I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated ‘date-only’ values, but the column (and histogram) could hold ‘date and time’ values.
In the case of ‘date and time’, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75
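You can check the arithmetic from SQL*Plus with something like this:

select to_number(to_char(date '2010-09-29','J')) + 18/24 julian_value from dual;

JULIAN_VALUE
------------
  2455469.75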
This is why your format model ran into error 1481 – it didn’t match the input value. The following expression should work:
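Based on the description below it would have been something like the following (note, though, the update at the end of this comment, which corrects the first to_char()):

to_date(
        to_char(endpoint_value,'FM99999999') || '.' ||
        to_char(86400 * mod(endpoint_value,1),'FM99999'),
        'J.sssss'
)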
This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a ‘.’ in the middle, and then uses the ‘Julian’ and ‘seconds’ conversion format.
UPDATE: see next comment – the starting expression needs to be: to_char(trunc(endpoint_value),'FM99999999') – or you could use floor() instead of trunc()
Comment by Jonathan Lewis — September 29, 2010 @ 2:05 pm BST
Jonathan,
Had a problem with the expression above. For example, if you had the number 2454894.89011574, there appears to be some rounding to 2454895. I used the following and got the right values:
to_date(floor(endpoint_value) || '.' ||
        to_char(86400 * mod(endpoint_value, 1), 'FM999999999'),
        'J.sssss')
Thanks for your work on this subject.
Comment by Eric Evans — September 30, 2010 @ 5:28 am BST
Eric,
Strange – are you sure you’re not seeing the effects of the default date format on your system? Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.
Comment by Jonathan Lewis — September 30, 2010 @ 7:01 pm BST
Eric,
Ignore my previous comment – it was just too early in the morning – you’re right, we need to floor() or trunc() the value.
Comment by Jonathan Lewis — October 2, 2010 @ 7:12 am BST
Yep. I am on 10.2.0.4 Linux and when I do
select to_char(2454894.89011574,'FM99999999') char_val from dual;

CHAR_VAL
---------
2454895
According to the documentation:
“All number format models cause the number to be rounded to the specified number of significant digits.”
I wonder what settings affect this rounding, or whether it’s an outright bug. I checked Metalink and could find nothing relevant. For what it’s worth, when I do a “show” of my NLS parameters, the only ones set are:
nls_language AMERICAN
nls_length_semantics BYTE
nls_nchar_conv_excp FALSE
nls_territory AMERICA
Comment by Eric Evans — October 1, 2010 @ 4:31 pm BST
Just add floor()
select to_char(floor(2454894.89011574), 'FM999999999') val from dual;

VAL
----------
2454894
Comment by fsm — October 2, 2010 @ 5:11 am BST
[…] In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra […]
Pingback by Frequency Histogram 4 « Oracle Scratchpad — October 5, 2010 @ 6:27 pm BST
Hi Jonathan,
I don’t know if it’s okay to ask a question that’s rather “peripherally related” here, but as this is one of the more “basic” posts among those you’ve written on frequency histograms (and as I can’t find any related documentation on the net) I’ll just try :-)
Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, the literature (including Oracle’s Performance Tuning Guide) always seems to say that the distinctive feature of a frequency histogram is that there’s one bucket per distinct value…
Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps because they are too infrequent)?
Thanks a lot in advance
Sigrid
Comment by Sigrid Keydana — November 8, 2010 @ 4:22 pm GMT
Sigrid,
It’s a good question, as it’s in the right technical area for this blog, suitably general in content, and one that’s commonly asked and addressed on the web.
You didn’t give me the details of a specific example – but I’d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data, and the rest of the values represent a tiny fraction of the data, then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.
A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values. From time to time the stats collection would report that there were only two distinct values in the table – and it rarely managed to report every single value. It’s in cases like this that you might want to write a program to create and fix some representative stats, as sketched below.
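As a rough sketch of the sort of program I mean – every name and number here is hypothetical, and you would have to choose the values, counts, and density that represent your real data:

declare
        srec            dbms_stats.statrec;
        m_val_array     dbms_stats.numarray;
begin
--      The seven values we know the column should hold
        m_val_array := dbms_stats.numarray(1, 2, 3, 4, 5, 6, 7);

--      One bucket per value, holding the (approximate) frequency of each value
        srec.epc    := 7;
        srec.bkvals := dbms_stats.numarray(5000000, 5000000, 40, 40, 40, 40, 40);

        dbms_stats.prepare_column_values(srec, m_val_array);

        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 'T1',
                colname => 'N1',
                distcnt => 7,
                density => 1/20000000,          -- e.g. 1/(2 * num_rows); adjust to suit
                nullcnt => 0,
                srec    => srec,
                avgclen => 3
        );
end;
/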
Comment by Jonathan Lewis — November 8, 2010 @ 6:44 pm GMT
Thanks a lot Jonathan for the answer.
Regarding your explanation,
” … and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values … ”
“… the stats collection would report that there were only two distinct values in the table …” –
what puzzles me here is that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked “on purpose” to me…)
Right now I’ve had the opportunity to get the “real counts” at the same time as gathering the statistics, on a clone of the production database; picking a column with a substantial difference between num_buckets and num_distinct I get e.g.:
sqlplus> select mycol, count(1) from myschema.mytable group by mycol order by mycol;
mycol COUNT(1)
---------- ----------
1 1072
2 4
3 15334
6 7536
7 3315
10 473
11 61
12 124
20 900
42 1
50 979
55 1
62 2
71 25619
82 5141
83 8708
84 1224
85 1429
99 3518
116 1807
118 8
119 28
126 2142
128 1
129 3324
139 431
69223
sqlplus> select endpoint_value, endpoint_number from dba_tab_histograms where owner='myschema' and column_name='mycol' order by endpoint_value;
ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
1 38
3 626
6 900
7 996
10 1006
11 1007
12 1009
20 1032
50 1068
71 1979
82 2172
83 2503
84 2549
85 2603
99 2729
116 2779
126 2865
129 2983
139 3005
sqlplus> select column_name, histogram, num_nulls, num_buckets, num_distinct,
         abs(num_distinct - num_buckets) diff, sample_size, last_analyzed
         from dba_tab_col_statistics
         where histogram in ('FREQUENCY','HEIGHT BALANCED')
         and owner='myschema' and table_name='mytable'
         order by histogram, diff desc, column_name;
COLUMN_NAME HISTOGRAM NUM_NULLS NUM_BUCKETS NUM_DISTINCT DIFF SAMPLE_SIZE LAST_ANALYZED
------------------------------ --------------- ---------- ----------- ------------ ---------- ----------- --------------------
…
mycol FREQUENCY 69223 19 26 7 3005 09-NOV-2010 10:39:48
sqlplus> select count(*) from myschema.mytable where mycol is not null;
COUNT(*)
----------
83182
I unfortunately don’t have the time to pursue this further with the other columns now, but I wonder whether there might be some algorithm like “if the count for a value is less than e.g. total_count / sample_size (or some proportion of it), don’t build a bucket for it” – here, for example, the “most frequent” value that got no bucket has a count of 28, and the ratio total_count / sample_size is 27.68… (of course this is not really much data for a guess yet :-))
Thanks again,
Sigrid
Comment by Sigrid Keydana — November 9, 2010 @ 1:53 pm GMT