Oracle Scratchpad

September 20, 2010

Frequency Histograms 2

Filed under: Histograms,Statistics — Jonathan Lewis @ 6:18 pm GMT Sep 20,2010

I find it convenient occasionally to “translate” a frequency histogram into a report of the underlying data (sometimes to see how closely the histogram matches the real data). To demonstrate the type of query I use I’ve created a data set with a small number of distinct values and generated a frequency histogram on the data set. This is what the data and histogram look like:
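If you want to reproduce the demo, here’s a sketch of the type of setup I mean (this is not the original script, so the random generator here is only an assumption and won’t reproduce the exact counts below; with so few distinct values the ‘size 254’ request will produce a frequency histogram):

create table t1 (n1 number not null);

insert into t1
select
	10 * trunc(dbms_random.value(1,11))	-- multiples of 10 from 10 to 100
from
	dual
connect by
	level <= 120
;

commit;

begin
	dbms_stats.gather_table_stats(
		ownname		=> user,
		tabname		=> 'T1',
		method_opt	=> 'for all columns size 254'
	);
end;
/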


SQL> desc t1
 Name                    Null?    Type
 ----------------------- -------- ----------
 N1                      NOT NULL NUMBER

SQL> select n1, count(*) from t1 group by n1 order by n1;

        N1   COUNT(*)
---------- ----------
        10          4
        20          3
        30          6
        40         38
        50          4
        60         13
        70          6
        80         41
        90          2
       100          3

SQL> select
  2     endpoint_number, endpoint_value
  3  from
  4     user_tab_histograms
  5  where
  6     table_name  = 'T1'
  7  and        column_name = 'N1'
  8  order
  9     by endpoint_number
 10  ;

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              4             10
              7             20
             13             30
             51             40
             55             50
             68             60
             74             70
            115             80
            117             90
            120            100

If you look at the histogram data you’ll see that the “endpoint_value” column holds a list of the values that appear in column n1, and the “endpoint_number” is the ‘cumulative frequency’ for the appearances of the endpoint_value: the number 10 appears 4 times, the number 20 appears 7 - 4 = 3 times, the value 30 appears 13 - 7 = 6 times, and so on.
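As a quick sanity check of this reading, the highest endpoint_number should equal the number of (non-null) rows in the table, which is 120 in this case:

select
	max(endpoint_number)	total_rows
from
	user_tab_histograms
where
	table_name  = 'T1'
and	column_name = 'N1'
;

TOTAL_ROWS
----------
       120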

Once we’ve recognised this pattern we can write some SQL to convert the histogram back to the list of values it represents by using the analytic lag() function. I’ll demonstrate this in two steps – first generating an intermediate result to show the effect of the lag() function.

select
	endpoint_number,
	lag(endpoint_number,1) over(
		order by endpoint_number
	)		delta,
	endpoint_value
from
	user_tab_histograms
where
	table_name  = 'T1'
and	column_name = 'N1'
order by
	endpoint_number
;

ENDPOINT_NUMBER      DELTA ENDPOINT_VALUE
--------------- ---------- --------------
              4                        10
              7          4             20
             13          7             30
             51         13             40
             55         51             50
             68         55             60
             74         68             70
            115         74             80
            117        115             90
            120        117            100

The lag(column, N) function requires an order by clause in its over() clause and has the effect of reporting a value N rows later than it would otherwise appear (relative to that ordering); in this query, therefore, each row shows the endpoint_number of the preceding row in the delta column.
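If you haven’t met lag() before, here’s a tiny stand-alone illustration (a five-row set generated from dual, nothing to do with the histogram itself):

select
	rownum				n,
	lag(rownum,1) over(
		order by rownum
	)				prev_n
from
	dual
connect by
	level <= 5
;

         N     PREV_N
---------- ----------
         1
         2          1
         3          2
         4          3
         5          4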

As soon as we’ve got two consecutive endpoint numbers being reported as two columns in the same row we can find the difference – which gives us the answer we need:

select
	endpoint_value				column_value,
	endpoint_number - nvl(prev_endpoint,0)	frequency
from	(
	select
		endpoint_number,
		lag(endpoint_number,1) over(
			order by endpoint_number
		) 				prev_endpoint,
		endpoint_value
	from
		user_tab_histograms
	where
		table_name  = 'T1'
	and	column_name = 'N1'
	)
order by
	endpoint_number
;

COLUMN_VALUE  FREQUENCY
------------ ----------
          10          4
          20          3
          30          6
          40         38
          50          4
          60         13
          70          6
          80         41
          90          2
         100          3

This is the code for numeric columns; for date columns you simply need to select to_date(to_char(endpoint_value,'J')) rather than just endpoint_value (Edit: this works for date-only values, but if you stored date and time then see the correction in comments 3 and 4 and below). Character strings, though, are more of a problem in the general case, and I may talk about those in another blog.
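Putting that together with the corrected conversion from comments 3 and 4 below, a sketch for a column holding date and time values might look like this (the table and column names here are hypothetical):

select
	to_date(
		floor(endpoint_value) || '.' ||
		to_char(86400 * mod(endpoint_value,1),'FM99999'),
		'J.sssss'
	)					column_value,
	endpoint_number - nvl(prev_endpoint,0)	frequency
from	(
	select
		endpoint_number,
		lag(endpoint_number,1) over(
			order by endpoint_number
		)				prev_endpoint,
		endpoint_value
	from
		user_tab_histograms
	where
		table_name  = 'T2'
	and	column_name = 'D1'
	)
order by
	endpoint_number
;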

Note, by the way, that the mechanism that Oracle uses to generate the histogram data means that the results appear in the same order whether you sort by endpoint_value or endpoint_number. (Unless, as can happen with character columns, the same endpoint_value appears in two rows; this was a point I initially forgot, so I’ve edited the original code to order by endpoint_number where I had used endpoint_value.)

[Further reading on Histograms]

16 Comments »

  1. Mr. John
    End of day, I hope you are insisting on this, with the above comparison between real and histogram if we find any vast difference, then we should perform collecting histograms post collecting the regular stats.

    Hope I have understood what you have conveyed.

    Regards
    V.Hari

    Comment by V.Hari — September 21, 2010 @ 1:35 pm GMT Sep 21,2010 | Reply

    • V.Hari,

I can’t make out what you are saying, but you should at least write the name correctly: his name is Jonathan, not John.

      Comment by CJ — September 28, 2010 @ 1:45 pm GMT Sep 28,2010 | Reply

    • V.Hari,

Rather more importantly: if this code allows you to see that the histogram doesn’t represent real life, then you have a clue that you may need to write a program to generate an “artificial” histogram and stop Oracle from collecting one.
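For example, here’s a minimal sketch of that sort of program, using dbms_stats to attach a hand-built frequency histogram to column n1 of table t1 (the values, frequencies and density below are invented purely for illustration):

declare
	srec	dbms_stats.statrec;
	m_vals	dbms_stats.numarray;
begin
--	Invented figures: 10 appears 4 times, 40 appears 38 times, 80 appears 41 times
	m_vals		:= dbms_stats.numarray(10, 40, 80);
	srec.epc	:= 3;
	srec.bkvals	:= dbms_stats.numarray(4, 38, 41);

	dbms_stats.prepare_column_values(srec, m_vals);

	dbms_stats.set_column_stats(
		ownname	=> user,
		tabname	=> 'T1',
		colname	=> 'N1',
		distcnt	=> 3,
		density	=> 1/(2 * 83),	-- one common convention: 1 / (2 * number of rows)
		nullcnt	=> 0,
		srec	=> srec,
		avgclen	=> 3
	);
end;
/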

      Comment by Jonathan Lewis — September 29, 2010 @ 1:57 pm GMT Sep 29,2010 | Reply

  2. […] an earlier blog I wrote about a query that turned the figures from a frequency histogram into a list of column […]

    Pingback by Frequency Histogram 3 « Oracle Scratchpad — September 24, 2010 @ 7:13 pm GMT Sep 24,2010 | Reply

  3. Jonathan sir,

I tried using TO_DATE(TO_CHAR(endpoint_value, 'J')) for converting the endpoint_value of date fields:

SQL> select endpoint_number, to_date(to_char(endpoint_value,'J')) ep_value
  2  from dba_tab_histograms
  3  where table_name = 'TBL_BETS_FILLED'
  4  and owner = 'ODS' and column_name = 'PLACED_DATE'
  5  and rownum = 1;
select endpoint_number, to_date(to_char(endpoint_value,'J')) ep_value
                        *
ERROR at line 1:
ORA-01481: invalid number format model

    Comment by CJ — September 28, 2010 @ 1:38 pm GMT Sep 28,2010 | Reply

I posted an incomplete entry above, but what I wanted to say was that I received the ‘ORA-01481: invalid number format model’ error. Any ideas?

      Thanks.

      Comment by CJ — September 28, 2010 @ 1:40 pm GMT Sep 28,2010 | Reply

    • CJ,

      This was my error (and a common error in the Oracle world) – I forgot to generalise the principle. I ran up a quick data set to test the code and used something that generated ‘date-only’ values, but the column (and histogram) could hold ‘date and time’ values.

      In the case of ‘date and time’, the time component is stored as the fraction of a day, for example, 6:00 pm on 29th Sept 2010 would be stored as: 2455469.75

      This is why your format model ran into error 1481 – it didn’t match the input value. The following expression should work:

	to_date(
		to_char(endpoint_value,'FM99999999') || '.' ||
		to_char(86400 * mod(endpoint_value,1),'FM99999'),
		'J.sssss'
	)	ep_value

      This splits the value into a day part and fraction of day, multiplies the fraction by the number of seconds in a day, converts both bits to character (with no spaces), concatenates them with a ‘.’ in the middle, and then uses the ‘Julian’ and ‘seconds’ conversion format.
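As a quick check of the two pieces (just selecting from dual, using the example value above):

select
	mod(2455469.75,1)		day_fraction,
	86400 * mod(2455469.75,1)	seconds
from
	dual
;

DAY_FRACTION    SECONDS
------------ ----------
         .75      64800

and 64,800 seconds past midnight is, indeed, 6:00 pm.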

      Comment by Jonathan Lewis — September 29, 2010 @ 2:05 pm GMT Sep 29,2010 | Reply

  4. Jonathan,

    Had a problem with the expression above. For example, if you had the number 2454894.89011574 there appears to be some rounding to 2454895. I used the following and got the right values:

    to_date(floor(endpoint_value) || '.' ||
    to_char(86400 * mod(endpoint_value, 1), 'FM999999999')
    ,'J.sssss')

    Thanks for your work on this subject.

    Comment by Eric Evans — September 30, 2010 @ 5:28 am GMT Sep 30,2010 | Reply

    • Eric,

Strange – are you sure you’re not seeing the effects of the default format for dates on your system? Putting the value from your example into my formula (selecting from dual) gives me the date and time that I expect to see.

      Comment by Jonathan Lewis — September 30, 2010 @ 7:01 pm GMT Sep 30,2010 | Reply

    • Eric,

      Ignore my previous comment – it was just too early in the morning – you’re right, we need to floor() or trunc() the value.

      Comment by Jonathan Lewis — October 2, 2010 @ 7:12 am GMT Oct 2,2010 | Reply

  5. Yep. I am on 10.2.0.4 Linux and when I do

select to_char(2454894.89011574,'FM99999999') char_val from dual;

CHAR_VAL
--------
2454895

    According to the documentation:
    “All number format models cause the number to be rounded to the specified number of significant digits.”

I wonder what settings affect this rounding, or whether it’s an outright bug. I checked Metalink and could find nothing relevant. For what it’s worth, when I do a “show” of my NLS parameters, the only ones set are:
    nls_language AMERICAN
    nls_length_semantics BYTE
    nls_nchar_conv_excp FALSE
    nls_territory AMERICA

    Comment by Eric Evans — October 1, 2010 @ 4:31 pm GMT Oct 1,2010 | Reply

    • Just add floor()

select to_char(floor(2454894.89011574), 'FM999999999') val from dual;

VAL
----------
2454894

      Comment by fsm — October 2, 2010 @ 5:11 am GMT Oct 2,2010 | Reply

  6. […] Filed under: Statistics,Troubleshooting — Jonathan Lewis @ 6:25 pm UTC Oct 5,2010 In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra […]

    Pingback by Frequency Histogram 4 « Oracle Scratchpad — October 5, 2010 @ 6:27 pm GMT Oct 5,2010 | Reply

  7. Hi Jonathan,

    I don’t know if it’s okay to ask a question that’s rather “peripherally related” here, but this being one of the more “basic” ones among your posts related to frequency histograms (and as I don’t find any related documentation on the net) I’ll just try :-;
Looking at dba_tab_col_statistics, I find lots of frequency histograms where num_buckets is (even substantially) lower than num_distinct (in 11.1.0). However, the literature (including Oracle’s Performance Tuning Guide) always seems to say that the distinctive feature of a frequency histogram is that there’s one bucket per distinct value…
Am I getting this totally wrong here? Or might there be a (seldom mentioned?) algorithm that lets Oracle skip some values (perhaps ones that are too infrequent)?

    Thanks a lot in advance
    Sigrid

    Comment by Sigrid Keydana — November 8, 2010 @ 4:22 pm GMT Nov 8,2010 | Reply

    • Sigrid,

      It’s a good question, as it’s in the right technical area for this blog, suitably general in content, and one that’s commonly asked and addressed on the web.

You didn’t give me the details of a specific example, but I’d guess that the problem lies in sampling. If you have a relatively small number of distinct values that cover MOST of the data and the rest of the values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values.

A recent client had a column where two values covered about 10 million rows in a table, leaving a couple of hundred rows for the remaining five or six values. From time to time the stats collection would report that there were only two distinct values in the table, and it rarely managed to report every single value. It’s in cases like this that you might want to write a program to create and fix some representative stats.
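As a sketch of how you might see the effect for yourself (the table and column names here are hypothetical), gather stats on such a column with a small sample size and compare the dictionary figures with the true counts:

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'SKEWED_TAB',
		method_opt	 => 'for columns status size 254',
		estimate_percent => 1	-- a 1% sample is quite likely to miss the rare values
	);
end;
/

select num_distinct, num_buckets, sample_size
from	user_tab_col_statistics
where	table_name  = 'SKEWED_TAB'
and	column_name = 'STATUS'
;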

      Comment by Jonathan Lewis — November 8, 2010 @ 6:44 pm GMT Nov 8,2010 | Reply

      • Thanks a lot Jonathan for the answer.

        Regarding your explanation,

        ” … and the rest of values represent a tiny fraction of the data then the sampling mechanism that Oracle uses is quite likely to miss some of the scarce values … ”
        “… the stats collection would report that there were only two distinct values in the table …” –

what puzzles me here is that the sampling mechanism seems to be accurate enough to get the (possibly) correct number of distinct values, but then constructs the histogram in another way (so it even looked “on purpose” to me…).

Right now I’ve had the chance to get the “real counts” at the same time as gathering the statistics on a clone of the production database, and picking a column with a substantial difference between num_buckets and num_distinct I get e.g.:

        sqlplus> select mycol, count(1) from myschema.mytable group by mycol order by mycol;

     MYCOL   COUNT(1)
---------- ----------
         1       1072
         2          4
         3      15334
         6       7536
         7       3315
        10        473
        11         61
        12        124
        20        900
        42          1
        50        979
        55          1
        62          2
        71      25619
        82       5141
        83       8708
        84       1224
        85       1429
        99       3518
       116       1807
       118          8
       119         28
       126       2142
       128          1
       129       3324
       139        431
                69223

sqlplus> select endpoint_value, endpoint_number from dba_tab_histograms where owner='myschema' and column_name='mycol' order by endpoint_value;

ENDPOINT_VALUE ENDPOINT_NUMBER
-------------- ---------------
             1              38
             3             626
             6             900
             7             996
            10            1006
            11            1007
            12            1009
            20            1032
            50            1068
            71            1979
            82            2172
            83            2503
            84            2549
            85            2603
            99            2729
           116            2779
           126            2865
           129            2983
           139            3005

sqlplus> select column_name, histogram, num_nulls, num_buckets, num_distinct, abs(num_distinct - num_buckets) diff, sample_size, last_analyzed from dba_tab_col_statistics where histogram in ('FREQUENCY', 'HEIGHT BALANCED') and owner='myschema' and table_name='mytable' order by histogram, diff desc, column_name;

COLUMN_NAME HISTOGRAM NUM_NULLS NUM_BUCKETS NUM_DISTINCT DIFF SAMPLE_SIZE LAST_ANALYZED
----------- --------- --------- ----------- ------------ ---- ----------- --------------------
mycol       FREQUENCY     69223          19           26    7        3005 09-NOV-2010 10:39:48

        sqlplus> select count(*) from myschema.mytable where mycol is not null;

  COUNT(*)
----------
     83182

I unfortunately don’t have the time to pursue this further with the other columns now, but I wonder whether there might be some algorithm like “if the count of a value is less than, say, total_count / sample_size (or some proportion of this), don’t build a bucket for it”. Here, for example, the “most frequent” value that got no bucket has a count of 28, and the ratio total_count / sample_size is 27.68… (of course this is not really much data for a guess yet :-;)

        Thanks again,
        Sigrid

        Comment by Sigrid Keydana — November 9, 2010 @ 1:53 pm GMT Nov 9,2010 | Reply

