Oracle Scratchpad

October 5, 2010

Frequency Histogram 4

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 6:25 pm BST Oct 5,2010

In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.

The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:

  • Take the first 15 bytes of the string (after padding the string with nulls (for varchar2) or spaces (for char))
  • Treat the result as a hexadecimal number, and convert to decimal
  • Round to 15 significant digits and store as the endpoint_value
  • If duplicate rows appear, store the first 32 characters (increased to 64 for 12c) of each string as the endpoint_actual_value

Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least 5 characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):

rem     How to read a frequency histogram on a character column

        endpoint_number - nvl(prev_endpoint,0)  frequency,
        chr(to_number(substr(hex_val, 2,2),'XX')) ||
        chr(to_number(substr(hex_val, 4,2),'XX')) ||
        chr(to_number(substr(hex_val, 6,2),'XX')) ||
        chr(to_number(substr(hex_val, 8,2),'XX')) ||
        chr(to_number(substr(hex_val,10,2),'XX')) ||
from    (
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )                                                       prev_endpoint,
                owner = 'XXX'
        and     table_name = 'YYY'
        and     column_name = 'STATUS_COLUMN'
order by

set doc off

--------------- ---------- ------------------------------- ------ ------------------------------------------
          40254      40254  434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
          40467        213  434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
          40592        125  44454C49564550D642CA2965000000 DELIVE DELIVERED
          41304        712  494E564F49432991BF41C99E800000 INVOIC INVOICED
          41336         32  4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
          41434         98  5041494400000C08C1A415AD800000 PAID   PAID
          41435          1  5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
          41478         43  5049434B4544013F0FF93F6EC00000 PICKED PICKED
          41479          1  524546554E4436441DE2A321000000 REFUND REFUND MADE
          41480          1  524546554E4436441DE2A321000000 REFUND REFUND PENDING
          41482          2  52455455524E2F6693F753B6C00000 RETURN RETURNED

11 rows selected.


You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.

There are a number of side effects to the 15 byte / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.

Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.

[Further reading on Histograms]


  1. Little typo in the footnote: you wrote “on the other hadn” instead of on “the other hand”.

    If I understand Oracle 11g has a different algorithm, right?

    Comment by lascoltodelvenerdi — October 7, 2010 @ 3:26 pm BST Oct 7,2010 | Reply

    • Thanks,
      I’ve fixed the typo.

      I haven’t seen a change in the algorithm for generating the content of a histogram in 11g – but I ran this example against 11.1, not 11.2.

      Comment by Jonathan Lewis — October 7, 2010 @ 7:08 pm BST Oct 7,2010 | Reply

  2. [...] an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and [...]

    Pingback by Frequency Histogram 5 « Oracle Scratchpad — October 13, 2010 @ 9:23 am BST Oct 13,2010 | Reply

  3. [...] under: Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm UTC Oct 19,2010 In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and [...]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 | Reply

  4. rajesh@10GR2> SELECT endpoint_number,
      2    endpoint_number - NVL(prev_endpoint,0) frequency,
      3    hex_values,
      4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
      5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
      6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
      7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
      8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
      9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
     10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
     11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
     12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
     13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
     14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
     15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
     16    as column_values
     17  FROM
     18    (SELECT endpoint_number,
     19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
     20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
     21      endpoint_actual_value
     22    FROM user_tab_histograms
     23    WHERE table_name  = 'T'
     24    AND column_name = 'OBJECT_NAME'
     25    )
     26  ORDER BY endpoint_number ;
    --------------- ---------- ------------------------------- ------------
                  1          1  4144545F53544315EDB85726C00000 ADT_STC§f+W&
                  2          1  41534F5F53452F1F9B9266BEA00000 ASO_SE/?++f+
                  5          3  49525F5354474CEC7454244BE00000 IR_STGL8tT$K
                  7          2  49585F53544751CD2477F0CDA00000 IX_STGQ-$w=-
                  8          1  504B475F50523F5F66AC4457A00000 PKG_PR?_f¼DW
                  9          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
                 10          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
    7 rows selected.
    Elapsed: 00:00:00.90

    I am not getting the proper values in the column COLUMN_VALUE. Is this issue is due to Character set?

    Comment by Rajeshwaran — January 21, 2011 @ 3:46 pm BST Jan 21,2011 | Reply

    • Rajeshwaran,
      This seems to be covered by my comment:

        With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original.

      I stopped at six characters because I knew that the 7th onwards would be garbage.

      Comment by Jonathan Lewis — January 21, 2011 @ 5:19 pm BST Jan 21,2011 | Reply

  5. How did you arrived at the algorithm described above?

    Is that available in Oracle product documentation? if yes, can you provide link to that?

    Comment by Rajeshwaran — January 22, 2011 @ 8:39 am BST Jan 22,2011 | Reply

  6. Jonathan,

    I just came across this while I was working on a different issue – I think your description above how Oracle arrives at the numeric representation is not consistent with what you’ve outlined in “Cost Based Oracle – Fundamentals”.

    The first step should read: Take the first 15 bytes, not six (Of course it’s obvious where the six comes from, but the result will not be consistent with Oracle’s result when using only the first six bytes).

    By the way, for a original string less than six (single-byte) characters the last character can be incorrect due to the rounding applied, so you don’t end up necessarily with five correctly extracted characters sometimes.


    Comment by Randolf Geist — June 9, 2011 @ 8:27 pm BST Jun 9,2011 | Reply

    • Randolf,

      Thanks for pointing out the error – now fixed. I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

      You’re right about the “less than six characters” of course. I actually gave an example of this with the “NEW / NEVÿÿÿ” switch in the example, but I’ve added in a comment to that effect where I make the point about the 6th character possibly being wrong.

      Comment by Jonathan Lewis — June 10, 2011 @ 11:11 am BST Jun 10,2011 | Reply

      • > I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

        May be 0×20 = 32?


        Comment by Randolf Geist — June 10, 2011 @ 11:26 am BST Jun 10,2011 | Reply

  7. […] I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its numeric form to its character equivalent, if you have ASCII codes at your finger tips you’ll spot lots of spaces (0×20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: […]

    Pingback by Histograms Part 1 – Why? – All Things Oracle — August 28, 2013 @ 4:22 pm BST Aug 28,2013 | Reply

  8. […] wrote a note a few years ago about translating the endpoint_value for histograms on character columns, and in that note I left […]

    Pingback by endpoint_value | Oracle Scratchpad — May 5, 2014 @ 2:58 pm BST May 5,2014 | Reply

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at


Get every new post delivered to your Inbox.

Join 3,877 other followers