Oracle Scratchpad

October 5, 2010

Frequency Histogram 4

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 6:25 pm BST Oct 5,2010

In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.

The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:

  • Take the first 15 bytes of the string (after padding it with nulls (varchar) or spaces (char))
  • View this as a hexadecimal number, and convert to decimal
  • Round to 15 significant digits and store as the endpoint_value
  • If duplicate rows appear, store the first 32 characters (64 for 12c) of each string as the endpoint_actual_value

Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):


rem
rem     How to read a frequency histogram on a character column
rem

select
        endpoint_number,
        endpoint_number - nvl(prev_endpoint,0)  frequency,
        hex_val,
        chr(to_number(substr(hex_val, 2,2),'XX')) ||
        chr(to_number(substr(hex_val, 4,2),'XX')) ||
        chr(to_number(substr(hex_val, 6,2),'XX')) ||
        chr(to_number(substr(hex_val, 8,2),'XX')) ||
        chr(to_number(substr(hex_val,10,2),'XX')) ||
        chr(to_number(substr(hex_val,12,2),'XX')),
        endpoint_actual_value
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )                                                       prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
                endpoint_actual_value
        from
                dba_tab_histograms
        where
                owner = 'XXX'
        and     table_name = 'YYY'
        and     column_name = 'STATUS_COLUMN'
        )
order by
        endpoint_number
;

set doc off
doc

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
          40254      40254  434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
          40467        213  434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
          40592        125  44454C49564550D642CA2965000000 DELIVE DELIVERED
          41304        712  494E564F49432991BF41C99E800000 INVOIC INVOICED
          41336         32  4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
          41434         98  5041494400000C08C1A415AD800000 PAID   PAID
          41435          1  5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
          41478         43  5049434B4544013F0FF93F6EC00000 PICKED PICKED
          41479          1  524546554E4436441DE2A321000000 REFUND REFUND MADE
          41480          1  524546554E4436441DE2A321000000 REFUND REFUND PENDING
          41482          2  52455455524E2F6693F753B6C00000 RETURN RETURNED

11 rows selected.

#

You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.

There are a number of side effects to the 15 byte / 32 byte limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.

Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.

[Further reading on Histograms]

13 Comments »

  1. Little typo in the footnote: you wrote “on the other hadn” instead of on “the other hand”.

    If I understand Oracle 11g has a different algorithm, right?

    Comment by lascoltodelvenerdi — October 7, 2010 @ 3:26 pm BST Oct 7,2010 | Reply

    • Thanks,
      I’ve fixed the typo.

      I haven’t seen a change in the algorithm for generating the content of a histogram in 11g – but I ran this example against 11.1, not 11.2.

      Comment by Jonathan Lewis — October 7, 2010 @ 7:08 pm BST Oct 7,2010 | Reply

  2. [...] an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and [...]

    Pingback by Frequency Histogram 5 « Oracle Scratchpad — October 13, 2010 @ 9:23 am BST Oct 13,2010 | Reply

  3. [...] under: Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm UTC Oct 19,2010 In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and [...]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 | Reply

  4. rajesh@10GR2> SELECT endpoint_number,
      2    endpoint_number - NVL(prev_endpoint,0) frequency,
      3    hex_values,
      4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
      5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
      6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
      7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
      8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
      9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
     10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
     11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
     12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
     13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
     14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
     15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
     16    as column_values
     17  FROM
     18    (SELECT endpoint_number,
     19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
     20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
     21      endpoint_actual_value
     22    FROM user_tab_histograms
     23    WHERE table_name  = 'T'
     24    AND column_name = 'OBJECT_NAME'
     25    )
     26  ORDER BY endpoint_number ;
    
    ENDPOINT_NUMBER  FREQUENCY HEX_VALUES                      COLUMN_VALUE
    --------------- ---------- ------------------------------- ------------
                  1          1  4144545F53544315EDB85726C00000 ADT_STC§f+W&
                  2          1  41534F5F53452F1F9B9266BEA00000 ASO_SE/?++f+
                  5          3  49525F5354474CEC7454244BE00000 IR_STGL8tT$K
                  7          2  49585F53544751CD2477F0CDA00000 IX_STGQ-$w=-
                  8          1  504B475F50523F5F66AC4457A00000 PKG_PR?_f¼DW
                  9          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
                 10          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
    
    7 rows selected.
    
    Elapsed: 00:00:00.90
    rajesh@10GR2>
    

    I am not getting the proper values in the column COLUMN_VALUE. Is this issue is due to Character set?

    Comment by Rajeshwaran — January 21, 2011 @ 3:46 pm BST Jan 21,2011 | Reply

    • Rajeshwaran,
      This seems to be covered by my comment:

        With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original.

      I stopped at six characters because I knew that the 7th onwards would be garbage.

      Comment by Jonathan Lewis — January 21, 2011 @ 5:19 pm BST Jan 21,2011 | Reply

  5. How did you arrived at the algorithm described above?

    Is that available in Oracle product documentation? if yes, can you provide link to that?

    Comment by Rajeshwaran — January 22, 2011 @ 8:39 am BST Jan 22,2011 | Reply

  6. Jonathan,

    I just came across this while I was working on a different issue – I think your description above how Oracle arrives at the numeric representation is not consistent with what you’ve outlined in “Cost Based Oracle – Fundamentals”.

    The first step should read: Take the first 15 bytes, not six (Of course it’s obvious where the six comes from, but the result will not be consistent with Oracle’s result when using only the first six bytes).

    By the way, for a original string less than six (single-byte) characters the last character can be incorrect due to the rounding applied, so you don’t end up necessarily with five correctly extracted characters sometimes.

    Randolf

    Comment by Randolf Geist — June 9, 2011 @ 8:27 pm BST Jun 9,2011 | Reply

    • Randolf,

      Thanks for pointing out the error – now fixed. I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

      You’re right about the “less than six characters” of course. I actually gave an example of this with the “NEW / NEVÿÿÿ” switch in the example, but I’ve added in a comment to that effect where I make the point about the 6th character possibly being wrong.

      Comment by Jonathan Lewis — June 10, 2011 @ 11:11 am BST Jun 10,2011 | Reply

      • > I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

        May be 0×20 = 32?

        Randolf

        Comment by Randolf Geist — June 10, 2011 @ 11:26 am BST Jun 10,2011 | Reply

  7. […] I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its numeric form to its character equivalent, if you have ASCII codes at your finger tips you’ll spot lots of spaces (0×20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ […]

    Pingback by Histograms Part 1 – Why? – All Things Oracle — August 28, 2013 @ 4:22 pm BST Aug 28,2013 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,454 other followers