Oracle Scratchpad

October 5, 2010

Frequency Histogram 4

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 6:25 pm BST Oct 5,2010

In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.

The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:

  • Take the first 15 bytes of the string (after padding the string with nulls (for varchar2()) or spaces (for char() – until 11.2.0.4))
  • Treat the result as a hexadecimal number, and convert to decimal
  • Round to 15 significant digits using round(N, -21) and store as the endpoint_value
  • If duplicate rows appear, store the first 32 characters (increased to 64 for 12c) of each string as the endpoint_actual_value

Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least 5 characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):


rem
rem     Script:         read_char_hist.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2010
rem
rem     How to read a frequency histogram on a character column
rem
column endpoint_actual_value format a32
select
        endpoint_number                    cum_frequency,
        endpoint_number - prev_endpoint    frequency,
        hex_val,
        chr(to_number(substr(hex_val, 2,2),'XX')) ||
        chr(to_number(substr(hex_val, 4,2),'XX')) ||
        chr(to_number(substr(hex_val, 6,2),'XX')) ||
        chr(to_number(substr(hex_val, 8,2),'XX')) ||
        chr(to_number(substr(hex_val,10,2),'XX')) ||
        chr(to_number(substr(hex_val,12,2),'XX')),
        endpoint_actual_value
from    (
        select
                endpoint_number,
                lag(endpoint_number,1,0) over(
                        order by endpoint_number
                )                                                       prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
                endpoint_actual_value
        from
                dba_tab_histograms
        where
                owner = 'XXX'
        and     table_name = 'YYY'
        and     column_name = 'STATUS_COLUMN'
        )
order by
        endpoint_number
;

set doc off
doc

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
          40254      40254  434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
          40467        213  434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
          40592        125  44454C49564550D642CA2965000000 DELIVE DELIVERED
          41304        712  494E564F49432991BF41C99E800000 INVOIC INVOICED
          41336         32  4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
          41434         98  5041494400000C08C1A415AD800000 PAID   PAID
          41435          1  5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
          41478         43  5049434B4544013F0FF93F6EC00000 PICKED PICKED
          41479          1  524546554E4436441DE2A321000000 REFUND REFUND MADE
          41480          1  524546554E4436441DE2A321000000 REFUND REFUND PENDING
          41482          2  52455455524E2F6693F753B6C00000 RETURN RETURNED

11 rows selected.

#

You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.

There are a number of side effects to the 15 byte / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.

Footnote:

It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.

Update:

I’ve tested this code on versions up to 18.3.0.0 with the same results. In the course of the upgrade, though, the stored actual value extends from 32 bytes to 64 bytes for 12c, and the treatment of char() – as opposed to varchar2() – columns changed in 11.2.0.4 which means Oracle now strips trailing spaces from char() values and pads them with null (zero) bytes if they are shorter than the required 6 bytes.

[Further reading on Histograms]

22 Comments »

  1. Little typo in the footnote: you wrote “on the other hadn” instead of on “the other hand”.

    If I understand Oracle 11g has a different algorithm, right?

    Comment by lascoltodelvenerdi — October 7, 2010 @ 3:26 pm BST Oct 7,2010 | Reply

    • Thanks,
      I’ve fixed the typo.

      I haven’t seen a change in the algorithm for generating the content of a histogram in 11g – but I ran this example against 11.1, not 11.2.

      Comment by Jonathan Lewis — October 7, 2010 @ 7:08 pm BST Oct 7,2010 | Reply

  2. […] an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]

    Pingback by Frequency Histogram 5 « Oracle Scratchpad — October 13, 2010 @ 9:23 am BST Oct 13,2010 | Reply

  3. […] under: Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm UTC Oct 19,2010 In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 | Reply

  4. rajesh@10GR2> SELECT endpoint_number,
      2    endpoint_number - NVL(prev_endpoint,0) frequency,
      3    hex_values,
      4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
      5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
      6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
      7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
      8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
      9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
     10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
     11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
     12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
     13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
     14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
     15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
     16    as column_values
     17  FROM
     18    (SELECT endpoint_number,
     19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
     20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
     21      endpoint_actual_value
     22    FROM user_tab_histograms
     23    WHERE table_name  = 'T'
     24    AND column_name = 'OBJECT_NAME'
     25    )
     26  ORDER BY endpoint_number ;
    
    ENDPOINT_NUMBER  FREQUENCY HEX_VALUES                      COLUMN_VALUE
    --------------- ---------- ------------------------------- ------------
                  1          1  4144545F53544315EDB85726C00000 ADT_STC§f+W&
                  2          1  41534F5F53452F1F9B9266BEA00000 ASO_SE/?++f+
                  5          3  49525F5354474CEC7454244BE00000 IR_STGL8tT$K
                  7          2  49585F53544751CD2477F0CDA00000 IX_STGQ-$w=-
                  8          1  504B475F50523F5F66AC4457A00000 PKG_PR?_f¼DW
                  9          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
                 10          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
    
    7 rows selected.
    
    Elapsed: 00:00:00.90
    rajesh@10GR2>
    

    I am not getting the proper values in the column COLUMN_VALUE. Is this issue is due to Character set?

    Comment by Rajeshwaran — January 21, 2011 @ 3:46 pm GMT Jan 21,2011 | Reply

    • Rajeshwaran,
      This seems to be covered by my comment:

        With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original.

      I stopped at six characters because I knew that the 7th onwards would be garbage.

      Comment by Jonathan Lewis — January 21, 2011 @ 5:19 pm GMT Jan 21,2011 | Reply

  5. How did you arrived at the algorithm described above?

    Is that available in Oracle product documentation? if yes, can you provide link to that?

    Comment by Rajeshwaran — January 22, 2011 @ 8:39 am GMT Jan 22,2011 | Reply

  6. Jonathan,

    I just came across this while I was working on a different issue – I think your description above how Oracle arrives at the numeric representation is not consistent with what you’ve outlined in “Cost Based Oracle – Fundamentals”.

    The first step should read: Take the first 15 bytes, not six (Of course it’s obvious where the six comes from, but the result will not be consistent with Oracle’s result when using only the first six bytes).

    By the way, for a original string less than six (single-byte) characters the last character can be incorrect due to the rounding applied, so you don’t end up necessarily with five correctly extracted characters sometimes.

    Randolf

    Comment by Randolf Geist — June 9, 2011 @ 8:27 pm BST Jun 9,2011 | Reply

    • Randolf,

      Thanks for pointing out the error – now fixed. I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

      You’re right about the “less than six characters” of course. I actually gave an example of this with the “NEW / NEVÿÿÿ” switch in the example, but I’ve added in a comment to that effect where I make the point about the 6th character possibly being wrong.

      Comment by Jonathan Lewis — June 10, 2011 @ 11:11 am BST Jun 10,2011 | Reply

      • > I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

        May be 0x20 = 32?

        Randolf

        Comment by Randolf Geist — June 10, 2011 @ 11:26 am BST Jun 10,2011 | Reply

  7. […] I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its numeric form to its character equivalent, if you have ASCII codes at your finger tips you’ll spot lots of spaces (0×20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: https://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ […]

    Pingback by Histograms Part 1 – Why? – All Things Oracle — August 28, 2013 @ 4:22 pm BST Aug 28,2013 | Reply

  8. […] wrote a note a few years ago about translating the endpoint_value for histograms on character columns, and in that note I left […]

    Pingback by endpoint_value | Oracle Scratchpad — May 5, 2014 @ 2:58 pm BST May 5,2014 | Reply

  9. Hi Jonathan,

    I am trying to fake the histograms for a CHAR column. For a CHAR datatype column, when gather_table_stats was run, the endpoint_actual_value had the char values like ‘JUMBO BOX’, ‘JUMBO CAR’, etc. in user_tab_histograms table.

    I am trying to write (fake) statistics as part of an experiment and I am using prepare_column_values and set_column_stats to create the histogram details, but I am not able to specify the endpoint_actual_value.

    Could you help me with a simple example on how to write the endpoint_actual_value to oracle histograms?

    Oracle Version : Oracle Express Edition 11g.

    Thank you.

    Comment by ashokeiisc — May 12, 2015 @ 8:05 am BST May 12,2015 | Reply

    • ashokeiisc,

      Just as a first quick check – have you included in your array of values two values which match on at least the first 6 characters (as ‘JUMBO CAR’ and ‘JUMBO BOX’ have) ?

      You get endpoint_actual_values recorded only if there are apparent duplicates in endpoint_values – and it typically requires about the first 6 characters to match in the original strings for this to happen.

      Comment by Jonathan Lewis — May 12, 2015 @ 9:05 am BST May 12,2015 | Reply

  10. […] suspect the problem is related to the way that character histograms are built based on a numeric representation of the string that takes the first few characters of the string, […]

    Pingback by Histogram Limit | Oracle Scratchpad — October 24, 2015 @ 7:12 pm BST Oct 24,2015 | Reply

  11. […] There’s a brief summary of the algorithm Oracle uses to generate values for character-based histograms at this URL. […]

    Pingback by Histogram Upgrade | Oracle Scratchpad — December 2, 2016 @ 3:09 pm GMT Dec 2,2016 | Reply

  12. […] put it another way, how do we get 0.000157 as the selectivity of the predicate. We need to refer to a note I wrote a few years ago to help us on our way (with a little caveat due to a change that appeared in 11.2.0.4) – what […]

    Pingback by Character selectivity | Oracle Scratchpad — April 14, 2017 @ 12:40 pm BST Apr 14,2017 | Reply

  13. […] has been populated in this example. This is because (with Oracle’s interesting transformation from character to numeric) the three ‘PPPPPPx’ character values turn into the same number – so Oracle […]

    Pingback by Hybrid Fake | Oracle Scratchpad — October 10, 2018 @ 3:13 pm BST Oct 10,2018 | Reply

  14. […] vengono presi i primi 32 byte. Jonathan Lewis da più dettagli in una serie di post, anche se si riferisce ai “Frequency Histograms”, per cui dovrò indagare […]

    Pingback by Istogrammi su tipo dato *CHAR – Oracle and other — December 20, 2018 @ 2:15 pm GMT Dec 20,2018 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.