October 5, 2010

Frequency Histogram 4

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 6:25 pm BST Oct 5,2010

In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.

The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:

• Take the first 15 bytes of the string (after padding the string with nulls (for varchar2()) or spaces (for char() – until 11.2.0.4))
• Treat the result as a hexadecimal number, and convert to decimal
• Round to 15 significant digits using round(N, -21) and store as the endpoint_value
• If duplicate rows appear, store the first 32 characters (increased to 64 for 12c) of each string as the endpoint_actual_value

Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least 5 characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):

```
rem
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2010
rem
rem     How to read a frequency histogram on a character column
rem
column endpoint_actual_value format a32
select
endpoint_number                    cum_frequency,
endpoint_number - prev_endpoint    frequency,
hex_val,
chr(to_number(substr(hex_val, 2,2),'XX')) ||
chr(to_number(substr(hex_val, 4,2),'XX')) ||
chr(to_number(substr(hex_val, 6,2),'XX')) ||
chr(to_number(substr(hex_val, 8,2),'XX')) ||
chr(to_number(substr(hex_val,10,2),'XX')) ||
chr(to_number(substr(hex_val,12,2),'XX')),
endpoint_actual_value
from    (
select
endpoint_number,
lag(endpoint_number,1,0) over(
order by endpoint_number
)                                                       prev_endpoint,
to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
endpoint_actual_value
from
dba_tab_histograms
where
owner = 'XXX'
and     table_name = 'YYY'
and     column_name = 'STATUS_COLUMN'
)
order by
endpoint_number
;

set doc off
doc

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
40254      40254  434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
40467        213  434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
40592        125  44454C49564550D642CA2965000000 DELIVE DELIVERED
41304        712  494E564F49432991BF41C99E800000 INVOIC INVOICED
41336         32  4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
41435          1  5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
41478         43  5049434B4544013F0FF93F6EC00000 PICKED PICKED
41479          1  524546554E4436441DE2A321000000 REFUND REFUND MADE
41480          1  524546554E4436441DE2A321000000 REFUND REFUND PENDING
41482          2  52455455524E2F6693F753B6C00000 RETURN RETURNED

11 rows selected.

#
```

You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.

There are a number of side effects to the 15 byte / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.

Footnote:

It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.

Update:

I’ve tested this code on versions up to 18.3.0.0 with the same results. In the course of the upgrade, though, the stored actual value extends from 32 bytes to 64 bytes for 12c, and the treatment of char() – as opposed to varchar2() – columns changed in 11.2.0.4 which means Oracle now strips trailing spaces from char() values and pads them with null (zero) bytes if they are shorter than the required 6 bytes.

1. Little typo in the footnote: you wrote “on the other hadn” instead of on “the other hand”.

If I understand Oracle 11g has a different algorithm, right?

Comment by lascoltodelvenerdi — October 7, 2010 @ 3:26 pm BST Oct 7,2010

• Thanks,
I’ve fixed the typo.

I haven’t seen a change in the algorithm for generating the content of a histogram in 11g – but I ran this example against 11.1, not 11.2.

Comment by Jonathan Lewis — October 7, 2010 @ 7:08 pm BST Oct 7,2010

2. […] an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]

Pingback by Frequency Histogram 5 « Oracle Scratchpad — October 13, 2010 @ 9:23 am BST Oct 13,2010

3. […] under: Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm UTC Oct 19,2010 In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]

Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010

4. ```rajesh@10GR2> SELECT endpoint_number,
2    endpoint_number - NVL(prev_endpoint,0) frequency,
3    hex_values,
4    chr(to_number(SUBSTR(hex_values, 2,2),'XX'))
5    || chr(to_number(SUBSTR(hex_values, 4,2),'XX'))
6    || chr(to_number(SUBSTR(hex_values, 6,2),'XX'))
7    || chr(to_number(SUBSTR(hex_values, 8,2),'XX'))
8    || chr(to_number(SUBSTR(hex_values,10,2),'XX'))
9    || chr(to_number(SUBSTR(hex_values,12,2),'XX'))
10    || chr(to_number(SUBSTR(hex_values,14,2),'XX'))
11    || chr(to_number(SUBSTR(hex_values,16,2),'XX'))
12    || chr(to_number(SUBSTR(hex_values,18,2),'XX'))
13    || chr(to_number(SUBSTR(hex_values,20,2),'XX'))
14    || chr(to_number(SUBSTR(hex_values,22,2),'XX'))
15    || chr(to_number(SUBSTR(hex_values,24,2),'XX'))
16    as column_values
17  FROM
18    (SELECT endpoint_number,
19      lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint,
20      TO_CHAR(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_values,
21      endpoint_actual_value
22    FROM user_tab_histograms
23    WHERE table_name  = 'T'
24    AND column_name = 'OBJECT_NAME'
25    )
26  ORDER BY endpoint_number ;

ENDPOINT_NUMBER  FREQUENCY HEX_VALUES                      COLUMN_VALUE
--------------- ---------- ------------------------------- ------------
2          1  41534F5F53452F1F9B9266BEA00000 ASO_SE/?++f+
5          3  49525F5354474CEC7454244BE00000 IR_STGL8tT\$K
7          2  49585F53544751CD2477F0CDA00000 IX_STGQ-\$w=-
8          1  504B475F50523F5F66AC4457A00000 PKG_PR?_f¼DW
9          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j
10          1  52505449524141D86FB6896A200000 RPTIRAA+o¦+j

7 rows selected.

Elapsed: 00:00:00.90
rajesh@10GR2>
```

I am not getting the proper values in the column COLUMN_VALUE. Is this issue is due to Character set?

Comment by Rajeshwaran — January 21, 2011 @ 3:46 pm GMT Jan 21,2011

• Rajeshwaran,
This seems to be covered by my comment:

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original.

I stopped at six characters because I knew that the 7th onwards would be garbage.

Comment by Jonathan Lewis — January 21, 2011 @ 5:19 pm GMT Jan 21,2011

5. How did you arrived at the algorithm described above?

Is that available in Oracle product documentation? if yes, can you provide link to that?

Comment by Rajeshwaran — January 22, 2011 @ 8:39 am GMT Jan 22,2011

6. Jonathan,

I just came across this while I was working on a different issue – I think your description above how Oracle arrives at the numeric representation is not consistent with what you’ve outlined in “Cost Based Oracle – Fundamentals”.

The first step should read: Take the first 15 bytes, not six (Of course it’s obvious where the six comes from, but the result will not be consistent with Oracle’s result when using only the first six bytes).

By the way, for a original string less than six (single-byte) characters the last character can be incorrect due to the rounding applied, so you don’t end up necessarily with five correctly extracted characters sometimes.

Randolf

Comment by Randolf Geist — June 9, 2011 @ 8:27 pm BST Jun 9,2011

• Randolf,

Thanks for pointing out the error – now fixed. I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

You’re right about the “less than six characters” of course. I actually gave an example of this with the “NEW / NEVÿÿÿ” switch in the example, but I’ve added in a comment to that effect where I make the point about the 6th character possibly being wrong.

Comment by Jonathan Lewis — June 10, 2011 @ 11:11 am BST Jun 10,2011

• > I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.

May be 0x20 = 32?

Randolf

Comment by Randolf Geist — June 10, 2011 @ 11:26 am BST Jun 10,2011

7. […] I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its numeric form to its character equivalent, if you have ASCII codes at your finger tips you’ll spot lots of spaces (0×20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: https://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ […]

Pingback by Histograms Part 1 – Why? – All Things Oracle — August 28, 2013 @ 4:22 pm BST Aug 28,2013

8. […] wrote a note a few years ago about translating the endpoint_value for histograms on character columns, and in that note I left […]

Pingback by endpoint_value | Oracle Scratchpad — May 5, 2014 @ 2:58 pm BST May 5,2014

9. Hi Jonathan,

I am trying to fake the histograms for a CHAR column. For a CHAR datatype column, when gather_table_stats was run, the endpoint_actual_value had the char values like ‘JUMBO BOX’, ‘JUMBO CAR’, etc. in user_tab_histograms table.

I am trying to write (fake) statistics as part of an experiment and I am using prepare_column_values and set_column_stats to create the histogram details, but I am not able to specify the endpoint_actual_value.

Could you help me with a simple example on how to write the endpoint_actual_value to oracle histograms?

Oracle Version : Oracle Express Edition 11g.

Thank you.

Comment by ashokeiisc — May 12, 2015 @ 8:05 am BST May 12,2015

• ashokeiisc,

Just as a first quick check – have you included in your array of values two values which match on at least the first 6 characters (as ‘JUMBO CAR’ and ‘JUMBO BOX’ have) ?

You get endpoint_actual_values recorded only if there are apparent duplicates in endpoint_values – and it typically requires about the first 6 characters to match in the original strings for this to happen.

Comment by Jonathan Lewis — May 12, 2015 @ 9:05 am BST May 12,2015

10. […] suspect the problem is related to the way that character histograms are built based on a numeric representation of the string that takes the first few characters of the string, […]

Pingback by Histogram Limit | Oracle Scratchpad — October 24, 2015 @ 7:12 pm BST Oct 24,2015

11. […] There’s a brief summary of the algorithm Oracle uses to generate values for character-based histograms at this URL. […]

Pingback by Histogram Upgrade | Oracle Scratchpad — December 2, 2016 @ 3:09 pm GMT Dec 2,2016

12. […] put it another way, how do we get 0.000157 as the selectivity of the predicate. We need to refer to a note I wrote a few years ago to help us on our way (with a little caveat due to a change that appeared in 11.2.0.4) – what […]

Pingback by Character selectivity | Oracle Scratchpad — April 14, 2017 @ 12:40 pm BST Apr 14,2017

13. […] has been populated in this example. This is because (with Oracle’s interesting transformation from character to numeric) the three ‘PPPPPPx’ character values turn into the same number – so Oracle […]

Pingback by Hybrid Fake | Oracle Scratchpad — October 10, 2018 @ 3:13 pm BST Oct 10,2018

14. […] vengono presi i primi 32 byte. Jonathan Lewis da più dettagli in una serie di post, anche se si riferisce ai “Frequency Histograms”, per cui dovrò indagare […]

Pingback by Istogrammi su tipo dato *CHAR – Oracle and other — December 20, 2018 @ 2:15 pm GMT Dec 20,2018

This site uses Akismet to reduce spam. Learn how your comment data is processed.