In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.
The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:
- Take the first 15 bytes of the string (after padding the string with nulls (for varchar2()) or spaces (for char() – until 11.2.0.4))
- Treat the result as a hexadecimal number, and convert to decimal
- Round
to 15 significant digitsusing round(N, -21) and store as the endpoint_value - If duplicate rows appear, store the first 32 characters (increased to 64 for 12c) of each string as the endpoint_actual_value
Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).
With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least 5 characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):
rem rem Script: read_char_hist.sql rem Author: Jonathan Lewis rem Dated: Oct 2010 rem rem How to read a frequency histogram on a character column rem column endpoint_actual_value format a32 select endpoint_number cum_frequency, endpoint_number - prev_endpoint frequency, hex_val, chr(to_number(substr(hex_val, 2,2),'XX')) || chr(to_number(substr(hex_val, 4,2),'XX')) || chr(to_number(substr(hex_val, 6,2),'XX')) || chr(to_number(substr(hex_val, 8,2),'XX')) || chr(to_number(substr(hex_val,10,2),'XX')) || chr(to_number(substr(hex_val,12,2),'XX')), endpoint_actual_value from ( select endpoint_number, lag(endpoint_number,1,0) over( order by endpoint_number ) prev_endpoint, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val, endpoint_actual_value from dba_tab_histograms where owner = 'XXX' and table_name = 'YYY' and column_name = 'STATUS_COLUMN' ) order by endpoint_number ; set doc off doc ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE --------------- ---------- ------------------------------- ------ ------------------------------------------ 40254 40254 434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED 40467 213 434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED 40592 125 44454C49564550D642CA2965000000 DELIVE DELIVERED 41304 712 494E564F49432991BF41C99E800000 INVOIC INVOICED 41336 32 4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW 41434 98 5041494400000C08C1A415AD800000 PAID PAID 41435 1 5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE 41478 43 5049434B4544013F0FF93F6EC00000 PICKED PICKED 41479 1 524546554E4436441DE2A321000000 REFUND REFUND MADE 41480 1 524546554E4436441DE2A321000000 REFUND REFUND PENDING 41482 2 52455455524E2F6693F753B6C00000 RETURN RETURNED 11 rows selected. #
You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.
There are a number of side effects to the 15 byte / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.
Footnote:
It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.
Update:
I’ve tested this code on versions up to 18.3.0.0 with the same results. In the course of the upgrade, though, the stored actual value extends from 32 bytes to 64 bytes for 12c, and the treatment of char() – as opposed to varchar2() – columns changed in 11.2.0.4 which means Oracle now strips trailing spaces from char() values and pads them with null (zero) bytes if they are shorter than the required 6 bytes.
Little typo in the footnote: you wrote “on the other hadn” instead of on “the other hand”.
If I understand Oracle 11g has a different algorithm, right?
Comment by lascoltodelvenerdi — October 7, 2010 @ 3:26 pm BST Oct 7,2010 |
Thanks,
I’ve fixed the typo.
I haven’t seen a change in the algorithm for generating the content of a histogram in 11g – but I ran this example against 11.1, not 11.2.
Comment by Jonathan Lewis — October 7, 2010 @ 7:08 pm BST Oct 7,2010 |
[…] an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]
Pingback by Frequency Histogram 5 « Oracle Scratchpad — October 13, 2010 @ 9:23 am BST Oct 13,2010 |
[…] under: Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm UTC Oct 19,2010 In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and […]
Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 |
I am not getting the proper values in the column COLUMN_VALUE. Is this issue is due to Character set?
Comment by Rajeshwaran — January 21, 2011 @ 3:46 pm GMT Jan 21,2011 |
Rajeshwaran,
This seems to be covered by my comment:
With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original.
I stopped at six characters because I knew that the 7th onwards would be garbage.
Comment by Jonathan Lewis — January 21, 2011 @ 5:19 pm GMT Jan 21,2011 |
How did you arrived at the algorithm described above?
Is that available in Oracle product documentation? if yes, can you provide link to that?
Comment by Rajeshwaran — January 22, 2011 @ 8:39 am GMT Jan 22,2011 |
I created a series of test cases, and proposed the simplest explanation for the results.
I published some details in “Cost Based Oracle – Fundamentals.”
Comment by Jonathan Lewis — January 23, 2011 @ 2:21 pm GMT Jan 23,2011 |
Page 117 more exactly. That helped me explain a client strange behaviour. Thanks for this Jonathan!
Comment by Eugen IACOB — January 30, 2011 @ 9:18 pm GMT Jan 30,2011 |
Jonathan,
I just came across this while I was working on a different issue – I think your description above how Oracle arrives at the numeric representation is not consistent with what you’ve outlined in “Cost Based Oracle – Fundamentals”.
The first step should read: Take the first 15 bytes, not six (Of course it’s obvious where the six comes from, but the result will not be consistent with Oracle’s result when using only the first six bytes).
By the way, for a original string less than six (single-byte) characters the last character can be incorrect due to the rounding applied, so you don’t end up necessarily with five correctly extracted characters sometimes.
Randolf
Comment by Randolf Geist — June 9, 2011 @ 8:27 pm BST Jun 9,2011 |
Randolf,
Thanks for pointing out the error – now fixed. I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.
You’re right about the “less than six characters” of course. I actually gave an example of this with the “NEW / NEVÿÿÿ” switch in the example, but I’ve added in a comment to that effect where I make the point about the 6th character possibly being wrong.
Comment by Jonathan Lewis — June 10, 2011 @ 11:11 am BST Jun 10,2011 |
> I had also made a comment in the same line about Oracle extracting the first 20 bytes, and I can’t think where I got that from.
May be 0x20 = 32?
Randolf
Comment by Randolf Geist — June 10, 2011 @ 11:26 am BST Jun 10,2011 |
[…] I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its numeric form to its character equivalent, if you have ASCII codes at your finger tips you’ll spot lots of spaces (0×20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: https://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ […]
Pingback by Histograms Part 1 – Why? – All Things Oracle — August 28, 2013 @ 4:22 pm BST Aug 28,2013 |
[…] wrote a note a few years ago about translating the endpoint_value for histograms on character columns, and in that note I left […]
Pingback by endpoint_value | Oracle Scratchpad — May 5, 2014 @ 2:58 pm BST May 5,2014 |
Hi Jonathan,
I am trying to fake the histograms for a CHAR column. For a CHAR datatype column, when gather_table_stats was run, the endpoint_actual_value had the char values like ‘JUMBO BOX’, ‘JUMBO CAR’, etc. in user_tab_histograms table.
I am trying to write (fake) statistics as part of an experiment and I am using prepare_column_values and set_column_stats to create the histogram details, but I am not able to specify the endpoint_actual_value.
Could you help me with a simple example on how to write the endpoint_actual_value to oracle histograms?
Oracle Version : Oracle Express Edition 11g.
Thank you.
Comment by ashokeiisc — May 12, 2015 @ 8:05 am BST May 12,2015 |
ashokeiisc,
Just as a first quick check – have you included in your array of values two values which match on at least the first 6 characters (as ‘JUMBO CAR’ and ‘JUMBO BOX’ have) ?
You get endpoint_actual_values recorded only if there are apparent duplicates in endpoint_values – and it typically requires about the first 6 characters to match in the original strings for this to happen.
Comment by Jonathan Lewis — May 12, 2015 @ 9:05 am BST May 12,2015 |
[…] suspect the problem is related to the way that character histograms are built based on a numeric representation of the string that takes the first few characters of the string, […]
Pingback by Histogram Limit | Oracle Scratchpad — October 24, 2015 @ 7:12 pm BST Oct 24,2015 |
[…] https://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/ […]
Pingback by Execution Plan: Como interpretar os valores estatísticos estimados? | DBTimeWizard — December 6, 2015 @ 10:32 pm GMT Dec 6,2015 |
[…] There’s a brief summary of the algorithm Oracle uses to generate values for character-based histograms at this URL. […]
Pingback by Histogram Upgrade | Oracle Scratchpad — December 2, 2016 @ 3:09 pm GMT Dec 2,2016 |
[…] put it another way, how do we get 0.000157 as the selectivity of the predicate. We need to refer to a note I wrote a few years ago to help us on our way (with a little caveat due to a change that appeared in 11.2.0.4) – what […]
Pingback by Character selectivity | Oracle Scratchpad — April 14, 2017 @ 12:40 pm BST Apr 14,2017 |
[…] has been populated in this example. This is because (with Oracle’s interesting transformation from character to numeric) the three ‘PPPPPPx’ character values turn into the same number – so Oracle […]
Pingback by Hybrid Fake | Oracle Scratchpad — October 10, 2018 @ 3:13 pm BST Oct 10,2018 |
[…] vengono presi i primi 32 byte. Jonathan Lewis da più dettagli in una serie di post, anche se si riferisce ai “Frequency Histograms”, per cui dovrò indagare […]
Pingback by Istogrammi su tipo dato *CHAR – Oracle and other — December 20, 2018 @ 2:15 pm GMT Dec 20,2018 |