In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.
The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:
- Take the first 15 bytes of the string (after padding the string with nulls (for varchar2) or spaces (for char))
- Treat the result as a hexadecimal number, and convert to decimal
to 15 significant digitsusing round(N, -21) and store as the endpoint_value
- If duplicate rows appear, store the first 32 characters (increased to 64 for 12c) of each string as the endpoint_actual_value
Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).
With a nice friendly single-byte character code, the first 5 characters will be extracted correctly (assuming there are at least 5 characters in the string), and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):
rem rem How to read a frequency histogram on a character column rem select endpoint_number, endpoint_number - nvl(prev_endpoint,0) frequency, hex_val, chr(to_number(substr(hex_val, 2,2),'XX')) || chr(to_number(substr(hex_val, 4,2),'XX')) || chr(to_number(substr(hex_val, 6,2),'XX')) || chr(to_number(substr(hex_val, 8,2),'XX')) || chr(to_number(substr(hex_val,10,2),'XX')) || chr(to_number(substr(hex_val,12,2),'XX')), endpoint_actual_value from ( select endpoint_number, lag(endpoint_number,1) over( order by endpoint_number ) prev_endpoint, to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val, endpoint_actual_value from dba_tab_histograms where owner = 'XXX' and table_name = 'YYY' and column_name = 'STATUS_COLUMN' ) order by endpoint_number ; set doc off doc ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE --------------- ---------- ------------------------------- ------ ------------------------------------------ 40254 40254 434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED 40467 213 434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED 40592 125 44454C49564550D642CA2965000000 DELIVE DELIVERED 41304 712 494E564F49432991BF41C99E800000 INVOIC INVOICED 41336 32 4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW 41434 98 5041494400000C08C1A415AD800000 PAID PAID 41435 1 5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE 41478 43 5049434B4544013F0FF93F6EC00000 PICKED PICKED 41479 1 524546554E4436441DE2A321000000 REFUND REFUND MADE 41480 1 524546554E4436441DE2A321000000 REFUND REFUND PENDING 41482 2 52455455524E2F6693F753B6C00000 RETURN RETURNED 11 rows selected. #
You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.
There are a number of side effects to the 15 byte / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.
Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.