In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you that the strategy could lead to a couple of anomalies if you were unlucky. This note describes one such anomaly. We start with a slightly curious data set:
rem rem Script: freq_hist_char.sql rem Author: Jonathan Lewis rem Dated: Nov 2007 rem create table t1 (v1 varchar2(42)); insert into t1 select 'short' from all_objects where rownum <= 100 -- > hint to avoid wordpress format issue ; insert into t1 select 'shorter' from all_objects where rownum <= 300 -- > hint to avoid wordpress format issue ; insert into t1 select 'shortest' from all_objects where rownum <= 500 -- > hint to avoid wordpress format issue ; insert into t1 select 'shortest_thing_in_the_recorded_data_not' from all_objects where rownum <= 700 -- > hint to avoid wordpress format issue ; insert into t1 select 'shortest_thing_in_the_recorded_data_really' from all_objects where rownum <= 900 -- > hint to avoid wordpress format issue ; begin dbms_stats.gather_table_stats( ownname => user, tabname =>'t1', estimate_percent => 100, method_opt => 'for all columns size 254' ); end; /
If we run the script from the earlier posting to see what Oracle has stored, we get the following[***]:
ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE --------------- ---------- ------------------------------- ------ ------------------------------------------ 100 100 73686F72740018721DBD93B2E00000 short short 400 300 73686F727465871679E2CF40400000 shorte shorter 900 500 73686F727465871679E2CF40400000 shorte shortest 2500 1600 73686F727465871679E2CF40400000 shorte shortest_thing_in_the_recorded_d
*** Update: This changes in Oracle 12c, which stores up to 64 bytes for the endpoint_actual_value. The principle, however, doesn’t change; only the length of character strings before you run into the problem.
Because two of our values were identical to the first 32 characters Oracle has recorded them as the same, and stored them under the same (incomplete) entry. So what happens when you query for a value that isn’t in the table, and doesn’t get mentioned in the histogram – but looks similar to the other two long items ?
set autotrace traceonly explain select count(*) from t1 where v1 = 'shortest_thing_in_the_recorded_data' ; select count(*) from t1 where v1 = 'shortest_thing_in_the_data' ; set autotrace off ----------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ----------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 29 | 4 | | 1 | SORT AGGREGATE | | 1 | 29 | | |* 2 | TABLE ACCESS FULL| T1 | 1600 | 46400 | 4 | ----------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("V1"='shortest_thing_in_the_recorded_data') ----------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | ----------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 29 | 4 | | 1 | SORT AGGREGATE | | 1 | 29 | | |* 2 | TABLE ACCESS FULL| T1 | 50 | 1450 | 4 | ----------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - filter("V1"='shortest_thing_in_the_data')
Neither of the two strings in my predicate appears in the data or in the histogram – but the first string (after applying the histogram algorithm) is a match for an entry in the histogram so Oracle reports the value derived from the matching endpoint_number as the cardinality. The second string doesn’t appear even after applying the algorithm – so Oracle has supplied a cardinality of 50, which is half the cardinality of the least frequently occuring value that it can find in the histogram. (Note – if you are running 10.2.0.3 or earlier the cardinality for this case would be 1; the optimizer changed in 10.2.0.4).
This means we have demonstrated a situation where two “rare” values which should be treated identically end up with dramatically different cardinalities estimates and could, therefore, end up being subject to dramatically different execution plans.
Conclusion:
If you have fairly long, and similar, strings in a column that is a good candidate for a frequency histogram (e.g. a very descriptive status column) then you have a problem if a value that is very rare looks identical to a very popular value up to the first 32 (64 in 12c) characters. You may find that the only solution is to change the list of legal values (although various strategies involving virtual columns or function-based indexes can bypass the problem).
Update for 12c – Aug 2013
One of the many small changes in 12c that will affect some code is that the code for handling the endpoint_actual_value has changed. The definition for the column has increased from varchar2(1000) to varchar2(4000), but more significantly the endpoint_actual_value now holds up to the first 64 characters of the string – so my example above behaves “properly” in 12c – Oracle recognises that the string ‘shortest_thing_in_the_recorded_data’ isn’t in the data set. If you want to recreate the problem you simply need to make the matching sections of the strings 32 characters longer.
Update May 2014
Thanks to a response to an example of this type of problem appearing on OTN (all values distinct, but only after the first 35 bytes) I’ve learned that there’s a patch for this problem (which is fixed in 12.2). The note happens to use a an example with nvarchar2() but that’s not an important variation on the concept: Bug 18377553 : POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES
[…] the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could […]
Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 |
[…] Effects of long characters strings in frequency histograms is better: improved. […]
Pingback by 12 Histogram fixes | Oracle Scratchpad — August 4, 2013 @ 6:52 pm BST Aug 4,2013 |
[…] Long strings problem – until 12c Oracle stored at most 32 bytes of a string in the endpoint_actual_value column. […]
Pingback by Faking Histograms | Oracle Scratchpad — October 15, 2018 @ 1:37 pm BST Oct 15,2018 |
[…] the cardinality calculations can go wrong. If two character match over the first few characters the numeric representation of those strings that Oracle uses in a histogram is identical, and if they are long enough even the […]
Pingback by Misdirection | Oracle Scratchpad — December 7, 2018 @ 11:48 am GMT Dec 7,2018 |