In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you that the strategy could lead to a couple of anomalies if you were unlucky. This note describes one such anomaly. We start with a slightly curious data set:
create table t1 (v1 varchar2(42)); insert into t1 select 'short' from all_objects where rownum <= 100 ; insert into t1 select 'shorter' from all_objects where rownum <= 300 ; insert into t1 select 'shortest' from all_objects where rownum <= 500 ; insert into t1 select 'shortest_thing_in_the_recorded_data_not' from all_objects where rownum <= 700 ; insert into t1 select 'shortest_thing_in_the_recorded_data_really' from all_objects where rownum <= 900 ; begin dbms_stats.gather_table_stats( ownname => user, tabname =>'t1', estimate_percent => 100, method_opt => 'for all columns size 254' ); end; /
If we run the script from the earlier posting to see what Oracle has stored, we get the following:
ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
100 100 73686F72740018721DBD93B2E00000 short short
400 300 73686F727465871679E2CF40400000 shorte shorter
900 500 73686F727465871679E2CF40400000 shorte shortest
2500 1600 73686F727465871679E2CF40400000 shorte shortest_thing_in_the_recorded_d
Because two of our values were identical to the first 32 characters Oracle has recorded them as the same, and stored them under the same (incomplete) entry. So what happens when you query for a value that isn’t in the table, and doesn’t get mentioned in the histogram – but looks similar to the other two long items ?
set autotrace traceonly explain
select
count(*)
from
t1
where
v1 = 'shortest_thing_in_the_recorded_data'
;
select
count(*)
from
t1
where
v1 = 'shortest_thing_in_the_data'
;
set autotrace off
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 | 4 |
| 1 | SORT AGGREGATE | | 1 | 29 | |
|* 2 | TABLE ACCESS FULL| T1 | 1600 | 46400 | 4 |
-----------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("V1"='shortest_thing_in_the_recorded_data')
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 29 | 4 |
| 1 | SORT AGGREGATE | | 1 | 29 | |
|* 2 | TABLE ACCESS FULL| T1 | 50 | 1450 | 4 |
-----------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("V1"='shortest_thing_in_the_data')
Neither of the two strings in my predicate appear in the data or in the histogram – but the first string (after applying the histogram algorithm) is a match for an entry in the histogram so Oracle reports the value derived from the matching endpoint_number as the cardinality. The second string doesn’t appear even after applying the algorithm – so Oracle has supplied a cardinality of 50, which is half the cardinality of the least frequently occuring value that it can find in the histogram. (Note – if you are running 10.2.0.3 or earlier the cardinality for this case would be 1; the optimizer changed in 10.2.0.4).
This means we have demonstrated a situation where two “rare” values which should be treated identically end up with dramatically different cardinalities estimates and could, therefore, end up being subject to dramatically different execution plans.
Conclusion: If you have fairly long, and similar, strings in a column that is a good candidate for a frequency histogram (e.g. a very descriptive status column) then you have a problem if a value that is very rare looks identical to a very popular value up to the first 32 characters. You may find that the only solution is to change the list of legal values (although various strategies involving virtual columns or function-based indexes can bypass the problem).

[quote]
…Oracle reports the histogram endpoint_number as the cardinality…
[/quote]
Just a nitpick: It seems from looking at the plan and the script output that Oracle reports the frequency (1600), not the endpoint_number (2500).
Comment by Flado — October 13, 2010 @ 2:49 pm UTC Oct 13,2010 |
Flado,
Thanks for that. I didn’t say quite what I had intended – I’ve corrected the statement.
Comment by Jonathan Lewis — October 14, 2010 @ 3:21 pm UTC Oct 14,2010 |
[...] the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could [...]
Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am UTC Oct 20,2010 |