In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you that the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could appear in less extreme cases. Again, we start by constructing a data set.
create table t1 (v1 varchar2(42));
insert into t1
select 'next day'
from all_objects
where rownum <= 150
;
insert into t1
select 'same day'
from all_objects
where rownum <= 350
;
insert into t1
select 'priority high'
from all_objects
where rownum <= 500
;
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname =>'t1',
estimate_percent => 100,
method_opt => 'for all columns size 254'
);
end;
/
select
endpoint_number,
endpoint_number - nvl(prev_endpoint,0) frequency,
hex_val,
chr(to_number(substr(hex_val, 2,2),'XX')) ||
chr(to_number(substr(hex_val, 4,2),'XX')) ||
chr(to_number(substr(hex_val, 6,2),'XX')) ||
chr(to_number(substr(hex_val, 8,2),'XX')) ||
chr(to_number(substr(hex_val,10,2),'XX')) ||
chr(to_number(substr(hex_val,12,2),'XX')),
endpoint_actual_value
from (
select
endpoint_number,
lag(endpoint_number,1) over(
order by endpoint_number
) prev_endpoint,
to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
endpoint_actual_value
from
user_tab_histograms
where
table_name = 'T1'
and column_name = 'V1'
)
order by
endpoint_number
;
ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
150 150 6E65787420644D6D50DD04A6400000 next d
650 500 7072696F72698970D6651967200000 priori
1000 350 73616D6520645C36B7AD479D600000 same d
3 rows selected.
This example captures a histogram at a point when my data set has only three popular values in it, and those values are sufficiently different that their “6-byte encoding” is enough for Oracle to tell the difference between them so the histogram doesn’t use the endpoint_actual_value column.
But I have a problem. During the course of the working day I introduce some data with different status codes, and all the queries I run are about these status codes, but by the end of the day (when, possibly, the stats will be collected again by the automatic stats collection job) all the “interesting” data has been updated to one of the three popular values.
What does Oracle do about queries for my “rare” values ? It depends what they look like. Here’s a couple of examples from 10.2.0.3:
select count(*) from t1 where v1 = 'overnight' ; select count(*) from t1 where v1 = 'priority low' ; set autotrace off
Neither of these values is in the histogram, so I would like Oracle to give a low cardinality estimate to them. But one of the values is very similar to a value that DOES exist in the histogram. This leads to an interesting variation in the numbers that appear in the execution plans:
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 11 | 2 |
| 1 | SORT AGGREGATE | | 1 | 11 | |
|* 2 | TABLE ACCESS FULL| T1 | 1 | 11 | 2 |
-----------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("V1"='overnight')
Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 11 | 2 |
| 1 | SORT AGGREGATE | | 1 | 11 | |
|* 2 | TABLE ACCESS FULL| T1 | 500 | 5500 | 2 |
-----------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("V1"='priority low')
In the first case (‘overnight’) the optimizer has decided that the value doesn’t exist in the histogram. Since this database is running 10.2.0.3 it has set the resulting cardinality to 1 (this would be 75 in 10.2.0.4 or 11g – half the lowest known value in the histogram). But in the second case (‘priority low’) the optimizer has done its 6-byte manipulation and decided that ‘priority low’ matches ‘priority high’, and given it a cardinality of 500.
If Oracle had seen ‘priority low’ when creating the histogram it would have used the endpoint_actual_value column and generated two separate rows for ‘priority low’ and ‘priority high’. But it is perfectly feasible that because of timing (the interesting data exists only briefly during the day) or sampling (the interesting data is tiny compared to the popular values) you could see Oracle taking a “large data” execution path when you know that you’re only expecting to handle a small data set.
The solution, of course, is to write code to create a histogram that represents the situation as you want Oracle to see it.

I’ve just had an email pointing out that the link to this article held some odd characters – so I’ve just changed the permalink. Apologies to anyone if this messes up things like RSS feeds.
Comment by Jonathan Lewis — October 20, 2010 @ 8:14 am UTC Oct 20,2010 |