Oracle Scratchpad

October 13, 2010

Frequency Histogram 5

Filed under: Histograms,Statistics,Troubleshooting — Jonathan Lewis @ 9:21 am UTC Oct 13,2010

In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you  that the strategy could lead to a couple of anomalies if you were unlucky. This note describes one such anomaly. We start with a slightly curious data set:

create table t1  (v1 varchar2(42));

insert	into t1
select	'short'
from	all_objects
where 	rownum <= 100
;

insert into t1
select	'shorter'
from 	all_objects
where 	rownum <= 300
;

insert into t1
select 	'shortest'
from 	all_objects
where 	rownum <= 500
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_not'
from 	all_objects
where 	rownum <= 700
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_really'
from 	all_objects
where 	rownum <= 900
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'t1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 254'
	);
end;
/

If we run the script from the earlier posting to see what Oracle has stored, we get the following:

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
            100        100  73686F72740018721DBD93B2E00000 short  short
            400        300  73686F727465871679E2CF40400000 shorte shorter
            900        500  73686F727465871679E2CF40400000 shorte shortest
           2500       1600  73686F727465871679E2CF40400000 shorte shortest_thing_in_the_recorded_d

Because two of our values were identical to the first 32 characters Oracle has recorded them as the same, and stored them under the same (incomplete) entry. So what happens when you query for a value that isn’t in the table, and doesn’t get mentioned in the histogram – but looks similar to the other two long items ?


set autotrace traceonly explain

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_recorded_data'
;

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_data'
;

set autotrace off

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |  1600 | 46400 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_recorded_data')

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1450 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_data')

Neither of the two strings in my predicate appear in the data or in the histogram – but the first string (after applying the histogram algorithm) is a match for an entry in the histogram so Oracle reports the value derived from the matching endpoint_number as the cardinality. The second string doesn’t appear even after applying the algorithm – so Oracle has supplied a cardinality of 50, which is half the cardinality of the least frequently occuring value that it can find in the histogram. (Note – if you are running 10.2.0.3 or earlier the cardinality for this case would be 1; the optimizer changed in 10.2.0.4).

This means we have demonstrated a situation where two “rare” values which should be treated identically end up with dramatically different cardinalities estimates and could, therefore, end up being subject to dramatically different execution plans.

Conclusion: If you have fairly long, and similar, strings in a column that is a good candidate for a frequency histogram (e.g. a very descriptive status column) then you have a problem if a value that is very rare looks identical to a very popular value up to the first 32 characters. You may find that the only solution is to change the list of legal values (although various strategies involving virtual columns or function-based indexes can bypass the problem).

[Further reading on Histograms]

3 Comments »

  1. [quote]
    …Oracle reports the histogram endpoint_number as the cardinality…
    [/quote]

    Just a nitpick: It seems from looking at the plan and the script output that Oracle reports the frequency (1600), not the endpoint_number (2500).

    Comment by Flado — October 13, 2010 @ 2:49 pm UTC Oct 13,2010 | Reply

  2. [...] the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could [...]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am UTC Oct 20,2010 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 1,394 other followers