Oracle Scratchpad

October 13, 2010

Frequency Histogram 5

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 9:21 am GMT Oct 13,2010

In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you  that the strategy could lead to a couple of anomalies if you were unlucky. This note describes one such anomaly. We start with a slightly curious data set:

create table t1  (v1 varchar2(42));

insert	into t1
select	'short'
from	all_objects
where 	rownum <= 100
;

insert into t1
select	'shorter'
from 	all_objects
where 	rownum <= 300
;

insert into t1
select 	'shortest'
from 	all_objects
where 	rownum <= 500
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_not'
from 	all_objects
where 	rownum <= 700
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_really'
from 	all_objects
where 	rownum <= 900 ; begin 	dbms_stats.gather_table_stats( 		ownname		 => user,
		tabname		 =>'t1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 254'
	);
end;
/

If we run the script from the earlier posting to see what Oracle has stored, we get the following:

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
            100        100  73686F72740018721DBD93B2E00000 short  short
            400        300  73686F727465871679E2CF40400000 shorte shorter
            900        500  73686F727465871679E2CF40400000 shorte shortest
           2500       1600  73686F727465871679E2CF40400000 shorte shortest_thing_in_the_recorded_d

Because two of our values were identical to the first 32 characters Oracle has recorded them as the same, and stored them under the same (incomplete) entry. So what happens when you query for a value that isn’t in the table, and doesn’t get mentioned in the histogram – but looks similar to the other two long items ?


set autotrace traceonly explain

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_recorded_data'
;

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_data'
;

set autotrace off

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |  1600 | 46400 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_recorded_data')

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1450 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_data')

Neither of the two strings in my predicate appear in the data or in the histogram – but the first string (after applying the histogram algorithm) is a match for an entry in the histogram so Oracle reports the value derived from the matching endpoint_number as the cardinality. The second string doesn’t appear even after applying the algorithm – so Oracle has supplied a cardinality of 50, which is half the cardinality of the least frequently occuring value that it can find in the histogram. (Note – if you are running 10.2.0.3 or earlier the cardinality for this case would be 1; the optimizer changed in 10.2.0.4).

This means we have demonstrated a situation where two “rare” values which should be treated identically end up with dramatically different cardinalities estimates and could, therefore, end up being subject to dramatically different execution plans.

Conclusion: If you have fairly long, and similar, strings in a column that is a good candidate for a frequency histogram (e.g. a very descriptive status column) then you have a problem if a value that is very rare looks identical to a very popular value up to the first 32 characters. You may find that the only solution is to change the list of legal values (although various strategies involving virtual columns or function-based indexes can bypass the problem).

Update for 12c - Aug 2013

One of the many small changes in 12c that will affect some code is that the code for handling the endpoint_actual_value has changed. The definition for the column has increased from varchar2(1000) to varchar2(4000), but more significantly the endpoint_actual_value now holds up to the first 64 characters of the string  – so my example above behaves “properly” in 12c – Oracle recognises that the string ‘shortest_thing_in_the_recorded_data’ isn’t in the data set. If you want to recreate the problem you simply need to make the matching sections of the strings 32 characters longer.

Update May 2014

Thanks to a response to an example of this type of problem appearing on OTN (all values distinct, but only after the first 35 bytes) I’ve learned that there’s a patch for this problem (which is fixed in 12.2). The note happens to use a an example with nvarchar2() but that’s not an important variation on the concept: Bug 18377553 : POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES

 

[Further reading on Histograms]

3 Comments »

  1. [...] the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could [...]

    Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am GMT Oct 20,2010 | Reply

  2. There’s a little update to this example describing a change in 12c that improves the optimizer’s use of frequency histograms where there are lots of “long, similar, strings” in the column.

    Comment by Jonathan Lewis — August 2, 2013 @ 10:42 am GMT Aug 2,2013 | Reply

  3. […] Effects of long characters strings in frequency histograms is better: improved. […]

    Pingback by 12 Histogram fixes | Oracle Scratchpad — August 4, 2013 @ 6:52 pm GMT Aug 4,2013 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Theme: Rubric. Get a free blog at WordPress.com

Follow

Get every new post delivered to your Inbox.

Join 4,422 other followers