Frequency Histogram 5

October 13, 2010

Frequency Histogram 5

Filed under: Histograms,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 9:21 am BST Oct 13,2010

In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you that the strategy could lead to a couple of anomalies if you were unlucky. This note describes one such anomaly. We start with a slightly curious data set:


rem
rem     Script:         freq_hist_char.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2007
rem

create table t1  (v1 varchar2(42));

insert	into t1
select	'short'
from	all_objects
where 	rownum <= 100 -- > hint to avoid wordpress format issue
;

insert into t1
select	'shorter'
from 	all_objects
where 	rownum <= 300 -- > hint to avoid wordpress format issue
;

insert into t1
select 	'shortest'
from 	all_objects
where 	rownum <= 500 -- > hint to avoid wordpress format issue
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_not'
from 	all_objects
where 	rownum <= 700 -- > hint to avoid wordpress format issue
;

insert into t1
select 	'shortest_thing_in_the_recorded_data_really'
from 	all_objects
where 	rownum <= 900 -- > hint to avoid wordpress format issue
;

begin
	dbms_stats.gather_table_stats(
 		ownname		 => user,
		tabname		 =>'t1',
		estimate_percent => 100,
		method_opt 	 => 'for all columns size 254'
	);
end;
/

If we run the script from the earlier posting to see what Oracle has stored, we get the following^[***]:

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
            100        100  73686F72740018721DBD93B2E00000 short  short
            400        300  73686F727465871679E2CF40400000 shorte shorter
            900        500  73686F727465871679E2CF40400000 shorte shortest
           2500       1600  73686F727465871679E2CF40400000 shorte shortest_thing_in_the_recorded_d

*** Update: This changes in Oracle 12c, which stores up to 64 bytes for the endpoint_actual_value. The principle, however, doesn’t change; only the length of character strings before you run into the problem.

Because two of our values were identical to the first 32 characters Oracle has recorded them as the same, and stored them under the same (incomplete) entry. So what happens when you query for a value that isn’t in the table, and doesn’t get mentioned in the histogram – but looks similar to the other two long items ?


set autotrace traceonly explain

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_recorded_data'
;

select
	count(*)
from
	t1
where
	v1 = 'shortest_thing_in_the_data'
;

set autotrace off

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |  1600 | 46400 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_recorded_data')

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    29 |     4 |
|   1 |  SORT AGGREGATE    |      |     1 |    29 |       |
|*  2 |   TABLE ACCESS FULL| T1   |    50 |  1450 |     4 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='shortest_thing_in_the_data')

Neither of the two strings in my predicate appears in the data or in the histogram – but the first string (after applying the histogram algorithm) is a match for an entry in the histogram so Oracle reports the value derived from the matching endpoint_number as the cardinality. The second string doesn’t appear even after applying the algorithm – so Oracle has supplied a cardinality of 50, which is half the cardinality of the least frequently occuring value that it can find in the histogram. (Note – if you are running 10.2.0.3 or earlier the cardinality for this case would be 1; the optimizer changed in 10.2.0.4).

This means we have demonstrated a situation where two “rare” values which should be treated identically end up with dramatically different cardinalities estimates and could, therefore, end up being subject to dramatically different execution plans.

Conclusion:

If you have fairly long, and similar, strings in a column that is a good candidate for a frequency histogram (e.g. a very descriptive status column) then you have a problem if a value that is very rare looks identical to a very popular value up to the first 32 (64 in 12c) characters. You may find that the only solution is to change the list of legal values (although various strategies involving virtual columns or function-based indexes can bypass the problem).

Update for 12c – Aug 2013

One of the many small changes in 12c that will affect some code is that the code for handling the endpoint_actual_value has changed. The definition for the column has increased from varchar2(1000) to varchar2(4000), but more significantly the endpoint_actual_value now holds up to the first 64 characters of the string – so my example above behaves “properly” in 12c – Oracle recognises that the string ‘shortest_thing_in_the_recorded_data’ isn’t in the data set. If you want to recreate the problem you simply need to make the matching sections of the strings 32 characters longer.

Update May 2014

Thanks to a response to an example of this type of problem appearing on OTN (all values distinct, but only after the first 35 bytes) I’ve learned that there’s a patch for this problem (which is fixed in 12.2). The note happens to use a an example with nvarchar2() but that’s not an important variation on the concept: Bug 18377553 : POOR CARDINALITY ESTIMATE WITH HISTOGRAMS AND VALUES > 32 BYTES

[Further reading on Histograms]

Comments (4)

4 Comments »

[…] the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings, this note describes another anomaly that could […]

Pingback by Frequency Histograms – 6 « Oracle Scratchpad — October 20, 2010 @ 8:12 am BST Oct 20,2010 | Reply
[…] Effects of long characters strings in frequency histograms is better: improved. […]

Pingback by 12 Histogram fixes | Oracle Scratchpad — August 4, 2013 @ 6:52 pm BST Aug 4,2013 | Reply
[…] Long strings problem – until 12c Oracle stored at most 32 bytes of a string in the endpoint_actual_value column. […]

Pingback by Faking Histograms | Oracle Scratchpad — October 15, 2018 @ 1:37 pm BST Oct 15,2018 | Reply
[…] the cardinality calculations can go wrong. If two character match over the first few characters the numeric representation of those strings that Oracle uses in a histogram is identical, and if they are long enough even the […]

Pingback by Misdirection | Oracle Scratchpad — December 7, 2018 @ 11:48 am GMT Dec 7,2018 | Reply

RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome. Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Oracle Scratchpad

October 13, 2010