Oracle Scratchpad

October 19, 2010

Frequency Histograms – 6

Filed under: Histograms,Statistics,Troubleshooting,Tuning — Jonathan Lewis @ 6:01 pm BST Oct 19,2010

In an earlier post on frequency histograms I described how Oracle creates an approximate histogram when dealing with character columns, and warned you that the strategy could lead to a couple of anomalies if you were unlucky. I’ve already published a note about one such anomaly that can occur with fairly long character strings; this note describes another anomaly that can appear in less extreme cases. Again, we start by constructing a data set.


create table t1 (v1 varchar2(42));

insert into t1
select  'next day'
from    all_objects
where   rownum <= 150
;

insert into t1
select  'same day'
from    all_objects
where   rownum <= 350
;

insert into t1
select  'priority high'
from    all_objects
where   rownum <= 500
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                estimate_percent => 100,
                method_opt       => 'for all columns size 254'
        );
end;
/

select
        endpoint_number,
        endpoint_number - nvl(prev_endpoint,0)  frequency,
        hex_val,
        chr(to_number(substr(hex_val, 2,2),'XX')) ||
        chr(to_number(substr(hex_val, 4,2),'XX')) ||
        chr(to_number(substr(hex_val, 6,2),'XX')) ||
        chr(to_number(substr(hex_val, 8,2),'XX')) ||
        chr(to_number(substr(hex_val,10,2),'XX')) ||
        chr(to_number(substr(hex_val,12,2),'XX')),
        endpoint_actual_value
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )                                                       prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val,
                endpoint_actual_value
        from
                user_tab_histograms
        where
                table_name = 'T1'
        and     column_name = 'V1'
        )
order by
        endpoint_number
;

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
            150        150  6E65787420644D6D50DD04A6400000 next d
            650        500  7072696F72698970D6651967200000 priori
           1000        350  73616D6520645C36B7AD479D600000 same d

3 rows selected.

This example captures a histogram at a point when my data set has only three popular values in it, and those values are sufficiently different that their “6-byte encodings” are enough for Oracle to tell the difference between them, so the histogram doesn’t need to use the endpoint_actual_value column.

But I have a problem. During the course of the working day I introduce some data with different status codes, and all the queries I run are about those status codes; but by the end of the day (when, possibly, the stats will be collected again by the automatic stats collection job) all the “interesting” data has been updated to one of the three popular values.

What does Oracle do about queries for my “rare” values? It depends what they look like. Here are a couple of examples from 10.2.0.3:

set autotrace traceonly explain

select
        count(*)
from
        t1
where
        v1 = 'overnight'
;

select
        count(*)
from
        t1
where
        v1 = 'priority low'
;

set autotrace off

Neither of these values is in the histogram, so I would like Oracle to give them a low cardinality estimate. But one of the values is very similar to a value that DOES exist in the histogram, and this leads to an interesting variation in the numbers that appear in the execution plans:

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    11 |     2 |
|   1 |  SORT AGGREGATE    |      |     1 |    11 |       |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |    11 |     2 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='overnight')

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    11 |     2 |
|   1 |  SORT AGGREGATE    |      |     1 |    11 |       |
|*  2 |   TABLE ACCESS FULL| T1   |   500 |  5500 |     2 |
-----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("V1"='priority low')

In the first case (‘overnight’) the optimizer has decided that the value doesn’t exist in the histogram. Since this database is running 10.2.0.3 it has set the resulting cardinality to 1 (in 10.2.0.4 or 11g this would be 75 – half the lowest recorded frequency in the histogram). But in the second case (‘priority low’) the optimizer has done its 6-byte manipulation, decided that ‘priority low’ matches ‘priority high’, and given it a cardinality of 500.
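It’s easy to confirm the collision: the critical information in the endpoint_value is (approximately) the first six bytes of the string, and for these two values those bytes are identical. Here’s a quick sanity check – my addition, not part of the original test – using utl_raw to expose the byte values:

select
        rawtohex(utl_raw.cast_to_raw(substr('priority high',1,6))) high_hex,
        rawtohex(utl_raw.cast_to_raw(substr('priority low' ,1,6))) low_hex
from
        dual
;

Both expressions return 7072696F7269 – matching the first six bytes of the hex_val reported for ‘priority high’ in the histogram output above.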

If Oracle had seen ‘priority low’ when creating the histogram it would have used the endpoint_actual_value column and generated two separate rows for ‘priority low’ and ‘priority high’. But it is perfectly feasible that because of timing (the interesting data exists only briefly during the day) or sampling (the interesting data is tiny compared to the popular values) you could see Oracle taking a “large data” execution path when you know that you’re only expecting to handle a small data set.

The solution, of course, is to write code to create a histogram that represents the situation as you want Oracle to see it.
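For example – and this is just a sketch of my own, not code from the original post, with invented frequencies (10 rows each) for the two rare values – you could use dbms_stats.prepare_column_values() and dbms_stats.set_column_stats() to install a frequency histogram that includes the values Oracle never saw:

declare
        srec            dbms_stats.statrec;
        m_val_array     dbms_stats.chararray;
begin
        -- the distinct values, in order; the matching bkvals entries are
        -- the frequencies the optimizer should assume for each value
        m_val_array := dbms_stats.chararray(
                'next day', 'overnight', 'priority high', 'priority low', 'same day'
        );

        srec.epc    := 5;
        srec.bkvals := dbms_stats.numarray(150, 10, 500, 10, 350);

        dbms_stats.prepare_column_values(srec, m_val_array);

        dbms_stats.set_column_stats(
                ownname => user,
                tabname => 't1',
                colname => 'v1',
                distcnt => 5,
                nullcnt => 0,
                density => 1/(2 * 1020),        -- 1/(2 * num_rows) is a common choice; 1,020 = 1,000 + 20 invented rows
                srec    => srec,
                avgclen => 10                   -- roughly the average column length
        );
end;
/

One thing to check after installing the stats is whether your version records endpoint_actual_value to keep the two colliding ‘priority …’ values apart – re-run the histogram query above to verify.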

Update for 12c – Aug 2013

Although 12c extends the stored string in endpoint_actual_value to 64 characters (in a column now defined as varchar2(4000)), this does not help with this example because that column is still used only if the endpoint_value shows duplicates. The problem above will still appear in 12c.

Bear in mind that this isn’t an Oracle error – how is the stats gathering code supposed to generate a histogram that is correct for data that really isn’t there? This is one of those cases where, as a developer or DBA, you have to understand how your application works and either write code to create the histogram or pick the right time to gather statistics rather than leaving everything to the default collection strategy.
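If you take the “right time” route, remember that the automatic job may overwrite your carefully timed statistics overnight. A possible defence – again just a sketch, not from the original post – is to gather while the interesting data is present and then lock the table stats:

begin
        -- gather while the rare status codes are still in the table
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 't1',
                method_opt => 'for columns v1 size 254'
        );

        -- stop the automatic collection job from replacing the histogram
        dbms_stats.lock_table_stats(user, 't1');
end;
/

(Call dbms_stats.unlock_table_stats() when you do want to refresh them.)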

[Further reading on Histograms]

5 Comments

  1. […] When stats gathering misses the values that would cause endpoint_actual_values to be collected: no change. […]

    Pingback by 12 Histogram fixes | Oracle Scratchpad — August 4, 2013 @ 6:52 pm BST Aug 4,2013

  2. “…how is the stats gathering code supposed to generate a histogram that is correct for data that really isn’t there?” True. A team I’ve worked with understands this well and adds gather_table_stats code where necessary during the course of the day. This is particularly important for “batch jobs” which delete / truncate and reload new data — the date obviously changes but other data may also change day-to-day.

    Comment by Hemant K Chitale — August 6, 2013 @ 7:48 am BST Aug 6,2013

    • Absolutely the best thing to do – and more people should do it. Developers should know the data, and should know the special moments when critical stats need to be refreshed.

      Comment by Jonathan Lewis — August 6, 2013 @ 9:04 am BST Aug 6,2013

  3. Hello Jonathan,
    My question has more to do with Hemant’s comment than with this post so please bear with me ;)
    In a dynamic environment, where truncating and reloading new data is the “normal” way of working, and at a high rate, what would you recommend: using gather_table_stats after every truncate + data reload (possibly with a smaller estimate_percent so that it doesn’t take too long), or maybe dropping all the stats for those tables and using dynamic sampling (with the appropriate level, of course)?

    Gabriel

    Comment by Gabriel — November 26, 2013 @ 11:18 am BST Nov 26,2013

    • Gabriel,

      There’s no generic answer to that question – you have to know your data. There may be a handful of cases to cover an individual site, though, so each of the following could apply to some of your tables:

      a) Create a fixed set of stats for the table (but any sequence- or time-based column would be a threat)
      b) Gather stats on every reload – for 11g use auto_sample_size to get the fast “approximate NDV” – and don’t gather histograms
      c) Include code in the load process that counts as it goes and creates the statistics as part of the load

      And in all cases, look for the few columns where a histogram is needed, and see if you can write code to create it rather than gather it if you think that gathering would be too slow or unstable.

      Comment by Jonathan Lewis — November 29, 2013 @ 10:17 am BST Nov 29,2013
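For reference, here is a minimal sketch of option (b) above – my illustration, not code from the thread, assuming 11g, with t1 standing in for one of the reloaded tables:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                estimate_percent => dbms_stats.auto_sample_size, -- 11g's fast "approximate NDV"
                method_opt       => 'for all columns size 1'     -- size 1: no histograms
        );
end;
/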

