Histogram Upgrade – 2 | Oracle Scratchpad

June 1, 2017

Histogram Upgrade – 2

Filed under: 12c,Histograms,Oracle,Statistics — Jonathan Lewis @ 6:00 pm BST Jun 1,2017

While reading a blog post by Maria Colgan a couple of weeks ago I came across an observation about histograms that I’d not noticed before; worse still, it was a feature that seemed to make some “damage-limitation” advice I’d been giving for years a really bad idea! The threat appeared in these paragraphs:

Setting SIZE REPEAT ensures a histogram will only be created for any column that already has one. If the table is a partitioned table, repeat ensures a histogram will be created for a column that already has one on the global level.

What’s the down side to doing this?

The current number of buckets used in each histogram becomes the limit on the maximum number of buckets used for any histogram created in the future.

Unfortunately I’ve been saying for a very long time that you have to be very careful with histograms and should probably create them through PL/SQL code but if you have some frequency histograms that you’re sure are going to be well-behaved then using “for all columns size repeat” to gather the histogram is probably okay. But, after making the claim above, Maria’s blog posting demonstrated the truth of the claim in a demonstration that showed the highly undesirable consequences.

So imagine this: you create a frequency histogram which happens to produce 26 buckets on a particular column; from then on every time you run the gather with size repeat Oracle tries to generate 26 buckets. One day the data looks a little different, temporarily there are only 25 distinct values so on the next gather you get just 25 buckets – which means that when the “missing” value re-appears 12c will give you a Top-N histogram or even a hybrid histogram (11g would have to give you a height-balanced histogram if it noticed all 26 values when 25 buckets had been requested). It is not safe to use size repeat if the number of distinct values that actually exist can vary from day to day.

I have to say that I was fairly shocked that I’d not come across this threat before – so obviously I created a simple model to check how nasty things could get. I had a copy of 11.2.0.4 handy and created a couple of tables cloning the data from all_objects because that’s got a couple of columns that are good for producing frequency histograms.


rem     Script:         histogram_repeat.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2017

drop table t2;
drop table t1;

create table t1 as select * from all_objects;
create table t2 as select * from t1;

delete from t1 where object_type = 'EDITION';
delete from t1 where object_type = 'EVALUATION CONTEXT';
commit;

pause ================  Baseline =======================================

select  count(distinct object_type), count(distinct owner) from t1;

execute dbms_stats.gather_table_stats(user,'t1',method_opt =>'for columns object_type owner')

select  column_name, count(*)
from    user_tab_histograms
where   table_name = 'T1'
and     column_name in ( 'OBJECT_TYPE','OWNER')
group by column_name
order by column_name
;

select  column_name, num_buckets, histogram
from    user_tab_columns
where   table_name = 'T1'
and     column_name in ( 'OBJECT_TYPE','OWNER')
order by column_name
;

insert into t1 select * from t2 where object_type = 'EDITION';
insert into t1 select * from t2 where object_type = 'EVALUATION CONTEXT';
commit;

After creating the data I’ve deleted a few rows from t1, reported the number of distinct values in t1 for owner and object_type, then gathered stats on just those two columns using the default size. I’ve then reported the number of histogram buckets in two ways, by counting them in user_tab_histograms and by reporting them directly (with histogram type) from user_tab_columns. Then I’ve finished off by re-inserting (copying from t2) the rows I previously deleted, giving me a couple more object_type values in the table. Here are the results of the queries:


================  Baseline =======================================

COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOWNER)
-------------------------- --------------------
                        23                   11

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  17
OWNER                         7

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  17 FREQUENCY
OWNER                         7 FREQUENCY

I’m running on 11.2.0.4 – and I have two frequency histograms that have missed a few of the distinct values. But that’s because on the default settings 11g uses sampling (typically about 5,500 rows for smaller data sets) when creating histograms. So re-running the gather with size repeat shouldn’t allow the number of buckets to grow. Here’s what I got when I re-ran the gather (with size repeat) and two queries a further three times


method_opt =>'for columns object_type size repeat owner size repeat'

================  Repeat 1 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  16
OWNER                         9

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  16 FREQUENCY
OWNER                         9 FREQUENCY
================  Repeat 2 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  18
OWNER                         8

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  18 FREQUENCY
OWNER                         8 FREQUENCY
================  Repeat 3 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  13
OWNER                         9

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  13 FREQUENCY
OWNER                         9 FREQUENCY

On the first repeat I got even fewer buckets; but on the second repeat the number of buckets bounced back up and even exceeded the original count; then on the third repeat the number of buckets dropped significantly. If you run the test your results will probably vary, but that’s the effect of the random selection of rows used to generate the histogram. Key point, though, is that in 11g the number of buckets generated by the gather is not limited by the current number of buckets.

But…

What happens with 12.1.0.2 – here are the results. Remember I deleted two sets of object_type before I gathered the first set of stats, then put them back in before doing the repeat gathers. (The number of distinct object_types in 12c is more than I had in 11g).


================  Baseline =======================================

COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOWNER)
-------------------------- --------------------
                        27                   25

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 1 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 2 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 3 =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY

The number of distinct values for object_type is initially 27, but after gathering stats the first time I added back two more object_type values; but the subsequent gathers stuck with 27 buckets rather than extending to 29 buckets – so the histogram changed from frequency to Top-N. If you check Maria’s blog again you’ll see that this can make a big difference, particularly if the two new values happen to be the lowest and highest values for the column.

The number of buckets on a REPEAT is limited to the number of existing buckets in 12c. That, to my mind, is a major change in behaviour and one you’ll have to watch out for on the upgrade. In 11g if the number of actual values stored dropped briefly the situation was self-correcting; if some new values were introduced the situation was self-correcting – although in both cases the histogram isn’t necessarily telling the truth the way you’d like it because of the small sample sizes used. In 12c the situation doesn’t self-correct. and may introduce a massive change in the arithmetic (as shown in Maria’s example).

The big difference, of course, is that 12c is gathering on a 100% sample using the variation of the approximate_ndv mechanism – so it will always find the right number of values if a frequency histogram is appropriate: presumably this is what was supposed to make it okay to reproduce the number of buckets previously used. In 11g with its small sample size the number of buckets created couldn’t be guaranteed to match the number of distinct values, so I guess the code in 11g wasn’t written to be so rigorous in its assumption about the number of buckets to use next time.

tl;dr

When you upgrade from 11g to 12c think very carefully about whether or not you can still use a “table-level” size repeat to gather histograms – the upgrade may force you to identify specifically the columns that need histograms so that you can name them with an explicit (large enough) size in a gather command.

Footnote:

Don’t forget you can set a table preference for each table specifying a method_opt (though I found this could break on “complex” settings of method_opt in earlier versions); so for columns that need a frequency histogram you could fix a sufficiently large number of buckets by specifying it in the method_opt with a call to dbms_stats.set_table_prefs().

Comments (3)

3 Comments »

Thanks,
great post.
Idea:
Oracle have to make “better” relation between histogram and indexes, actually need some kind of new index, index that not include rows from values with bad selectivity, just ignore that values like what doing with NULL value. Index will be “smaller and faster”.
We can do it in 12c by partial index, but I want “real time” relation/impact between histogram and index.

thanks,
Andjelko Miovcic

Comment by andjelko miovcic — June 6, 2017 @ 3:12 pm BST Jun 6,2017 | Reply
[…] repeat” works differently in 12c from the way it worked in 11g – and there’s a separate blog note about […]

Pingback by Upgrades | Oracle Scratchpad — November 29, 2017 @ 11:15 am GMT Nov 29,2017 | Reply
[…] Footnote 3: Since writing this article I’ve discovered that “size repeat” has what Oracle should have published as a “notable change in behaviour” for 12c and it is no longer a suitable method_opt for histogram collection. See this note. […]

Pingback by Delete Histogram | Oracle Scratchpad — November 29, 2017 @ 11:18 am GMT Nov 29,2017 | Reply

RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome. Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Oracle Scratchpad

June 1, 2017