While reading a blog post by Maria Colgan a couple of weeks ago I came across an observation about histograms that I’d not noticed before; worse still, it was a feature that seemed to make some “damage-limitation” advice I’d been giving for years a really bad idea! The threat appeared in these paragraphs:
Setting SIZE REPEAT ensures a histogram will only be created for any column that already has one. If the table is a partitioned table, repeat ensures a histogram will be created for a column that already has one on the global level.
What’s the down side to doing this?
The current number of buckets used in each histogram becomes the limit on the maximum number of buckets used for any histogram created in the future.
Unfortunately I’ve been saying for a very long time that you have to be very careful with histograms and should probably create them through PL/SQL code – but I’ve also said that if you have some frequency histograms that you’re sure are going to be well-behaved then using “for all columns size repeat” to gather them is probably okay. Maria’s blog post went on to demonstrate why that advice can have highly undesirable consequences.
So imagine this: you create a frequency histogram which happens to produce 26 buckets on a particular column; from then on, every time you run a gather with size repeat, Oracle tries to generate 26 buckets. One day the data looks a little different – temporarily there are only 25 distinct values – so on the next gather you get just 25 buckets, which means that when the “missing” value re-appears 12c will give you a Top-N (top-frequency) histogram or even a hybrid histogram (11g would have had to give you a height-balanced histogram if it noticed all 26 values when 25 buckets had been requested). It is not safe to use size repeat if the number of distinct values that actually exist can vary from day to day.
I have to say that I was fairly shocked that I’d not come across this threat before – so obviously I created a simple model to check how nasty things could get. I had a copy of 11.2.0.4 handy and created a couple of tables cloning the data from all_objects because that’s got a couple of columns that are good for producing frequency histograms.
rem
rem     Script:         histogram_repeat.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2017
rem

drop table t2;
drop table t1;

create table t1 as select * from all_objects;
create table t2 as select * from t1;

delete from t1 where object_type = 'EDITION';
delete from t1 where object_type = 'EVALUATION CONTEXT';
commit;

pause ================  Baseline  =======================================

select count(distinct object_type), count(distinct owner) from t1;

execute dbms_stats.gather_table_stats(user,'t1',method_opt =>'for columns object_type owner')

select  column_name, count(*)
from    user_tab_histograms
where   table_name = 'T1'
and     column_name in ( 'OBJECT_TYPE','OWNER')
group by
        column_name
order by
        column_name
;

select  column_name, num_buckets, histogram
from    user_tab_columns
where   table_name = 'T1'
and     column_name in ( 'OBJECT_TYPE','OWNER')
order by
        column_name
;

insert into t1 select * from t2 where object_type = 'EDITION';
insert into t1 select * from t2 where object_type = 'EVALUATION CONTEXT';
commit;
After creating the data I’ve deleted a few rows from t1, reported the number of distinct values in t1 for owner and object_type, then gathered stats on just those two columns using the default size. I’ve then reported the number of histogram buckets in two ways, by counting them in user_tab_histograms and by reporting them directly (with histogram type) from user_tab_columns. Then I’ve finished off by re-inserting (copying from t2) the rows I previously deleted, giving me a couple more object_type values in the table. Here are the results of the queries:
================  Baseline  =======================================

COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOWNER)
-------------------------- --------------------
                        23                   11

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  17
OWNER                         7

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  17 FREQUENCY
OWNER                         7 FREQUENCY
I’m running on 11.2.0.4 – and I have two frequency histograms that have missed a few of the distinct values. That’s because, on the default settings, 11g uses sampling (typically about 5,500 rows for smaller data sets) when creating histograms. So, if the claim is right, re-running the gather with size repeat shouldn’t allow the number of buckets to grow. Here’s what I got when I re-ran the gather (with size repeat) and the two queries a further three times:
method_opt =>'for columns object_type size repeat owner size repeat'

================  Repeat 1  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  16
OWNER                         9

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  16 FREQUENCY
OWNER                         9 FREQUENCY

================  Repeat 2  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  18
OWNER                         8

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  18 FREQUENCY
OWNER                         8 FREQUENCY

================  Repeat 3  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  13
OWNER                         9

COLUMN_NAME             Buckets HISTOGRAM
-------------------- ---------- ---------------
OBJECT_TYPE                  13 FREQUENCY
OWNER                         9 FREQUENCY
On the first repeat I got even fewer buckets; on the second repeat the number of buckets bounced back up and even exceeded the original count; then on the third repeat the number of buckets dropped significantly. If you run the test your results will probably vary, but that’s the effect of the random selection of rows used to generate the histogram. The key point, though, is that in 11g the number of buckets generated by the gather is not limited by the current number of buckets.
But…
So what happens with 12.1.0.2? Here are the results. Remember that I deleted two sets of object_type before I gathered the first set of stats, then put them back in before doing the repeat gathers. (The number of distinct object_types in my 12c database is higher than it was in 11g.)
================  Baseline  =======================================

COUNT(DISTINCTOBJECT_TYPE) COUNT(DISTINCTOWNER)
-------------------------- --------------------
                        27                   25

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 1  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 2  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY

================  Repeat 3  =======================================

COLUMN_NAME            COUNT(*)
-------------------- ----------
OBJECT_TYPE                  27
OWNER                        25

COLUMN_NAME          NUM_BUCKETS HISTOGRAM
-------------------- ----------- ---------------
OBJECT_TYPE                   27 TOP-FREQUENCY
OWNER                         25 FREQUENCY
The number of distinct values for object_type is initially 27, and after gathering stats the first time I added back the two extra object_type values; but the subsequent gathers stuck with 27 buckets rather than extending to 29 buckets – so the histogram changed from frequency to Top-N. If you check Maria’s blog again you’ll see that this can make a big difference, particularly if the two new values happen to be the lowest and highest values for the column.
The number of buckets on a REPEAT is limited to the number of existing buckets in 12c. That, to my mind, is a major change in behaviour and one you’ll have to watch out for on the upgrade. In 11g, if the number of actual values stored dropped briefly the situation was self-correcting; if some new values were introduced the situation was self-correcting – although in both cases the histogram isn’t necessarily telling the truth the way you’d like it because of the small sample sizes used. In 12c the situation doesn’t self-correct, and may introduce a massive change in the arithmetic (as shown in Maria’s example).
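If you want to check whether any of your frequency histograms have already been silently converted after an upgrade, a query like the following (a sketch, run as the owning schema, using the same user_tab_columns view queried above) would list the candidates:

```sql
rem     Sketch: list columns whose histograms are no longer simple
rem     frequency histograms - likely victims of the capped bucket
rem     count on a size repeat gather.

select  table_name, column_name, num_buckets, histogram
from    user_tab_columns
where   histogram in ('TOP-FREQUENCY', 'HYBRID')
order by
        table_name, column_name
;
```

Any column that used to show FREQUENCY here but now shows TOP-FREQUENCY or HYBRID is worth investigating.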
The big difference, of course, is that 12c is gathering on a 100% sample using a variation of the approximate_ndv mechanism – so it will always find the right number of values if a frequency histogram is appropriate: presumably this is what was supposed to make it okay to reproduce the number of buckets previously used. In 11g, with its small sample size, the number of buckets created couldn’t be guaranteed to match the number of distinct values, so I guess the code in 11g wasn’t written to be so rigorous in its assumption about the number of buckets to use next time.
tl;dr
When you upgrade from 11g to 12c think very carefully about whether or not you can still use a “table-level” size repeat to gather histograms – the upgrade may force you to identify specifically the columns that need histograms so that you can name them with an explicit (large enough) size in a gather command.
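As a sketch of what that might look like for the t1 example above – naming the two columns explicitly with a bucket count large enough that a temporary dip in distinct values can’t cap future gathers (254 was the maximum up to 11g; 12c allows up to 2,048):

```sql
rem     Sketch: explicit column list with a generous size, so the
rem     histogram bucket count is never limited by a previous gather.

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 't1',
                method_opt => 'for all columns size 1 for columns size 254 object_type owner'
        );
end;
/
```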
Footnote:
Don’t forget that you can set a table preference for each table specifying a method_opt (though I have found that this could break on “complex” settings of method_opt in earlier versions); so for columns that need a frequency histogram you can fix a sufficiently large number of buckets by specifying it in the method_opt with a call to dbms_stats.set_table_prefs().
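A sketch of such a preference for the t1 table used above might look like this – once set, any default gather against the table (including the automatic stats job) will pick up this method_opt:

```sql
rem     Sketch: fix the method_opt as a table preference so default
rem     gathers use the explicit sizes instead of "size repeat".

begin
        dbms_stats.set_table_prefs(
                ownname => user,
                tabname => 't1',
                pname   => 'METHOD_OPT',
                pvalue  => 'for all columns size 1 for columns size 254 object_type owner'
        );
end;
/
```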
Thanks,
great post.
Idea:
Oracle ought to create a “better” relationship between histograms and indexes – in fact we need a new kind of index, one that excludes rows for values with bad selectivity, ignoring those values the way NULLs are ignored. Such an index would be “smaller and faster”.
We can approximate this in 12c with a partial index, but I’d like a “real time” relationship/impact between histogram and index.
thanks,
Andjelko Miovcic
Comment by andjelko miovcic — June 6, 2017 @ 3:12 pm BST Jun 6,2017 |