Oracle Scratchpad

April 11, 2012

Extended Stats

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 11:37 pm BST Apr 11,2012

I’m very keen on the 11g extended stats feature, but I’ve just discovered a critical weakness in one of the implementation details that could lead to some surprising instability in execution plans. It’s a combination of “column group” statistics and “out of range” predicates. Let’s start with some sample data (note: I have run this test on 11.2.0.4 and 12.1.0.1):

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 10000
)
select
	mod(rownum,100)		col1,
	mod(rownum,10)		col2
from
	generator	v1,
	generator	v2
where
	rownum <= 50000
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

I’ve got 100 distinct values for col1 and 10 distinct values for col2 – so the optimizer will assume that I’ve got 1,000 distinct combinations of the two. Since there are 50,000 rows this means 50 rows per combination. So let’s see this demonstrated:

set autotrace traceonly explain

select	*
from	t1
where
	col1 = 50
and	col2 = 5
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=5)
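
If you want to check the numbers that the optimizer is working from, they are all visible in the data dictionary; something like the following query shows the two num_distinct figures behind the 1/100 * 1/10 = 1/1,000 selectivity that turns 50,000 rows into an estimate of 50:

select
	column_name, num_distinct, density, num_nulls
from
	user_tab_cols
where
	table_name = 'T1'
order by
	column_name
;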

If you examine my data definition, though, you will see that the col1 and col2 values align in a way that means there are really only 100 distinct combinations, with 500 rows per combination. This looks like one of those cases where extended statistics might be useful. So let’s create a suitable set:

begin
	dbms_output.put_line(
		dbms_stats.create_extended_stats(
			ownname		=> user,
			tabname		=> 'T1',
			extension	=> '(col1, col2)'
		)
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		block_sample 	 => true,
		method_opt 	 => 'for all hidden columns size 1'
	);
end;
/
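
The call to create_extended_stats echoes the name of the generated extension (typically something like SYS_STU followed by an encoded string), and the column group appears in the data dictionary as a hidden virtual column. A couple of quick queries along the following lines should confirm that the extension exists and that its num_distinct is 100:

select
	extension_name, extension
from
	user_stat_extensions
where
	table_name = 'T1'
;

select
	column_name, num_distinct
from
	user_tab_cols
where
	table_name = 'T1'
and	hidden_column = 'YES'
;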

With these stats in place, I repeat my query with the following effect:


Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500 |  3000 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |   500 |  3000 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=5)

Oracle has used the “column group” statistics to derive a good estimate of the combined selectivity and produced an estimate of 500 rows. But what happens as new data (following the same pattern) is entered and the queries change to keep up with the data? Consider, for example, the slightly artificial query:

select	*
from	t1
where
	col1 = 50
and	col2 = 9.001
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=9.001)

I’ve used the value 9.001 because it is only just outside the known value range for col2 (which ranges from 0 to 9) – and the optimizer has immediately switched from the extended stats back to the single column stats and the “independent predicates” mechanism to estimate the cardinality, leading to a dramatic change in the estimate.
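
If you want to see the boundary that the optimizer is checking against, you can decode the stored low and high values for the two columns; since they are numeric, a call to utl_raw.cast_to_number does the job, and for col2 it should report 0 and 9, leaving 9.001 just out of range:

select
	column_name,
	utl_raw.cast_to_number(low_value)	low_value,
	utl_raw.cast_to_number(high_value)	high_value
from
	user_tab_cols
where
	table_name = 'T1'
and	column_name in ('COL1', 'COL2')
;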

This strategy may be a deliberate choice on the part of the developers, but personally I think that it’s not appropriate. If you’ve created “column group” stats like this it seems reasonably likely that you’re trying to show Oracle a consistent pattern in the data; however it’s quite possible that the data you’re using shows some sort of time-dependency such that the pattern doesn’t change, but the actual values in (at least one of) the columns keep increasing.

In this case, applying the normal “linear decay” strategy to the extended stats seems much more sensible – if one (or both) of the predicates is slightly out of range then use the extended stats to derive an estimate, and then apply the decay factor(s) to that value rather than falling back on multiplying the individual selectivities.
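
To put some rough numbers on that suggestion (and assuming I’ve remembered the decay arithmetic correctly): for an equality predicate the selectivity decays linearly with the distance beyond the known range, scaled by the width of the range, so the value 9.001 against col2 (low 0, high 9) would attract a decay factor of roughly 1 - 0.001/9. Applied to the column group estimate that would leave the cardinality at essentially 500 rows instead of collapsing to 50:

select
	round( 500 * (1 - (9.001 - 9) / (9 - 0)) )	adjusted_estimate
from
	dual
;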

(The case where I saw this problem was one where rows were constantly being inserted and deleted. One column was a time-based audit id, the other was an action id, and every audit id produced a few dozen action ids, with some overlap of actions across audit ids. New, higher-valued audit rows were constantly being added to the table while old, lower-valued ones were being deleted. The effect was that the extended stats became useless within seconds of being generated.)

Footnote:

Coincidentally, Randolf Geist highlighted another edge case a couple of days ago – but it hasn’t stopped me thinking that extended stats, and “column group” stats especially, are a terrific feature.

Update (Feb 2014)

Christian Antognini has come up with another example of how extended stats might not behave as you expect; in this case relating to side effects of histograms. (With a reference to Bug 6972291.)

23 Comments »

  1. Hi Jonathan,

    This behavior is expected and it was a deliberate design decision to prevent over-estimations when one of the values supplied is ‘out of range’. We can’t ignore the ‘out of range’ scenario just because we have a column group. Extended statistics do not contain the min, max values for the column group so we rely on the individual column statistics to check for ‘out of range’ scenarios like yours. When one of the columns is ‘out of range’, we revert to the column statistics because we know they are going to generate a lower selectivity, and if one of the columns is ‘out of range’ then the number of rows returned will be lower, or none at all, as in your example.

    Thanks,
    Maria

    Comment by Maria Colgan — April 12, 2012 @ 12:47 am BST Apr 12,2012 | Reply

    • Maria,

      Thanks for the feedback. I’d like to present an argument for a different strategy.

      Since I’m only likely to create a column group because I know that the product of the two selectivities is going to be wrong, I would expect that reverting back to the product of selectivities will still be wrong if the data moves out of range.

      It seems more appropriate (or more consistent with the linear decay strategy) to stick with the column group selectivity, but work out the separate linear decay factors of the individual columns and apply their product to the column group selectivity.

      It crosses my mind, also, that anyone who is NOT using the “approximate NDV” approach (i.e. auto_sample_size) of 11g may find that they don’t always capture the min and max values when they gather statistics – which means they could run queries that ARE within the limits, but don’t appear to be, potentially resulting in extreme changes in plans for “no apparent reason” after they’ve collected an unlucky set of statistics. (And in this case I’m not thinking about data sets where one or more of the columns suffers from the problem of continually increasing values.)

      Comment by Jonathan Lewis — April 12, 2012 @ 8:58 am BST Apr 12,2012 | Reply

  2. Thanks for pointing this out, Jonathan. Yet another case for making manual adjustments to the high_value of columns – as we all should do manually, not automatically ;-)

    Maria,
    a question/wish to the CBO dev team :) It would be a nice feature to add a data dictionary flag for columns that have always-increasing values, so that the CBO will not do “linear decay” for them. It’s quite easy to detect such columns automatically in the automatic stats job, and it would save many routine actions in the stats management area.

    Comment by Timur Akhmadeev — April 12, 2012 @ 7:54 am BST Apr 12,2012 | Reply

    • Hi,

      +1 for Timur. It was my first idea a few moments after linear decay was introduced (I think it appeared in 10gR1, patchset 2 or 3). Linear decay is a nice feature, unless the PK of the table is based on sysdate/sequence etc.

      Comment by Pavol Babel — April 12, 2012 @ 8:09 pm BST Apr 12,2012 | Reply

  3. It may in fact be useful to have a column attribute, whether or not associated with column groups, that indicates the actual high and/or low values should be tracked in the dictionary. While recognizing the danger of indiscriminately adding a dictionary update to routine inserts and deletes, it is the sort of feature where it is difficult to do an unbiased gedanken experiment because real world data varies so much. Presumably one update per commit would be enough, since you’re making a promise to the optimizer, not the user, about the exact high and/or low value. In JL’s “somewhat artificial” query the cardinality estimate would then be zero if the high value of col2 were being tracked, correct?

    mwf

    Comment by Mark Farnham — April 12, 2012 @ 1:27 pm BST Apr 12,2012 | Reply

  4. I should have written “at most one update per commit”, since you would only update if the high value being tracked was exceeded or the lowest value and all its ties were removed. Yet another case where delete would be more expensive than insert if you had to check for ties.

    Comment by Mark Farnham — April 12, 2012 @ 1:30 pm BST Apr 12,2012 | Reply

  5. “If you examine my data definition, though, you will see that the col1 and col2 values align in a way that means there are really only 100 distinct combinations, with 500 rows per combination.”

    Where there is correlation between columns, it probably indicates some inadequately modelled functional dependency that possibly should have been normalised. I have a general concern with introducing optimiser features that attempt to smooth out design issues, because as far as I can observe historically this encourages more bad designs.

    Comment by Adrian — April 13, 2012 @ 12:58 pm BST Apr 13,2012 | Reply

    • Adrian,

      The code I used to generate the data was simply a mechanism to ensure that I had two columns where the product of selectivities was not the selectivity of the combination, so I wouldn’t worry too much about arguing whether that specific model was right or wrong.

      In general, though, I think you are mixing correlation and causation a little in your assessment: two columns can be correlated because they are dependent on the same key; consider, for example, a delivery service that records a pick-up date and a delivery date against each parcel.

      Comment by Jonathan Lewis — April 15, 2012 @ 6:21 pm BST Apr 15,2012 | Reply

  6. Gee whiz, I think it is pretty common to have some correlations that are NOT appropriate to model. Any values that tend, for example, to always go up are correlated in a way that the optimizer might put to good use. But they might go up at different rates, and there may be nothing you could constrain or describe in a model, other than the relationship of the actual values, simply because both going up all the time is a coincidence, not a business rule.

    Comment by Mark Farnham — April 13, 2012 @ 2:34 pm BST Apr 13,2012 | Reply

  7. Possibly some truth in what you say there. But often things are unnecessarily forced into this category by the deliberate and sometimes inappropriate use of things that we know will fit this category and that could be easily avoided at design. “Things that go up all the time” in my experience are often not coincidental but rather a part of some physical design/implementation choices that could easily have been avoided, often as a function of applying arbitrary surrogate identifiers. But I was trying to avoid straying into that debate by being vague.

    Comment by Adrian — April 13, 2012 @ 2:41 pm BST Apr 13,2012 | Reply

  8. The logic of my point is that you’d have to win the debate that there are NOT correlations in real world data that are inappropriate to model or constrain to be on firm ground in contending this is not a useful optimizer feature. That is very different from often observing correlations that are an artifact of bad design. I have no doubt you see them a lot. I vote for good models, Mom, and apple pie. I think the potential effect of extended statistics and column groups as they are perfected is likely to be immense for models good and bad.

    Comment by Mark Farnham — April 13, 2012 @ 3:16 pm BST Apr 13,2012 | Reply

  9. Maybe. Perhaps I’d only have to win the debate that there are not sufficient cases in a sensibly modelled situation to warrant development effort having been spent “solving” the “problem” rather than, say, fixing some more of the bugs in existing functionality ;). Especially if the feature was going to deliberately not work (as indicated) in what may turn out to be the most common realisation of this scenario.
    Possibly I shouldn’t have posted this observation here because it’s tangential to the point being made anyway.
    I am intrigued at the idea of correlations that shouldn’t be modelled, though: surely all non-key attributes should be dependent only on the key rather than on each other, so aside from components in a composite identifier or the rare-ish cases where there are multiple candidates for a key, that just leaves design errors.

    Comment by Adrian — April 13, 2012 @ 3:44 pm BST Apr 13,2012 | Reply

  10. [...] Jonathan Lewis just discovered a critical weakness in one of the implementation details of 11g extended stats that could lead to some surprising instability in execution plans. [...]

    Pingback by Log Buffer #267, A Carnival of the Vanities for DBAs | The Pythian Blog — April 14, 2012 @ 12:19 am BST Apr 14,2012 | Reply

  11. “In general, though, I think you are mixing correlation and causation a little in your assessment: two columns can be correlated because they are dependent on the same key; consider, for example, a delivery service that records a pick-up date and a delivery date against each parcel.”

    Sure. But surely relying on a correlation rather than causation is problematic? Isn’t creating stats to represent a non-causative relationship spectacularly falling prey to a logical fallacy? What if some absent, countervailing variable changes? Such as “time” no longer necessarily advancing because the company starts delivering across time zones but continues to use a system that records only local delivery times without a timezone? Or starts storing goods for long-dated future deliveries? Or hands off parts of the logistics to a third party and ends up with a large number of completed deliveries with no recorded delivery time? Or an artificial relationship caused by ascending keys is suddenly invalidated by a sequence wrap?

    Would it not be better to enable query efficiency by modelling and constraining known relationships rigorously and encouraging the optimiser to take advantage of defined constraints? For example: a priority parcel delivery has an SLA of two working days. If we’re querying for unsuccessful priority deliveries we instantly know something about the relationship of the dates that would be more expensive and complicated to capture automatically by correlation, and when the SLA changes I know which set of assumptions is quicker to recalculate. If we know enough about a data relationship to create a column group over it, perhaps it’s better to let us apply the knowledge we have rather than have the optimiser need to recalculate it for us (via expensive disk reads).

    Comment by Adrian — April 23, 2012 @ 10:12 am BST Apr 23,2012 | Reply

    • Adrian,
      Your earlier objection was that:

      “Where there is correlation between columns, it probably indicates some inadequately modelled functional dependency that possibly should have been normalised.”

      Now that I’ve pointed out that it is perfectly feasible to find correlation between measurements that have a common cause (which isn’t a logical fallacy, by the way), your argument is that the business definition may change. But if you change the business definition then you should go back and review the data model – and that’s where any information about the impact on the previously recognised correlation would be re-considered.

      Comment by Jonathan Lewis — April 23, 2012 @ 1:47 pm BST Apr 23,2012 | Reply

    • And if correlations change, not because of a fundamental change in the business that should be remodeled, but simply because of a change in actual operations, that is exactly why capturing the information about the correlations as a fact of statistics is a good thing. As metadata, if it becomes no longer useful it can simply go away, and as with all statistics if they become stale or misleading they can be fixed.

      Comment by Mark Farnham — April 23, 2012 @ 1:55 pm BST Apr 23,2012 | Reply

  12. The logical fallacy I was pointing to was the inference that an observed correlation has an immutable cause which is reliable enough for us to base our performance strategy on (or anything on for that matter). For example the correlation between house prices and time caused a certain amount of consternation recently for folks who should’ve known better. On the other hand if we know for certain there is an actual correlation then my argument is that there are more efficient ways of helping out an optimiser than by having it collect stats, which we all know to be an expensive operation in a system of any size, simply to learn about exactly the relationship of which we are already aware. Anyway, thanks both for indulging me with replies. I get the feeling we’re not going to agree.

    Comment by Adrian — April 24, 2012 @ 3:04 pm BST Apr 24,2012 | Reply

  13. Hi Jonathan,
    Referring to your book CBO on page 269, “Unfortunately there are problems that still need addressing—lots of them. Let’s try to find
    a few questions about the limitations of the formulae.
    • What are you supposed to do if you have two or more join columns?”

    I would like to know whether the standard join cardinality formula is applicable if I have more than one join column. Could you please explain join cardinality for 3 or more join columns?

    When I tried with the following SQL, I couldn’t do it.

    SQL> @xplan
    Plan hash value: 4119620020

    -----------------------------------------------------------------------------------------------------------
    | Id  | Operation                    | Name               | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |                    | 96199 |    53M|       |   114K  (1)| 00:08:48 |
    |*  1 |  HASH JOIN                   |                    | 96199 |    53M|  4984K|   114K  (1)| 00:08:48 |
    |*  2 |   TABLE ACCESS BY INDEX ROWID| PS_PC_RES_PA_TA121 | 96199 |  3851K|       |  6517   (1)| 00:00:31 |
    |*  3 |    INDEX RANGE SCAN          | PSAPC_RES_PA_TA121 |   299K|       |       |   227   (1)| 00:00:02 |
    |*  4 |   TABLE ACCESS BY INDEX ROWID| PS_PC_RES_PA_TA021 |   785K|   403M|       | 72751   (1)| 00:05:36 |
    |*  5 |    INDEX RANGE SCAN          | PSAPC_RES_PA_TA021 |  2902K|       |       |  2308   (1)| 00:00:11 |
    -----------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - access("PR"."BUSINESS_UNIT_PO"="PRT"."BUSINESS_UNIT_PO" AND "PR"."PO_ID"="PRT"."PO_ID" AND
                  "PR"."LINE_NBR"="PRT"."LINE_NBR" AND "PR"."SCHED_NBR"="PRT"."SCHED_NBR" AND
                  "PR"."DISTRIB_LINE_NUM"="PRT"."DISTRIB_LINE_NUM" AND "PR"."DST_ACCT_TYPE"="PRT"."DST_ACCT_TYPE")
       2 - filter("PRT"."ANALYSIS_TYPE"='CCR' OR "PRT"."ANALYSIS_TYPE"='CRV')
       3 - access("PRT"."PROCESS_INSTANCE"=28022762)
       4 - filter("PR"."ANALYSIS_TYPE"='CCR' OR "PR"."ANALYSIS_TYPE"='CRV')
       5 - access("PR"."PROCESS_INSTANCE"=28022762)

    Join Predicates & NDVs

    PR.BUSINESS_UNIT_PO = 219
    PR.PO_ID = 308320
    PR.LINE_NBR = 142
    PR.SCHED_NBR = 24
    PR.DISTRIB_LINE_NUM = 56
    PR.DST_ACCT_TYPE = 3

    PRT.BUSINESS_UNIT_PO = 177
    PRT.PO_ID = 139136
    PRT.LINE_NBR = 133
    PRT.SCHED_NBR = 23
    PRT.DISTRIB_LINE_NUM = 42
    PRT.DST_ACCT_TYPE = 2

    Filter Predicates & NDVs

    PRT.ANALYSIS_TYPE = 3
    PR.ANALYSIS_TYPE = 5

    Table row count

    PS_PC_RES_PA_TA121 PRT Rows= 299160
    PS_PC_RES_PA_TA021 PR Rows= 2902548

    Thanks

    Comment by orapsdba — June 5, 2012 @ 12:55 am BST Jun 5,2012 | Reply

    • orapsdba,

      The first thing to look at is the meaning of “num_distinct” in the join selectivity formula. To understand what this is, you need to read the work on “selection without replacement” done by Alberto Dell’Era.

      Once you can calculate the correct selectivity for a single predicate, the selectivity for multiple predicates AND’ed together on a single join is the product of the separate selectivities – but there are several special cases, including the multi-column sanity check, the index sanity check, a limit of 1/num_rows, the effects of histograms, and the effects of Oracle transforming your query through transitive closure: so you do need to check that you are applying the calculations to the right columns.

      Depending on the version of Oracle there are also a couple of subtle bugs in the multi-column calculations that can make a (probably) small difference to the results.

      Comment by Jonathan Lewis — June 7, 2012 @ 5:44 pm BST Jun 7,2012 | Reply

  14. [...] 11g introduced a feature known as “extended stats” which allows you to create stats on expressions and on column groups. Once you’ve created [...]

    Pingback by Usage Stats « Oracle Scratchpad — January 24, 2013 @ 7:00 pm BST Jan 24,2013 | Reply

  15. […] (it works on 11g, too). All it does is create a table by copying from a well-known table, gather extended stats on a column group, then show you the resulting column names by querying view […]

    Pingback by Extended Stats | Oracle Scratchpad — October 4, 2013 @ 9:40 am BST Oct 4,2013 | Reply

  16. I’ve just added a link to Christian Antognini’s recent post on an odd feature of mixing extended stats and histograms.

    Comment by Jonathan Lewis — February 23, 2014 @ 10:28 am BST Feb 23,2014 | Reply

  17. […] that came up in an OTN thread earlier on this week [ed: actually 8th Jan 2014]. When you create column group stats, Oracle uses an undocumented function sys_op_combined_hash() to create a hash value, and if you […]

    Pingback by Extended stats | Oracle Scratchpad — May 4, 2014 @ 1:24 pm BST May 4,2014 | Reply

