Oracle Scratchpad

April 11, 2012

Extended Stats

Filed under: CBO,Oracle,Statistics — Jonathan Lewis @ 11:37 pm BST Apr 11,2012

I’m very keen on the 11g extended stats feature, but I’ve just discovered a critical weakness in one of the implementation details that could lead to some surprising instability in execution plans. It’s a combination of “column group” statistics and “out of range” predicates. Let’s start with some sample data (note: I have run this test on 11.2.0.4 and 12.1.0.1):

create table t1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 10000
)
select
	mod(rownum,100)		col1,
	mod(rownum,10)		col2
from
	generator	v1,
	generator	v2
where
	rownum <= 50000
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

I’ve got 100 distinct values for col1 and 10 distinct values for col2 – so the optimizer will assume that I’ve got 1,000 distinct combinations of the two. Since there are 50,000 rows this means 50 rows per combination. So let’s see this demonstrated:

set autotrace traceonly explain

select	*
from	t1
where
	col1 = 50
and	col2 = 5
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=5)
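
If you want to check the numbers that the optimizer is working from, they are all visible in the data dictionary; something like the following query shows the two num_distinct figures behind the 1/100 * 1/10 = 1/1,000 selectivity that turns 50,000 rows into an estimate of 50:

select
	column_name, num_distinct, density, num_nulls
from
	user_tab_cols
where
	table_name = 'T1'
order by
	column_name
;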

If you examine my data definition, though, you will see that the col1 and col2 values align in a way that means there are really only 100 distinct combinations, with 500 rows per combination. This looks like one of those cases where extended statistics might be useful. So let’s create a suitable set:

begin
	dbms_output.put_line(
		dbms_stats.create_extended_stats(
			ownname		=> user,
			tabname		=> 'T1',
			extension	=> '(col1, col2)'
		)
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		block_sample 	 => true,
		method_opt 	 => 'for all hidden columns size 1'
	);
end;
/
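
The call to create_extended_stats echoes the name of the generated extension (typically something like SYS_STU followed by an encoded string), and the column group appears in the data dictionary as a hidden virtual column. A couple of quick queries along the following lines should confirm that the extension exists and that its num_distinct is 100:

select
	extension_name, extension
from
	user_stat_extensions
where
	table_name = 'T1'
;

select
	column_name, num_distinct
from
	user_tab_cols
where
	table_name = 'T1'
and	hidden_column = 'YES'
;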

With these stats in place, I repeat my query with the following effect:


Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500 |  3000 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |   500 |  3000 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=5)

Oracle has used the “column group” statistics to derive a good estimate of the combined selectivity and produced an estimate of 500 rows. But what happens as new data (following the same pattern) is entered and the queries change to keep up with the data? Consider, for example, the slightly artificial query:

select	*
from	t1
where
	col1 = 50
and	col2 = 9.001
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    50 |   300 |    13 |
|*  1 |  TABLE ACCESS FULL| T1   |    50 |   300 |    13 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("COL1"=50 AND "COL2"=9.001)

I’ve used the value 9.001 because it is only just outside the known value range for col2 (which ranges from 0 to 9) – and the optimizer has immediately switched from the extended stats back to the single column stats and the “independent predicates” mechanism to estimate the cardinality, leading to a dramatic change in the estimate.
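
If you want to see the boundary that the optimizer is checking against, you can decode the stored low and high values for the two columns; since they are numeric, a call to utl_raw.cast_to_number does the job, and for col2 it should report 0 and 9, leaving 9.001 just out of range:

select
	column_name,
	utl_raw.cast_to_number(low_value)	low_value,
	utl_raw.cast_to_number(high_value)	high_value
from
	user_tab_cols
where
	table_name = 'T1'
and	column_name in ('COL1', 'COL2')
;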

This strategy may be a deliberate choice on the part of the developers, but personally I think that it’s not appropriate. If you’ve created “column group” stats like this it seems reasonably likely that you’re trying to show Oracle a consistent pattern in the data; however it’s quite possible that the data you’re using shows some sort of time-dependency such that the pattern doesn’t change, but the actual values in (at least one of) the columns keep increasing.

In this case, applying the normal “linear decay” strategy to the extended stats seems much more sensible – if one (or both) of the predicates is slightly out of range then use the extended stats to derive an estimate, and then apply the decay factor(s) to that value rather than falling back on multiplying the individual selectivities.
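
To put some rough numbers on that suggestion (and assuming I’ve remembered the decay arithmetic correctly): for an equality predicate the selectivity decays linearly with the distance beyond the known range, scaled by the width of the range, so the value 9.001 against col2 (low 0, high 9) would attract a decay factor of roughly 1 - 0.001/9. Applied to the column group estimate that would leave the cardinality at essentially 500 rows instead of collapsing to 50:

select
	round( 500 * (1 - (9.001 - 9) / (9 - 0)) )	adjusted_estimate
from
	dual
;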

(The case where I saw this problem was one where rows were constantly being inserted and deleted. One column was a time-based audit id, the other was an action id, and every audit id produced a few dozen action ids, with some overlap of actions across audit ids. New, higher-valued audit rows were constantly being added to the table while old, lower-valued ones were being deleted. The effect was that the extended stats became useless within seconds of being generated.)

Footnote:

Coincidentally, Randolf Geist highlighted another edge case a couple of days ago – but it hasn’t stopped me thinking that extended stats, and “column group” stats especially, are a terrific feature.

Update (Feb 2014)

Christian Antognini has come up with another example of how extended stats might not behave as you expect; in this case relating to side effects of histograms. (With a reference to Bug 6972291.)

23 Comments »

  1. Hi Jonathan,

    This behavior is expected and it was a deliberate design decision to prevent over-estimations when one of the values supplied is ‘out of range’. We can’t ignore the ‘out of range’ scenario just because we have a column group. Extended statistics do not contain the min, max values for the column group so we rely on the individual column statistics to check for ‘out of range’ scenarios like yours. When one of the columns is ‘out of range’, we revert to the column statistics because we know they are going to generate a lower selectivity, and if one of the columns is ‘out of range’ then the number of rows returned will be lower, or none at all, as in your example.

    Thanks,
    Maria

    Comment by Maria Colgan — April 12, 2012 @ 12:47 am BST Apr 12,2012 | Reply

    • Maria,

      Thanks for the feedback. I’d like to present an argument for a different strategy.

      Since I’m only likely to create a column group because I know that the product of the two selectivities is going to be wrong, I would expect that reverting back to the product of selectivities will still be wrong if the data moves out of range.

      It seems more appropriate (or more consistent with the linear decay strategy) to stick with the column group selectivity, but work out the separate linear decay factors of the individual columns and apply their product to the column group selectivity.

      It crosses my mind, also, that anyone who is NOT using the “approximate NDV” approach (i.e. auto_sample_size) of 11g may find that they don’t always capture the min and max values when they gather statistics – which means they could run queries that ARE within the limits, but don’t appear to be, potentially resulting in extreme changes in plans for “no apparent reason” after they’ve collected an unlucky set of statistics. (And in this case I’m not thinking about data sets where one or more of the columns suffers from the problem of continually increasing values.)

      Comment by Jonathan Lewis — April 12, 2012 @ 8:58 am BST Apr 12,2012 | Reply

  2. Thanks for pointing this out, Jonathan. Yet another case for making manual adjustments to the high_value of columns – as we all should do manually, not automatically ;-)

    Maria,
    a question/wish to the CBO dev team :) It would be a nice feature to add a data dictionary flag for columns that have always-increasing values, so that the CBO will not do “linear decay” for them. It’s quite easy to detect such columns automatically in the automatic stats job, and it would save many routine actions in the stats management area.

    Comment by Timur Akhmadeev — April 12, 2012 @ 7:54 am BST Apr 12,2012 | Reply

    • Hi,

      +1 for Timur. It was my first idea a few moments after linear decay was introduced (I think it appeared in 10gR1, patchset 2 or 3). Linear decay is a nice feature, unless the PK of the table is based on sysdate/sequence etc.

      Comment by Pavol Babel — April 12, 2012 @ 8:09 pm BST Apr 12,2012 | Reply

  3. It may in fact be useful to have a column attribute, whether or not associated with column groups, that indicates the actual high and/or low values should be tracked in the dictionary. While recognizing the danger of indiscriminately adding a dictionary update to routine inserts and deletes, it is the sort of feature where it is difficult to do an unbiased gedanken experiment because real world data varies so much. Presumably one update per commit would be enough, since you’re making a promise to the optimizer, not the user, about the exact high and/or low value. In JL’s “somewhat artificial” query the cardinality estimate would then be zero if the high value of col2 were being tracked, correct?

    mwf

    Comment by Mark Farnham — April 12, 2012 @ 1:27 pm BST Apr 12,2012 | Reply

  4. I should have written “at most one update per commit”, since you would only update if the high value being tracked was exceeded or the lowest value and all its ties were removed. Yet another case where delete would be more expensive than insert if you had to check for ties.

    Comment by Mark Farnham — April 12, 2012 @ 1:30 pm BST Apr 12,2012 | Reply

  5. “If you examine my data definition, though, you will see that the col1 and col2 values align in a way that means there are really only 100 distinct combinations, with 500 rows per combination.”

    Where there is correlation between columns, it probably indicates some inadequately modelled functional dependency that possibly should have been normalised. I have a general concern with introducing optimiser features that attempt to smooth out design issues, because as far as I can observe historically this encourages more bad designs.

    Comment by Adrian — April 13, 2012 @ 12:58 pm BST Apr 13,2012 | Reply

    • Adrian,

      The code I used to generate the data was simply a mechanism to ensure that I had two columns where the product of selectivities was not the selectivity of the combination, so I wouldn’t worry too much about arguing whether that specific model was right or wrong.

      In general, though, I think you are mixing correlation and causation a little in your assessment: two columns can be correlated because they are dependent on the same key; consider, for example, a delivery service that records a pick-up date and a delivery date against each parcel.

      Comment by Jonathan Lewis — April 15, 2012 @ 6:21 pm BST Apr 15,2012 | Reply

  6. Gee whiz, I think it is pretty common to have some correlations that are NOT appropriate to model. Any values that tend, for example, to always go up are correlated in a way that the optimizer might put to good use. But they might go up at different rates, and there may be nothing you could constrain or describe in a model, other than the relationship of the actual values, simply because both going up all the time is a coincidence, not a business rule.

    Comment by Mark Farnham — April 13, 2012 @ 2:34 pm BST Apr 13,2012 | Reply

  7. Possibly some truth in what you say there. But often things are unnecessarily forced into this category by the deliberate and sometimes inappropriate use of things that we know will fit this category and that could be easily avoided at design. “Things that go up all the time” in my experience are often not coincidental but rather a part of some physical design/implementation choices that could easily have been avoided, often as a function of applying arbitrary surrogate identifiers. But I was trying to avoid straying into that debate by being vague.

    Comment by Adrian — April 13, 2012 @ 2:41 pm BST Apr 13,2012 | Reply

  8. The logic of my point is that you’d have to win the debate that there are NOT correlations in real world data that are inappropriate to model or constrain to be on firm ground in contending this is not a useful optimizer feature. That is very different from often observing correlations that are an artifact of bad design. I have no doubt you see them a lot. I vote for good models, Mom, and apple pie. I think the potential effect of extended statistics and column groups as they are perfected is likely to be immense for models good and bad.

    Comment by Mark Farnham — April 13, 2012 @ 3:16 pm BST Apr 13,2012 | Reply

  9. Maybe. Perhaps I’d only have to win the debate that there are not sufficient cases in a sensibly modelled situation to warrant development effort having been spent “solving” the “problem” rather than, say, fixing some more of the bugs in existing functionality ;). Especially if the feature was going to deliberately not work (as indicated) in what may turn out to be the most common realisation of this scenario.
    Possibly I shouldn’t have posted this observation here because it’s tangential to the point being made anyway.
    I am intrigued at the idea of correlations that shouldn’t be modelled, though: surely all non-key attributes should be dependent only on the key rather than on each other, so aside from components in a composite identifier or the rare-ish cases where there are multiple candidates for a key, that just leaves design errors.

    Comment by Adrian — April 13, 2012 @ 3:44 pm BST Apr 13,2012 | Reply

  10. [...] Jonathan Lewis just discovered a critical weakness in one of the implementation details of 11g extended stats that could lead to some surprising instability in execution plans. [...]

    Pingback by Log Buffer #267, A Carnival of the Vanities for DBAs | The Pythian Blog — April 14, 2012 @ 12:19 am BST Apr 14,2012 | Reply

  11. “In general, though, I think you are mixing correlation and causation a little in your assessment: two columns can be correlated because they are dependent on the same key; consider, for example, a delivery service that records a pick-up date and a delivery date against each parcel.”

    Sure. But surely relying on a correlation rather than causation is problematic? Isn’t creating stats to represent a non-causative relationship spectacularly falling prey to a logical fallacy? What if some absent, countervailing variable changes? Such as “time” no longer necessarily advancing because the company starts delivering across time zones but continues to use a system that records only local delivery times without a timezone? Or starts storing goods for long-dated future deliveries? Or hands off parts of the logistics to a third party and ends up with a large number of completed deliveries with no recorded delivery time? Or an artificial relationship caused by ascending keys is suddenly invalidated by a sequence wrap?

    Would it not be better to enable query efficiency by modelling and constraining known relationships rigorously and encouraging the optimiser to take advantage of defined constraints? For example: a priority parcel delivery has an SLA of two working days. If we’re querying for unsuccessful priority deliveries we instantly know something about the relationship of the dates that would be more expensive and complicated to capture automatically by correlation, and when the SLA changes I know which set of assumptions is quicker to recalculate. If we know enough about a data relationship to create a column group over it, perhaps it’s better to let us apply the knowledge we have rather than have the optimiser need to recalculate it for us (via expensive disk reads).

    Comment by Adrian — April 23, 2012 @ 10:12 am BST Apr 23,2012 | Reply

    • Adrian,
      Your earlier objection was that:

      “Where there is correlation between columns, it probably indicates some inadequately modelled functional dependency that possibly should have been normalised.”

      Now that I’ve pointed out that it is perfectly feasible to find correlation between measurements that have a common cause (which isn’t a logical fallacy, by the way), your argument is that the business definition may change. But if you change the business definition then you should go back and review the data model – and that’s where any information about the impact on the previously recognised correlation would be re-considered.

      Comment by Jonathan Lewis — April 23, 2012 @ 1:47 pm BST Apr 23,2012 | Reply

    • And if correlations change, not because of a fundamental change in the business that should be remodeled, but simply because of a change in actual operations, that is exactly why capturing the information about the correlations as a fact of statistics is a good thing. As metadata, if it becomes no longer useful it can simply go away, and as with all statistics if they become stale or misleading they can be fixed.

      Comment by Mark Farnham — April 23, 2012 @ 1:55 pm BST Apr 23,2012 | Reply

  12. The logical fallacy I was pointing to was the inference that an observed correlation has an immutable cause which is reliable enough for us to base our performance strategy on (or anything on for that matter). For example the correlation between house prices and time caused a certain amount of consternation recently for folks who should’ve known better. On the other hand if we know for certain there is an actual correlation then my argument is that there are more efficient ways of helping out an optimiser than by having it collect stats, which we all know to be an expensive operation in a system of any size, simply to learn about exactly the relationship of which we are already aware. Anyway, thanks both for indulging me with replies. I get the feeling we’re not going to agree.

    Comment by Adrian — April 24, 2012 @ 3:04 pm BST Apr 24,2012 | Reply

  13. Hi Jonathan,
    Referring to your book CBO on page 269, “Unfortunately there are problems that still need addressing—lots of them. Let’s try to find
    a few questions about the limitations of the formulae.
    • What are you supposed to do if you have two or more join columns?”

    I would like to know whether the standard join cardinality formula is applicable if I have more than one join column. Could you please explain join cardinality for 3 or more join columns?

    When I tried with the following SQL, I couldn’t do it.

    SQL> @xplan
    Plan hash value: 4119620020

    -----------------------------------------------------------------------------------------------------------
    | Id  | Operation                    | Name               | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
    -----------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |                    | 96199 |    53M|       |   114K  (1)| 00:08:48 |
    |*  1 |  HASH JOIN                   |                    | 96199 |    53M|  4984K|   114K  (1)| 00:08:48 |
    |*  2 |   TABLE ACCESS BY INDEX ROWID| PS_PC_RES_PA_TA121 | 96199 |  3851K|       |  6517   (1)| 00:00:31 |
    |*  3 |    INDEX RANGE SCAN          | PSAPC_RES_PA_TA121 |   299K|       |       |   227   (1)| 00:00:02 |
    |*  4 |   TABLE ACCESS BY INDEX ROWID| PS_PC_RES_PA_TA021 |   785K|   403M|       | 72751   (1)| 00:05:36 |
    |*  5 |    INDEX RANGE SCAN          | PSAPC_RES_PA_TA021 |  2902K|       |       |  2308   (1)| 00:00:11 |
    -----------------------------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       1 - access("PR"."BUSINESS_UNIT_PO"="PRT"."BUSINESS_UNIT_PO" AND "PR"."PO_ID"="PRT"."PO_ID" AND
                  "PR"."LINE_NBR"="PRT"."LINE_NBR" AND "PR"."SCHED_NBR"="PRT"."SCHED_NBR" AND
                  "PR"."DISTRIB_LINE_NUM"="PRT"."DISTRIB_LINE_NUM" AND "PR"."DST_ACCT_TYPE"="PRT"."DST_ACCT_TYPE")
       2 - filter("PRT"."ANALYSIS_TYPE"='CCR' OR "PRT"."ANALYSIS_TYPE"='CRV')
       3 - access("PRT"."PROCESS_INSTANCE"=28022762)
       4 - filter("PR"."ANALYSIS_TYPE"='CCR' OR "PR"."ANALYSIS_TYPE"='CRV')
       5 - access("PR"."PROCESS_INSTANCE"=28022762)

    Join Predicates & NDVs

    PR.BUSINESS_UNIT_PO = 219
    PR.PO_ID = 308320
    PR.LINE_NBR = 142
    PR.SCHED_NBR = 24
    PR.DISTRIB_LINE_NUM = 56
    PR.DST_ACCT_TYPE = 3

    PRT.BUSINESS_UNIT_PO = 177
    PRT.PO_ID = 139136
    PRT.LINE_NBR = 133
    PRT.SCHED_NBR = 23
    PRT.DISTRIB_LINE_NUM = 42
    PRT.DST_ACCT_TYPE = 2

    Filter Predicates & NDVs

    PRT.ANALYSIS_TYPE = 3
    PR.ANALYSIS_TYPE = 5

    Table row count

    PS_PC_RES_PA_TA121 PRT Rows= 299160
    PS_PC_RES_PA_TA021 PR Rows= 2902548

    Thanks

    Comment by orapsdba — June 5, 2012 @ 12:55 am BST Jun 5,2012 | Reply

    • orapsdba,

      The first thing to look at is the meaning of “num_distinct” in the join selectivity formula. To understand what this is, you need to read the work on “selection without replacement” done by Alberto Dell’Era.

      Once you can calculate the correct selectivity for a single predicate, the selectivity for multiple predicates AND’ed together on a single join is the product of the separate selectivities – but there are several special cases, including the multi-column sanity check, the index sanity check, a limit of 1/num_rows, the effects of histograms, and the effects of Oracle transforming your query through transitive closure: so you do need to check that you are applying the calculations to the right columns.

      Depending on the version of Oracle there are also a couple of subtle bugs in the multi-column calculations that can make a (probably) small difference to the results.

      Comment by Jonathan Lewis — June 7, 2012 @ 5:44 pm BST Jun 7,2012 | Reply

  14. [...] 11g introduced a feature known as “extended stats” which allows you to create stats on expressions and on column groups. Once you’ve created [...]

    Pingback by Usage Stats « Oracle Scratchpad — January 24, 2013 @ 7:00 pm BST Jan 24,2013 | Reply

  15. […] (it works on 11g, too). All it does is create a table by copying from a well-known table, gather extended stats on a column group, then show you the resulting column names by querying view […]

    Pingback by Extended Stats | Oracle Scratchpad — October 4, 2013 @ 9:40 am BST Oct 4,2013 | Reply

  16. I’ve just added a link to Christian Antognini’s recent post on an odd feature of mixing extended stats and histograms.

    Comment by Jonathan Lewis — February 23, 2014 @ 10:28 am BST Feb 23,2014 | Reply

  17. […] that came up in an OTN thread earlier on this week [ed: actually 8th Jan 2014]. When you create column group stats, Oracle uses an undocumented function sys_op_combined_hash() to create a hash value, and if you […]

    Pingback by Extended stats | Oracle Scratchpad — May 4, 2014 @ 1:24 pm BST May 4,2014 | Reply

