Oracle Scratchpad

February 13, 2008

Statistics problem

Filed under: Oracle,Performance,Statistics,trace files,Troubleshooting — Jonathan Lewis @ 12:38 pm GMT Feb 13,2008

The worst type of Oracle bug is the one that seems to appear randomly and can’t be reproduced on demand – which is a problem when Oracle support says “please send us a reproducible test case”.

Here’s one such (probable) bug, which showed up at a client site reporting performance problems with a query that, on random days, chose a bad execution plan. The client was running one of the earlier versions of 9.2, and used the following call to dbms_stats to collect fresh table stats for each table in turn every night:

BEGIN
	dbms_stats.gather_table_stats(
		ownname			=> user,
		tabname			=> 'ABCDEFG',
		cascade			=> true,
		estimate_percent	=> dbms_stats.auto_sample_size,
		method_opt		=> 'for all indexed columns'
	);
END;
/

The trouble was that one of the indexes would occasionally end up with the wrong value for the number of distinct keys (user_indexes.distinct_keys) – and if this number was too low the cost of using the index for a particularly important query would become too high and the execution path would change from good to very bad.
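
A quick way to spot the symptom (a minimal sketch – the table name is the one from the example above) is to check the stored index statistics immediately after the gather:

select
	index_name, num_rows, distinct_keys, last_analyzed
from
	user_indexes
where
	table_name = 'ABCDEFG'
;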

Of course, for an index on a single column with a highly skewed distribution, an unlucky sample could introduce a bad estimate of the distinct_keys value – but the auto_sample_size is supposed to address this issue by making Oracle collect statistics at steadily growing sample sizes until it gets a stable result.

Running the procedure on the critical table showed queries like the following in a 10046 trace:

select
	/*+
		cursor_sharing_exact dynamic_sampling(0)
		no_monitoring no_expand index_ffs(t,"ABCDEFG_IND")
	*/
	count(*) as nrw,
	count(distinct sys_op_lbid(57954,'L',t.rowid)) as nlb,
	null as ndk,
	null as clf
from
	"THE_OWNER"."ABCDEFG"  sample block (   .0000100000)  t
where	"COLX" is not null
;

(See this note on my website – now corrected and enhanced in the script catalog – about how to use the sys_op_lbid() function to do a detailed check on the efficiency of an index.)
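
In outline the technique looks something like this (a sketch only – 57954 is the index object_id from the trace above, and the index_ffs() hint with the “not null” predicate makes Oracle walk the index so that each entry can be attributed to its leaf block):

select
	/*+ index_ffs(t,"ABCDEFG_IND") */
	sys_op_lbid(57954,'L',t.rowid)	as leaf_block,
	count(*)			as rows_per_leaf
from
	"THE_OWNER"."ABCDEFG" t
where
	"COLX" is not null
group by
	sys_op_lbid(57954,'L',t.rowid)
;

Leaf blocks holding far fewer entries than their neighbours suggest an index carrying a lot of wasted space.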

This query appeared three times in a row – with the sample size going through 0.00001%, 0.001%, and 0.1%. Then the sample jumped to roughly 15%, and the expressions aliased as ndk (number of distinct keys) and clf (clustering factor) changed to:

	count(distinct "COLX") as ndk,
	sys_op_countchg(substrb(t.rowid,1,15),1) as clf

(The sys_op_countchg() function is described in chapter 5 of my book Cost Based Oracle – Fundamentals. This chapter is available as a pdf in this (zip) download from Apress.)

The final run of the query didn’t use the sample clause at all, so it ran against 100% of the index. I actually produced a (traced) run where Oracle ran this query and then wrote into the database a “sample” (num_rows) that was correct – matching the number of entries in the index – but a number of distinct keys that was wrong by a factor of five (reporting a number close to 100,000 instead of 500,000).

Everything in the trace file suggested that the correct SQL statement was running, and the amount of work done by that SQL was appropriate – yet sometimes the results stored in the database were wrong. The value for num_distinct for the colx column, on the other hand, always seemed to be correct.

As a production issue there was a simple workaround – whenever gathering stats on this table, follow the gather_table_stats() call with a call to gather_index_stats() for just that index, using the compute option. The extra gather took just a few seconds to run and seemed to get the right figures every time.
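
In outline the nightly call became something like this (a sketch – the index name ABCDEFG_IND is assumed from the trace above, and passing estimate_percent => NULL to dbms_stats is the equivalent of a compute):

BEGIN
	dbms_stats.gather_table_stats(
		ownname			=> user,
		tabname			=> 'ABCDEFG',
		cascade			=> true,
		estimate_percent	=> dbms_stats.auto_sample_size,
		method_opt		=> 'for all indexed columns'
	);
	-- follow up with a compute on the problem index
	dbms_stats.gather_index_stats(
		ownname			=> user,
		indname			=> 'ABCDEFG_IND',
		estimate_percent	=> NULL
	);
END;
/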

Having found a workaround for the direct cause of the issue, the next problem was to find out whether the bug had been fixed or not.

I couldn’t reproduce the effect on a 9.2.0.8 database when I tried to approximate the data sizes and distribution.

I couldn’t experiment with the client’s production data to see if I could isolate what caused the problem.

I couldn’t find the problem mentioned on Metalink, either in the bug search or in the patch notes.

So maybe it’s a bug that’s still around – and maybe it happens to lots of systems all the time but doesn’t often get noticed; or maybe it’s a bug that needs a very specific set of coincidences to occur before it happens.

Whatever the case, the next time you see a query suddenly producing a dramatic performance swing because of a change in execution plan, check if the stats have just been collected and if the collection method was to gather table stats or schema stats with the auto_sample_size option. Then check exactly what the statistics on all the indexes on all the tables in that query look like – if the num_rows is accurate but the distinct_keys is badly off, you may have an example of the same problem.
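
If you want a direct cross-check (a sketch – potentially expensive on a large table; the table and column names are the ones from the example above):

select
	count(distinct colx)	as actual_ndk	-- compare with user_indexes.distinct_keys
from
	the_owner.abcdefg
;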

13 Comments »

  1. Hi Jonathan,

    Very informative. Can you please explain the parameter DBMS_STATS.AUTO_INVALIDATE, which lets Oracle decide when to invalidate dependent cursors?
    Gathering statistics normally invalidates the cursors that are dependent on the object we just gathered stats on. With DBMS_STATS.AUTO_INVALIDATE, in which cases will Oracle NOT invalidate dependent cursors?

    Comment by Ying — April 5, 2008 @ 1:32 pm GMT Apr 5,2008 | Reply

  2. Ying,

    According to the notes in the creation script, Oracle will invalidate relevant cursors over a period of time.

    This is controlled by the hidden parameter _optimizer_invalidation_period, which defaults to 18,000 seconds (five hours) and has the description: “time window for invalidation of cursors of analyzed objects”

    I read something about this quite recently – possibly on Greg Rahn’s blog – about how cursors were selected (randomly) and how you recognised a cursor that had been recreated because of this rolling invalidation (by checking v$sql_shared_cursor.roll_invalid_mismatch = ‘Y’).
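
    For reference, a sketch of the two checks (the x$ query has to run as SYS, and the roll_invalid_mismatch column appeared in 10g):

    select
    	a.ksppinm	as parameter_name,
    	b.ksppstvl	as parameter_value
    from
    	x$ksppi a, x$ksppcv b
    where
    	a.indx = b.indx
    and	a.ksppinm = '_optimizer_invalidation_period'
    ;

    select
    	address, child_number, roll_invalid_mismatch
    from
    	v$sql_shared_cursor
    where
    	roll_invalid_mismatch = 'Y'
    ;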

    Comment by Jonathan Lewis — April 16, 2008 @ 7:06 pm GMT Apr 16,2008 | Reply

  3. That article about sys_op_lbid is a beauty, as is this one.

    Comment by Paul — July 30, 2009 @ 1:38 am GMT Jul 30,2009 | Reply

  4. sys_op_countchg()

    A quick search on MetaLink shows precious little (practically nothing) about sys_op_countchg() – so, is it wise to promote its usage? Is it something that looks good, but could go nasty in a next release/patchset?

    Comment by Nigel — November 9, 2009 @ 1:21 pm GMT Nov 9,2009 | Reply

    • Nigel,

      I think you’re probably thinking of Chapter 5 of Cost Based Oracle – Fundamentals (the chapter that can be downloaded as a free pdf file from Apress) where I suggest that we could “hi-jack” the sys_op_countchg() function used by Oracle to calculate the clustering_factor of an index.

      You’re probably correct to suggest that I could have made some comments about the function being undocumented and that my suggestion to trace the gather_index_stats() call should be repeated on every upgrade of Oracle to ensure that the mechanism was still appropriate.

      In fact, as Richard Foote points out in one of the comments to his postings on Oracle Open World 2009, this may be particularly significant in some future (12+) version of Oracle because the Optimizer Group is looking at better ways of calculating the clustering_factor (and may, apparently, do something that looks remarkably similar to the suggestion I voiced in the book).

      Comment by Jonathan Lewis — November 9, 2009 @ 6:40 pm GMT Nov 9,2009 | Reply

  5. […] hours, re-execute the query, and then check V$SQL_SHARED_CURSOR.  The delay is mentioned here and here, along with a hidden _optimizer_invalidation_period parameter that controls how long after […]

    Pingback by Duplicates when Querying V$TEMPSEG_USAGE? « Charles Hooper's Oracle Notes — March 22, 2010 @ 6:03 am GMT Mar 22,2010 | Reply

  6. I know it is too late to comment now :-) but I thought I should share my experience with you.
    I have faced this issue many times on an Oracle 10gR2 database for E-Business Suite, and it seems that, somehow, under some condition I have not yet been able to identify, the gather stats job does not work as expected.
    To correct the issue I always use Analyze table .. compute statistics.

    Regards,
    Shadi

    Comment by Shadi — March 23, 2010 @ 10:08 am GMT Mar 23,2010 | Reply

    • Shadi,

      There are various differences between analyze and dbms_stats – not the least of which is what they collect stats on by default. In 10g the default is for dbms_stats to collect histograms, which analyze does not do, and that particular difference has affected a lot of people badly in the past.

      In your case there’s also the possible difference due to an estimate being taken by dbms_stats, compared to your compute from analyze.

      If you check the different behaviour of the two commands by testing on a small test schema you may be able to identify the specific detail that’s been giving you problems.
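
      As a starting point for that comparison, a sketch (dbms_stats.get_param() is the 10g interface, and the table name is a placeholder):

      select dbms_stats.get_param('method_opt') from dual;

      select
      	column_name, num_distinct, histogram, num_buckets
      from
      	user_tab_columns
      where
      	table_name = 'YOUR_TABLE'
      ;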

      Comment by Jonathan Lewis — March 23, 2010 @ 9:36 pm GMT Mar 23,2010 | Reply

  7. […] procedure counts the number of used leaf blocks and the number of distinct keys using a count distinct operation, which means you get an expensive […]

    Pingback by I wish « Oracle Scratchpad — December 13, 2011 @ 6:09 pm GMT Dec 13,2011 | Reply

  8. Hi Jonathan,

    Is there a way to change the plan of a running SQL statement? (It has been running for 8 hours and still has a long way to go.) It picked up the wrong plan due to wrong statistics. Is there a way to force the running SQL to take the new plan that we have identified?

    Thank you
    Suresh

    Comment by suresh — July 15, 2012 @ 10:18 pm GMT Jul 15,2012 | Reply

  9. Hi Jonathan,

    I caught this discussion following an issue I met with Oracle SPA. I’m testing an upgrade from 10.2.0.4 to 11.2.0.3.

    The second trial on 11g triggered a list of failing instructions, all of them using sys_op_lbid and getting an ORA-01760. The only note (ID 317254.1) I found on Metalink didn’t help. Sorry if I’m out of scope of the discussion, but I’m more than interested in any clue you may have.

    Thank you,
    Guillaume

    Comment by Guillaume — July 27, 2012 @ 1:08 pm GMT Jul 27,2012 | Reply

    • Guillaume,

      Two points – I have no idea what “a list of failing instructions” means (is it a set of SQL statements that raise errors?), and I don’t have total recall of all the Oracle error messages – so you might consider including the error text when quoting the error number.

      In your position I would set event 10046 at level 4 (bind variable capture) when running the code that fails, and then read the trace file. I would hope to see the actual failing statement (with the error code reported) and capture the current values of the bind variables that caused the error.
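
      For reference, the standard syntax (level 4 adds bind variable values to the basic trace):

      alter session set events '10046 trace name context forever, level 4';
      -- run the failing code, then:
      alter session set events '10046 trace name context off';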

      Comment by Jonathan Lewis — August 7, 2012 @ 6:11 pm GMT Aug 7,2012 | Reply

