The worst type of Oracle bug is the one that seems to appear randomly and can’t be reproduced on demand. (Such as when Oracle support says “please send us a reproducible test case”).
Here’s one such (probable) bug that showed up at a client site that was reporting performance problems with a query that, on random days, chose a bad execution plan. The client was running one of the earlier versions of 9.2, and using the following call to dbms_stats to collect fresh table stats for each table in turn every night.
BEGIN dbms_stats.gather_table_stats( ownname => user, tabname => 'ABCDEFG', cascade => true, estimate_percent => dbms_stats.auto_sample_size, method_opt => 'for all indexed columns' ); END; /
The trouble was that one of the indexes would occasionally end up with the wrong value for the number of distinct keys (user_indexes.distinct_keys) – and if this number was too low the cost of using the index for a particularly important query would become too high and the execution path would change from good to very bad.
Of course, for an index on a single column with a highly skewed distribution, an unlucky sample could introduce a bad estimate of the distinct_keys value – but the auto_sample_size is supposed to address this issue by making Oracle collect statistics at steadily growing sample sizes until it gets a stable result.
Running the procedure on the critical table showed queries like the following in a 10046 trace:
select /*+ cursor_sharing_exact dynamic_sampling(0) no_monitoring no_expand index_ffs(t,"ABCDEFG_IND") */ count(*) as nrw, count(distinct sys_op_lbid(57954,'L',t.rowid)) as nlb, null as ndk, null as clf from "THE_OWNER"."ABCDEFG" sample block ( .0000100000) t where "COLX" is not null ;
This query appeared three times in a row – with the sample size going through 0.00001%, 0.001%, and 0.1%. Then the sample jumped to roughly 15%, and the expressions aliased as ndk (number of distinct keys) and clf (clustering factor) changed to:
count(distinct "COLX") as ndk, sys_op_countchg(substrb(t.rowid,1,15),1) as clf
(The sys_op_countchg() function is described in chapter 5 of my book Cost Based Oracle – Fundamentals. This chapter is available as a pdf in this (zip) download from Apress.)
The final run of the query didn’t use the sample clause at all, so ran against 100% of the index, and I actually produced a (traced) run where Oracle ran this query, then wrote into the database a “sample” (num_rows) that was correct – matching the number of entries in the index – but a number of distinct keys that was wrong by a factor of five (reporting a number close to 100,000 instead of 500,000) .
Everything in the trace file suggested that the correct SQL statement was running, and the amount of work done by that SQL was appropriate – yet sometimes the results stored in the database were wrong. The value for num_distinct for the colx column, on the other hand, always seemed to be correct.
As a production issue there was a simple workaround – whenever gathering stats on this table, follow the gather_table_stats() call with a call to gather_index_stats() for just that index, using the compute option. The extra gather took just a few seconds to run and seemed to get the right figures every time.
Having found a workaround for the direct cause of the issue, the next problem was to find out whether the bug had been fixed or not.
I couldn’t reproduce the effect on a 188.8.131.52 database when I tried to approximate the data sizes and distribution.
I couldn’t experiment with the client’s production data to see if I could isolate what caused the problem.
I couldn’t find the problem mentioned on Metalink in the bug search or on the patch notes.
So maybe it’s a bug that’s still around – and maybe it happens to lots of systems all the time but doesn’t often get noticed; or maybe it’s a bug that needs a very specific set of coincidences to occur before it happens.
Whatever the case, the next time you see a query suddenly producing a dramatic performance swing because of a change in execution plan, check if the stats have just been collected and if the collection method was to gather table stats or schema stats with the auto_sample_size option. Then check exactly what the statistics on all the indexes on all the tables in that query look like – if the num_rows is accurate but the distinct_keys is badly off, you may have an example of the same problem.