Oracle Scratchpad

August 9, 2012

Debugging

Filed under: Bugs,Execution plans,Oracle,subqueries — Jonathan Lewis @ 7:04 pm BST Aug 9,2012

One of my recent assignments involved a company that had run into some performance problems after upgrading from 10.2.0.3 to 11.2.0.2. We had spent half an hour on the phone discussing the system before I had arrived, and I’d made a couple of suggestions that had solved most of their problems before I got on site – but they still wanted me to come in and give them some specific ideas about why the critical part of the solution had helped.

The most critical piece of advice I had given them (after listening very carefully to their description of the system) was to get rid of ALL the histograms they had on their system, and then watch very carefully for any signs that they might need to re-introduce a handful of histograms over the next few weeks.

One of their critical queries completed in less that 2 seconds when histograms were removed, but took 33 seconds to complete when histograms were in place. With their permission, the following notes record my investigation of this puzzle, the underlying Oracle bug (possibly not yet documented) that caused it, and the optimum workaround that was available to them.

The Query

The application vendor had created a number of views that were designed to hide complexity from the Java programmers, so the problem query looked very simple:

SELECT
	*
FROM
	wltuser.VLADUNGEN_FREI
WHERE
	MATERIALNR = '117098'
AND	wltuser.LVSKOMM.FuncSel_BatchReservationOK0(14536, LadungNr) = 0
AND	wltuser.LVSKOMM.FuncSel_QAReservationOK0(14536, ChargeNr, StatusQS) = 0
AND	1=1
AND	((UPPER (zugeordnetfktbereich) LIKE  UPPER ('Warehouse_G') ))
ORDER BY
	ladungnr
;

There are a couple clues in this text suggesting that it’s generated code – the odd “1=1″ construct, the quirky mixture of case (particularly visible in the first predicate – which also look suspect because the name suggests “number” and the comparison is with a number that has been turned into a character string). Moreover, this is just one of several variants of the same query using different constants. (A significant fraction of the work done by the system is “hard parsing”- and in this case that’s an issue worth pursuing with the vendor.)

You’ve probably also noticed two packaged functions in the where clause (such things can be really nasty: what do they do and how much “hidden” time do they take) and a possible “index-busting” upper() function.

As a last little complication, you might note that the vendor is German (or, at least Germanic). It slows me down a bit when I have to keep asking questions like “what does VLADUNGEN_FREI mean?”; it’s much easier to recognise threats in execution plans if you have a more intuitive understanding of what the tables and columns represent.

The client had previous traced various slow queries, by the way, and had noted that a single execution of some “user” queries would result in literally tens of thousands of executions of other (very simple, very precise, very efficent – yet threateningly repetitive) queries. Several of the pl/sql functions, like the ones above, were used by the system to perform simple lookups against the database – which were then used in where clauses to eliminate data. (Tip: sometimes you can declare such functions to be deterministic, sometimes (select function() from dual) is much more efficient than a simple call to the function).

The execution plans.

Since it was easy to reproduce both execution plans (by dropping or recreating histograms) the obvious first step in the investigation was to compare execution plans, ideally with rowsource execution statistics enabled. So that’s what we did in a backup copy of the production data. Here’s what we got – first the bad plan:

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                         |      1 |        |     61 |00:00:00.03 |    1821K|
|   1 |  SORT AGGREGATE                         |                         |     61 |      1 |     61 |00:00:00.01 |     129 |
|   2 |   TABLE ACCESS BY INDEX ROWID           | LADUNGEN                |     61 |      8 |    266 |00:00:00.01 |     129 |
|*  3 |    INDEX RANGE SCAN                     | LADSAMMELINDEX          |     61 |      8 |    266 |00:00:00.01 |      54 |
|*  4 |  FILTER                                 |                         |      1 |        |     61 |00:00:00.03 |    1821K|
|   5 |   NESTED LOOPS OUTER                    |                         |      1 |      1 |     83 |00:00:00.03 |    1821K|
|   6 |    NESTED LOOPS ANTI                    |                         |      1 |      1 |     83 |00:00:33.95 |    1821K|
|*  7 |     FILTER                              |                         |      1 |        |     83 |00:00:33.95 |    1821K|
|   8 |      NESTED LOOPS OUTER                 |                         |      1 |      1 |     83 |00:00:00.03 |    1821K|
|*  9 |       FILTER                            |                         |      1 |        |     83 |00:00:33.95 |    1821K|
|  10 |        NESTED LOOPS OUTER               |                         |      1 |      1 |     98 |00:00:00.03 |    1820K|
|  11 |         NESTED LOOPS                    |                         |      1 |      1 |     98 |00:00:18.40 |    1820K|
|  12 |          NESTED LOOPS                   |                         |      1 |      1 |     98 |00:00:18.39 |    1820K|
|  13 |           TABLE ACCESS BY INDEX ROWID   | LVSMATERIALSTAMM        |      1 |      1 |      1 |00:00:00.01 |       3 |
|* 14 |            INDEX UNIQUE SCAN            | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |00:00:00.01 |       2 |
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |00:00:18.39 |    1820K|
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |      1 |     36 |  69157 |00:00:22.40 |    1808K|
|  17 |             SORT GROUP BY               |                         |  69296 |      1 |  69296 |00:00:00.86 |   22907 |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |  69296 |      8 |  30356 |00:00:00.45 |   22907 |
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |  69296 |      8 |  30356 |00:00:00.29 |   12827 |
|  20 |          TABLE ACCESS BY INDEX ROWID    | MATERIALSTAMM           |     98 |      1 |     98 |00:00:00.01 |     111 |
|* 21 |           INDEX UNIQUE SCAN             | MATERIALSTAMMPK         |     98 |      1 |     98 |00:00:00.01 |      13 |
|  22 |         TABLE ACCESS BY INDEX ROWID     | LVSCHARGE               |     98 |      1 |     97 |00:00:00.01 |     110 |
|* 23 |          INDEX UNIQUE SCAN              | LVSCHARGEPK             |     98 |      1 |     97 |00:00:00.01 |      13 |
|  24 |       TABLE ACCESS BY INDEX ROWID       | LVSKOMMAUFTRAGPOSLADUNG |     83 |  23250 |      0 |00:00:00.01 |      59 |
|* 25 |        INDEX RANGE SCAN                 | POSLADUNG_LADUNG_I      |     83 |      1 |      0 |00:00:00.01 |      59 |
|  26 |     VIEW PUSHED PREDICATE               | VW_NSO_1                |     83 |      1 |      0 |00:00:00.01 |       0 |
|* 27 |      FILTER                             |                         |     83 |        |      0 |00:00:00.01 |       0 |
|  28 |       NESTED LOOPS                      |                         |      0 |        |      0 |00:00:00.01 |       0 |
|  29 |        NESTED LOOPS                     |                         |      0 |      1 |      0 |00:00:00.01 |       0 |
|  30 |         TABLE ACCESS BY INDEX ROWID     | LADUNGEN                |      0 |      4 |      0 |00:00:00.01 |       0 |
|* 31 |          INDEX RANGE SCAN               | LADSAMMELINDEX          |      0 |      4 |      0 |00:00:00.01 |       0 |
|* 32 |         INDEX RANGE SCAN                | POSLADUNG_LADUNG_I      |      0 |      1 |      0 |00:00:00.01 |       0 |
|* 33 |        TABLE ACCESS BY INDEX ROWID      | LVSKOMMAUFTRAGPOSLADUNG |      0 |      1 |      0 |00:00:00.01 |       0 |
|  34 |    TABLE ACCESS BY INDEX ROWID          | PASXBATCHSTATUS         |     83 |      1 |     82 |00:00:00.01 |      90 |
|* 35 |     INDEX UNIQUE SCAN                   | PASXBATCHSTATUSUNIQUE   |     83 |      1 |     82 |00:00:00.01 |       8 |
|  36 |   SORT AGGREGATE                        |                         |      0 |      1 |      0 |00:00:00.01 |       0 |
|* 37 |    INDEX SKIP SCAN                      | FSRESERVIERUNGEN        |      0 |      1 |      0 |00:00:00.01 |       0 |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("TSU"."SAMMELLADUNGNR"=:B1)
   4 - filter(DECODE("MAT"."CHARGENPFLICHTIG",22,21,DECODE(NVL("STA"."ALLOWEDFORPICKING",23),901,21,902,DEC
              ODE( (SELECT COUNT(*) FROM WLTUSER."FSRESERVIERUNGEN" "RS" WHERE
              "RS"."FSRS_CHARGENNR"=:B1),0,22,21),903,DECODE("LVS"."DEFAULTQSSTAT",NVL("CHA"."STATUSQS",23),21,22),22))=2
              1)
   7 - filter("KOM"."BEARBSTATUSKEY" IS NULL)
   9 - filter("LVSKOMM"."FUNCSEL_QARESERVATIONOK0"(14536,"LAD"."CHARGENR","CHA"."STATUSQS")=0 AND
              ("CHA"."VERFALLDATUM" IS NULL OR "CHA"."VERFALLDATUM">=SYSDATE@!))
  14 - access("LVS"."MATERIALNR"='117098')
  15 - filter("LAD"."MATERIALNR"='117098' AND "LAD"."LADUNGART"=1011 AND
              UPPER("LAD"."ZUGEORDNETFKTBEREICH") LIKE 'WAREHOUSE_G' AND "LAD"."INTRANSPORT"=22 AND
              "LAD"."INBEARBEITUNG"=22 AND "LAD"."INENTNAHME"=22 AND "LAD"."ININVENTUR"=22 AND
              "LAD"."ZUGEORDNETSTANDORT"="LVS"."STANDORTKEY")
  16 - filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LAD"."LADUNGNR")=0 AND  (SELECT
              COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1)
  19 - access("TSU"."SAMMELLADUNGNR"=:B1)
  21 - access("MAT"."MATERIALNR"='117098' AND "LAD"."ZUGEORDNETSTANDORT"="MAT"."STANDORTKEY")
  23 - access("CHA"."MATERIALNR"(+)='117098' AND "LAD"."CHARGENR"="CHA"."CHARGENR"(+))
  25 - access("LAD"."LADUNGNR"="KOM"."LADUNGNR"(+))
  27 - filter((-1)<>"LAD"."SAMMELLADUNGNR")
  31 - access("PAL"."SAMMELLADUNGNR"="LAD"."SAMMELLADUNGNR")
       filter("PAL"."SAMMELLADUNGNR"<>(-1) AND "PAL"."SAMMELLADUNGNR"<>0)
  32 - access("KLA"."LADUNGNR"="PAL"."LADUNGNR")
  33 - filter("KLA"."BEARBSTATUSKEY"=4603)
  35 - access("STA"."QSSTATUSKEY"(+)=NVL("CHA"."STATUSQS",23))
  37 - access("RS"."FSRS_CHARGENNR"=:B1)
       filter("RS"."FSRS_CHARGENNR"=:B1)

Editor's comments:
==================
Line 17 (SORT GROUP BY) also recorded
    ----------------------------
    |  OMem |  1Mem | Used-Mem |
    ----------------------------
    |  2048 |  2048 | 2048  (0)|
    ----------------------------

I acquired the plan structure from a calld to dbms_xplan.display_cursor()
I acquired the complete predicate from calls to explain plan and dbms_xplan.display()

You can’t help noticing that the plan is a lot more complex than you might expect – the query is running against a complex view whose definition, unfortunately, I forget to pick up from the client; however, here are a few notes about the view definition that appear in the execution plan:

  • Lines 1-2 represent a scalar subquery in the view’s select list.
  • Lines 4,36,37 represent an inline scalar subquery inside a decode() in the view’s where clause.
  • Lines 26-33 represents a NOT IN subquery that has been unnested into a non-mergeable view that has then been subject to predicate pushing so that it has become the second “table” in a nested loop ANTI-JOIN driven from line 6 of the plan.
  • Lines 17-19 represents a correlated scalar subquery in the view’s where clause.
  • Line 9 is where we can see one of the pl/sql functions of the main query being called for each row
  • Line 16 is where we can see the other pl/sql function being called

The shape of the plan around lines 15 to 19 may be a little confusing. There is a filter subquery in that part of the plan, but there is no FILTER operation visible in the plan itself. There are several cases where filter operations “disappear” and this can make it easy to mis-read plans, so I’m just going to show you an edited version of those lines to make the execution a little clearer.


------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------------------------
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |
|* 16a|            FILTER                       |                         |      1 |     36 |  69157 |
|  16b|             INDEX FULL SCAN             | LADUNGENPK              |      1 |     36 |  69296 |
|  17 |             SORT GROUP BY               |                         |  69296 |      1 |  69296 |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |  69296 |      8 |  30356 |
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |  69296 |      8 |  30356 |
------------------------------------------------------------------------------------------------------

  15 - filter("LAD"."MATERIALNR"='117098' AND "LAD"."LADUNGART"=1011 AND
              UPPER("LAD"."ZUGEORDNETFKTBEREICH") LIKE 'WAREHOUSE_G' AND "LAD"."INTRANSPORT"=22 AND
              "LAD"."INBEARBEITUNG"=22 AND "LAD"."INENTNAHME"=22 AND "LAD"."ININVENTUR"=22 AND
              "LAD"."ZUGEORDNETSTANDORT"="LVS"."STANDORTKEY")
  16a- filter(SELECT COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1)
  16b- filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LAD"."LADUNGNR")=0

In this edited extract we see that Oracle has done a full scan (not a fast full scan) of index LADUNGENPK at line 16b and, for each index entry, called one of our pl/sql functions; any row that survives the function call is then used in the filter at line 16a to drive the aggregate subquery (starting at line 17) to determine whether or not the entry should be used to access the table.

Since I got the execution plan statistics by adding the hint /*+ gather_plan_statistics */ to the query we can’t be completely confident in the accuracy of the time reported – it’s based on sampling – but we do see a lot of time spent on these lines, and an index full scan on a table with thousands of rows generally looks like a bad idea, and calling a pl/sql function thousands of time looks like a bad idea, and running an aggregate subquery thousands of times looks like a bad idea – especially since we end up acquiring only 98 rows from the table: so at first sight we might have our attention drawn to this part of the plan. But before diving in too deeply, let’s look at the good execution plan:

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                         |      1 |        |     61 |00:00:00.11 |    9661 |       |       |          |
|   1 |  SORT AGGREGATE                       |                         |     61 |      1 |     61 |00:00:00.01 |     125 |       |       |          |
|   2 |   TABLE ACCESS BY INDEX ROWID         | LADUNGEN                |     61 |      8 |    266 |00:00:00.01 |     125 |       |       |          |
|*  3 |    INDEX RANGE SCAN                   | LADSAMMELINDEX          |     61 |      8 |    266 |00:00:00.01 |      51 |       |       |          |
|   4 |  SORT ORDER BY                        |                         |      1 |      1 |     61 |00:00:00.11 |    9661 | 24576 | 24576 |22528  (0)|
|*  5 |   FILTER                              |                         |      1 |        |     61 |00:00:00.05 |    9536 |       |       |          |
|   6 |    NESTED LOOPS ANTI                  |                         |      1 |      1 |     83 |00:00:00.17 |    9381 |       |       |          |
|*  7 |     FILTER                            |                         |      1 |        |     83 |00:00:00.17 |    9381 |       |       |          |
|   8 |      NESTED LOOPS OUTER               |                         |      1 |      1 |     83 |00:00:00.01 |    9381 |       |       |          |
|   9 |       NESTED LOOPS OUTER              |                         |      1 |      1 |     83 |00:00:00.17 |    9325 |       |       |          |
|* 10 |        FILTER                         |                         |      1 |        |     83 |00:00:00.17 |    9239 |       |       |          |
|  11 |         NESTED LOOPS OUTER            |                         |      1 |      1 |     98 |00:00:00.01 |    9045 |       |       |          |
|  12 |          NESTED LOOPS                 |                         |      1 |      1 |     98 |00:00:00.05 |    8939 |       |       |          |
|  13 |           NESTED LOOPS                |                         |      1 |      1 |     98 |00:00:00.05 |    8832 |       |       |          |
|  14 |            TABLE ACCESS BY INDEX ROWID| LVSMATERIALSTAMM        |      1 |      1 |      1 |00:00:00.01 |       3 |       |       |          |
|* 15 |             INDEX UNIQUE SCAN         | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |00:00:00.01 |       2 |       |       |          |
|* 16 |            TABLE ACCESS BY INDEX ROWID| LADUNGEN                |      1 |      1 |     98 |00:00:00.05 |    8829 |       |       |          |
|* 17 |             INDEX RANGE SCAN          | LADMATCHARGEINDEX       |      1 |     46 |   1137 |00:00:00.01 |       5 |       |       |          |
|  18 |           TABLE ACCESS BY INDEX ROWID | MATERIALSTAMM           |     98 |      1 |     98 |00:00:00.01 |     107 |       |       |          |
|* 19 |            INDEX UNIQUE SCAN          | MATERIALSTAMMPK         |     98 |      1 |     98 |00:00:00.01 |       9 |       |       |          |
|  20 |          TABLE ACCESS BY INDEX ROWID  | LVSCHARGE               |     98 |      1 |     97 |00:00:00.01 |     106 |       |       |          |
|* 21 |           INDEX UNIQUE SCAN           | LVSCHARGEPK             |     98 |      1 |     97 |00:00:00.01 |       9 |       |       |          |
|  22 |        TABLE ACCESS BY INDEX ROWID    | PASXBATCHSTATUS         |     83 |      1 |     82 |00:00:00.01 |      86 |       |       |          |
|* 23 |         INDEX UNIQUE SCAN             | PASXBATCHSTATUSUNIQUE   |     83 |      1 |     82 |00:00:00.01 |       4 |       |       |          |
|  24 |       TABLE ACCESS BY INDEX ROWID     | LVSKOMMAUFTRAGPOSLADUNG |     83 |  23244 |      0 |00:00:00.01 |      56 |       |       |          |
|* 25 |        INDEX RANGE SCAN               | POSLADUNG_LADUNG_I      |     83 |      1 |      0 |00:00:00.01 |      56 |       |       |          |
|  26 |     VIEW PUSHED PREDICATE             | VW_NSO_1                |     83 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 27 |      FILTER                           |                         |     83 |        |      0 |00:00:00.01 |       0 |       |       |          |
|  28 |       NESTED LOOPS                    |                         |      0 |        |      0 |00:00:00.01 |       0 |       |       |          |
|  29 |        NESTED LOOPS                   |                         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  30 |         TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      0 |      8 |      0 |00:00:00.01 |       0 |       |       |          |
|* 31 |          INDEX RANGE SCAN             | LADSAMMELINDEX          |      0 |      8 |      0 |00:00:00.01 |       0 |       |       |          |
|* 32 |         INDEX RANGE SCAN              | POSLADUNG_LADUNG_I      |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 33 |        TABLE ACCESS BY INDEX ROWID    | LVSKOMMAUFTRAGPOSLADUNG |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|  34 |    SORT GROUP BY                      |                         |     83 |      1 |     83 |00:00:00.01 |     155 |  2048 |  2048 |          |
|  35 |     TABLE ACCESS BY INDEX ROWID       | LADUNGEN                |     83 |      8 |    377 |00:00:00.01 |     155 |       |       |          |
|* 36 |      INDEX RANGE SCAN                 | LADSAMMELINDEX          |     83 |      8 |    377 |00:00:00.01 |      65 |       |       |          |
|  37 |    SORT AGGREGATE                     |                         |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
|* 38 |     INDEX SKIP SCAN                   | FSRESERVIERUNGEN        |      0 |      1 |      0 |00:00:00.01 |       0 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------

The difference that leaps off the page when you view this plan is that the big numbers (69,000) in the starts column have disappeared. Tracking through the detail we can see that the critical scalar subquery has been postponed almost to the last step (lines 34 – 36) and has been executed only 83 times. Tracking back up the plan to see why, we get back to the first steps of physical execution – lines 13 – 17 in the new plan (12 – 16 in the old plan). In both cases the initial steps of the plan form a nested loop between LVSMATERIALSTAMM and LADUNGEN. Here’s the comparison:

Bad Plan
------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------------------------------------------
|  12 |          NESTED LOOPS                   |                         |      1 |      1 |     98 |
|  13 |           TABLE ACCESS BY INDEX ROWID   | LVSMATERIALSTAMM        |      1 |      1 |      1 |
|* 14 |            INDEX UNIQUE SCAN            | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |      1 |      1 |     98 |
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |      1 |     36 |  69157 |
------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                    | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------------------------------
|  13 |           NESTED LOOPS                |                         |      1 |      1 |     98 |
|  14 |            TABLE ACCESS BY INDEX ROWID| LVSMATERIALSTAMM        |      1 |      1 |      1 |
|* 15 |             INDEX UNIQUE SCAN         | LVSMATERIALSTAMMPK      |      1 |      1 |      1 |
|* 16 |            TABLE ACCESS BY INDEX ROWID| LADUNGEN                |      1 |      1 |     98 |
|* 17 |             INDEX RANGE SCAN          | LADMATCHARGEINDEX       |      1 |     46 |   1137 |
----------------------------------------------------------------------------------------------------

It’s the same nested loop starting with the same (guaranteed) single row from LVSMATERIALSTAMM. When we have histograms the optimizer picks a full scan on the primary key index LADUNGENPK predicting 36 index entries and getting 69,000, but when we don’t have histograms the optimizer picks a range scan on index LADMATCHARGEINDEX predicting 46 index entries and getting 1,137. The bad choice of index by itself wouldn’t be a disaster – except Oracle ends up calling a pl/sql function thousands of times that executes another SQL statement, and then runs an aggregate subquery thousands of times for anything that survives the function call.

A little detail that you might spot further up the plan, by the way, is that when we drive off the primary key index (bad plan) we DON’T end up with a SORT ORDER BY operation in the plan (see line 4 of the good plan). There is an ORDER BY clause in the original query that is satisfied by the full scan on LADUNGENPK and maintained by the optimizer’s choice of all the subsequent nested loops; when we access the LADUNGEN data through the LADMATCHARGEINDEX we end up needing to sort it.

So there are two questions that we need to answer: how do we make Oracle pick the right index, and why does Oracle choose the wrong index when the histogram is in place. The answer to the second question is the easier one – when we have a histogram Oracle’s estimate of the rows returned by the range scan on LADMATCHARGEINDEX changes from 46 (the average number of rows per MATERIALNR) to 1,137 (the actual number of rows for MATERIALNR=’117098′, because that value happens to be a popular value. With the larger estimate (up by a factor of 25) the increased cardinality and costs cascade through the plan and the final cost of the plan is much higher than the cost of the accepted plan. In this example, better information has resulted in a worse plan.

The Buggy bit

The (nominally correct) increase, though, is only part of the problem. There’s another, simultaneous, issue that causes the bad plan to appear as the quality of information improves, and I can best demontrate this by looking at some partial execution plans, starting with an extract from the bad plan when I run it through explain plan.


--------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                    | Rows  | Bytes | Cost (%CPU)|
--------------------------------------------------------------------------------------------------------
|* 15 |           TABLE ACCESS BY INDEX ROWID   | LADUNGEN                |     1 |   234 |   258   (2)|
|* 16 |            INDEX FULL SCAN              | LADUNGENPK              |    36 |       |   251   (2)|
|  17 |             SORT GROUP BY               |                         |     1 |    12 |            |
|  18 |              TABLE ACCESS BY INDEX ROWID| LADUNGEN                |     8 |    96 |     3   (0)|
|* 19 |               INDEX RANGE SCAN          | LADSAMMELINDEX          |     8 |       |     1   (0)|
--------------------------------------------------------------------------------------------------------

What I’d like you to note is that the full scan of LADUNGENPK, with the inclusion of the function call for each index entry, and the calls to the subquery, followed by the (predicted) 36 visits to the table has a total cost of 258 (Line 15). Let’s break this down, first by checking the predicted cost of just the index full scan, then the predicated cost of the index full scan with the function calls.


SQL> set autotrace traceonly explain
SQL> select /*+ index(ladungen) */ ladungnr from wltuser.ladungen;

Execution Plan
----------------------------------------------------------
Plan hash value: 3373915778

-------------------------------------------------------------------------------
| Id  | Operation        | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |            | 71590 |   489K|   249   (1)| 00:00:04 |
|   1 |  INDEX FULL SCAN | LADUNGENPK | 71590 |   489K|   249   (1)| 00:00:04 |
-------------------------------------------------------------------------------

SQL> select /*+ index(ladungen) */ ladungnr from wltuser.ladungen
  2* where wltuser."LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LADUNGNR")=0

Execution Plan
----------------------------------------------------------
Plan hash value: 3373915778

-------------------------------------------------------------------------------
| Id  | Operation        | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |            |   716 |  5012 |   277  (11)| 00:00:04 |
|*  1 |  INDEX FULL SCAN | LADUNGENPK |   716 |  5012 |   277  (11)| 00:00:04 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("LVSKOMM"."FUNCSEL_BATCHRESERVATIONOK0"(14536,"LADUNGNR")=0)

From the first plan we see that the optimizer knows that there are 71,590 entries in the index, and a full scan costs 249.
From the second plan we see that the incremental cost of calling the function 71,590 times is 28, and Oracle’s selectivity for the function call is one of its “guesses” – in this case 1%.

But from the earlier extract we see the cardinality prediction for the index is 36 entries – which is 5% of 716, and that’s because Oracle has a standard guess of 5% for certain types of subqueries (particularly subqueries with range predicates), and our use of the subquery comes down to:

    (SELECT COUNT(DISTINCT "TSU"."MATERIALNR") FROM WLTUSER."LADUNGEN" "TSU" WHERE "TSU"."SAMMELLADUNGNR"=:B1)<=1

More dramatically, from the extract we see that the total cost of the index full scan, function calls, and subquery executions is 251 – an increment of only 2 over the cost of the basic index full scan. Here’s an interesting coincidence: 5% (subquery selectivity) of 28 (cost of 71,590 function calls) = 2 (when rounded up).

I think that Oracle has applied the 5% subquery factor to the number of rows in the index before applying the cost of calling the function, and then it hasn’t added ANY cost to the full scan to allow for the number of times that it will have to call the subquery. The entire calculation strategy is broken and, as a result, Oracle underestimates the cost of this part of the plan by a huge amount – and that makes the “correct” cost of using the primary key index appear to be the worst choice.

By this time I had already spent about an hour picking the problem apart – so I suggested that as a first (easy) step we should modify the code to tell the optimizer to run the subquery late (as it did in the good plan). I was fairly confident that this would improve things and, given the incorrect application of the 5% selectivity, might result in a switch back to the correct index.

This is where the vendor’s design strategy came to our aid. The code embedded in the application was generally constructed dynamically but very simple. Most of the complexity was in stored views. So I recreated the view VLADUNGEN_FREI with the hint /*+ no_push_subq */ embedde – and luckily the view definition allowed me to put that hint into precisely the subquery where it was needed. The effect was exactly what we wanted to see; with the delayed appearance of the subquery the COST of the wrong index made the plan much more expensive and the number of times the function and the subquery were called at run_time dropped dramatically, so the query dropped back to two-second time to completion. (There was still a slight variation in the subsequent order of tables thanks to slight variations in the arithmetic derived for the histograms, though.)

Conclusion

I didn’t want to spend any more of the client’s time on this problem – we were, after all, only trying to determine why dropping histograms had made such a difference to some of their queries. This example demonstrated an important principle, but the client had got rid of the histograms in production anyway.

The underlying problem had two components. The presence of a histogram had given Oracle better information about the data and had therefore increased the cost of the ideal execution plan; this would not have mattered if there had not been some defects in the handling of function calls and subquery costs for filters against indexes which made an alternative plan with a different index produce a lower cost estimate despite consuming far more resources at run-time.

If we had not been dropping the histograms anyway, we would have been able to work around the problem because the vendor embedded most of the code complexity in views – so we could add hints to the view definitions without having to wait for the vendor to modify the front-end code, and without having to do massive regression tests to prove the our changes still produced the same result sets.

10 Comments »

  1. As an aside Jonathan I’m interested in your comment

    ” we would have been able to work around the problem because the vendor embedded most of the code complexity in views – so we could add hints to the view definitions ”

    I think I’d be a bit worried that the hinted view might well make *this* bit of generated code run well (or at least for problem values) but that the view might be in use in multiple places with different semantics and that a hint that benefited us here would be detrimental elsewhere. I wonder how you might triage this potential side effect?

    Comment by nlitchfield — August 9, 2012 @ 9:31 pm BST Aug 9,2012 | Reply

    • Niall,

      That would have been tough.
      I think in this particular case, and with this particular hint, there’s a fairly good chance we would have been safe. The /*+ no_push_subq */ hint falls into the class that I call “strategic” hints – it’s not what I call micro-management, and that tends to make it a safer hint to use in these circumstances. (I’d be very unhappy about using a hint like “index(tableX, indexY)” because of the high probability that Oracle would use different join orders in different uses of the view – then use the index as directed with catastrophic results.)

      Historically the optimizer always used to leave subqueries to the last possible moment in an execution plan, so giving it an explicit hint to do so is something was fairly likely to recreate the pre-upgrade execution plan, and fairly likely to do “not much” damage if it wasn’t the ideal solution in every case.

      You’re right to raise the point though.

      In this case we could have been in serious trouble if dropping the histograms hadn’t worked. If the application had been using bind variables in all the appropriate places we could have generated an SQL Plan baseline for this query – but it was using literals, and there’s no way (at present) to “force match” a baseline to a query in the way that you can with profiles. (Possibly we could have used Kerry’s mechanism for cheating and forcing a complete sets of hints into the system as a “force match” profile – but I wouldn’t want to do that on a production system).

      We could have investigated the effect of setting the optimizer_mode back to the earlier version to see how many things got fixed and how many new things broke. We might have tried running with cursor_sharing = force so that we could use attach an SQL Plan baseline to the query with bind variable substitution.

      We might simply have got lucky and found that this particular view had been created for this particular piece of front-end code – but we would certainly have had to search the entire applicatin for every references to the view and tested for side effects.

      Comment by Jonathan Lewis — August 9, 2012 @ 10:03 pm BST Aug 9,2012 | Reply

      • “Historically the optimizer always used to leave subqueries to the last possible moment in an execution plan”

        Did this change with Oracle 11 ?

        Comment by Todor Botev — August 12, 2012 @ 6:44 am BST Aug 12,2012 | Reply

        • Todor,

          Sorry about the delay answering – I’ve done 3 US, two conferences, and two seminars in the week.
          The core of the change occurred in 10g with cost-based query transformation.

          Comment by Jonathan Lewis — August 19, 2012 @ 4:09 pm BST Aug 19,2012

  2. Simply: thanks a lot for this elegant and structured strategy for solving a performance problem.

    Just one supplementary question: when I see the presence of INDEX FULL SCAN coupled with the absence of a SORT OPRDER BY operation (while the query contains an ORDER BY) I immediately think about the optimizer being in a FIRST_ROWS mode. Although I don’t see in the query what could make the CBO jumping to a FIRST_ROWS mode, have you investigated this option before?

    Comment by hourim — August 10, 2012 @ 7:26 am BST Aug 10,2012 | Reply

    • Mohamed,

      Thanks for the compliment.

      Your comment is very useful – it’s very easy to overlook a critical detail when trying to work out why an execution plan is so bad and your suggestion that “index full scan” and a missing “sort order by” is an excellent candidate for being on a “quick check” list before you start doing anything complex to analyze the problem. It’s also an important indication of how important it is to get a current execution plan from memory rather than relying on “explain plan”.

      Part of the analysis I haven’t shown is the “Outline” content for this plan. This would show any local settings for optimizer parameters as opt_param() hints, but also show things like optimizer_features_enable() and (most importantly from your perspective) any hints like first_rows(1).

      Comment by Jonathan Lewis — August 10, 2012 @ 11:34 am BST Aug 10,2012 | Reply

  3. You say “I think that Oracle has applied the 5% subquery factor to the number of rows in the index before applying the cost of calling the function, and then it hasn’t added ANY cost to the full scan to allow for the number of times that it will have to call the subquery. ”

    I have found that Oracle never adds any additional cost for user-defined PL/SQL functions and this can lead to really sub-optimal plans because a condition like “myfunction( mColumn ) = 0″ always has a default selectivity of 5% and an additional cost of 0, even if it involves complex subqueries .
    Did you try to use ASSOCIATE STATISTICS to inform the CBO of the true cost of the function, or do you think that bugs like 12332294 make the “ASSOCIATE STATISTICS ” command useless?

    Regards,
    Andrea

    Comment by Andrea Monti — August 13, 2012 @ 10:22 am BST Aug 13,2012 | Reply

    • Andrea,

      I am a little surprised by your comment that: “Oracle never adds any additional cost for user-defined PL/SQL functions”, since my example seems to show quite clearly that Oracle has added a cost in this case.

      Do you have a simple demonstration that I can use to recreate your observation – and have you tested it on the latest version of Oracle ?

      Looking through my notes I find that I haven’t done anything with “associate statistics” since 2006, so I’m not in a position to make any comments about what that might do, and I haven’t come across but 12332294 before. You question did make me wonder, though, whether your comment about “no cost” applied to operators rather than functions.

      Comment by Jonathan Lewis — August 19, 2012 @ 4:16 pm BST Aug 19,2012 | Reply

  4. [...] noting that Jonathan Lewis was very successful in tuning a customer’s query by simply deleting histograms: The most [...]

    Pingback by Delete Column Histograms to Improve SQL Plan Stability » Eddie Awad's Blog — August 13, 2012 @ 9:02 pm BST Aug 13,2012 | Reply

  5. “The most critical piece of advice I had given them (after listening very carefully to their description of the system) was to get rid of ALL the histograms they had on their system, and then watch very carefully for any signs that they might need to re-introduce a handful of histograms over the next few weeks.”

    I too find this to be the correct strategy for applying histograms to a system – start with none and look whether there are system parts that need some. Is it a general approach you advise your customers? Or was there any specific bit of information (e.g. using literals in the predicates) that made you think this would make the system perform better?

    Comment by Todor Botev — August 23, 2012 @ 8:18 am BST Aug 23,2012 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 4,015 other followers