Oracle Scratchpad

December 21, 2015

Parallel Plans

Filed under: 12c,Oracle,Parallel Execution — Jonathan Lewis @ 8:40 am BST Dec 21,2015

This article was prompted by a pair of articles by Yasin Baskan of Oracle Corporation: PX Server Sets etc. and Multiple Parallelizers, plus a little extra prompting from a mistake that I made when reading the second of those two articles. The fact that I made a mistake is significant because, without it, I wouldn’t have created a model to check Yasin’s description of the parallel activity.

I want to examine the following query to find out the order of activity:


select
        distinct t1.n1
from
        t1
where 
        t1.n1 >= 500
and     t1.n2 > (select avg(t2.n2) from t2)
and     t1.n3 > (select avg(t3.n3) from t3)
;

I’m going to be using 12.1.0.2 for this demonstration (lots of changes appear in parallel execution as you move from 11g to 12c, so the version is most significant), with all default settings for the parallel-related parameters.

To get things started, here’s the code to generate the data that I used, and the serial execution plan:


create table t1 nologging as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t2 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t3 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4 ;

-- don't need to gather stats as they are generated on CTAS in 12c.
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   150 (100)|          |
|   1 |  HASH UNIQUE         |      |    24 |   336 |   150   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T1   |    25 |   350 |    51   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 20000 |    97K|    49   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 20000 |    97K|    49   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">=500 AND "T1"."N2"> AND "T1"."N3">))

I’ve pulled the execution plan from memory using dbms_xplan.display_cursor(), which is why the predicate section shows the odd comparison of “t2.n2 > {nothing there}” and “t3.n3 > {nothing there}”. The predicate section produced by a simple call to explain plan would help us to fill in the details:


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."N1">=500 AND "T1"."N2"> (SELECT AVG("T2"."N2") FROM
              "T2" "T2") AND "T1"."N3"> (SELECT AVG("T3"."N3") FROM "T3" "T3"))

We can infer from the predicate section that Oracle executes the t2 subquery first and the t3 subquery second; but there is a slight oddity in the shape of the execution plan if we apply the “first child first / recursive descent” approach to reading the plan. It looks as if operations 3 and 5 are children of line 2 and should be run before child 2. This is the consequence of “subquery pushing”. In fact the subqueries are both filter subqueries and, in principle (and ignoring scalar subquery caching for the moment), they will be run once for each row produced from the tablescan of operation 2.

The activity is as follows:

  • start scanning table t1
  • when the first row appears where n1 >= 500 run the subquery against t2 to derive the average of t2.n2 (10000.5) and check to see if the n2 value of the current t1 row exceeds that value (it doesn’t);
  • continue scanning t1 checking for rows where n1 >= 500 and n2 >= 10000.5 (no need to rerun the subquery thanks to scalar subquery caching),
  • on finding the first row matching both predicates run the subquery against t3 to derive the average of t3.n3 (10000.5) and check to see if the n3 value of the current t1 row exceeds that value.
  • continue the scan of t1 checking for rows where n1 >= 500 and n2 >= 10000.5 and n3 >= 10000.5 (again scalar subquery caching means no further executions of either subquery).

We can add weight to the argument that this is the order of activity by flushing the buffer cache and enabled the extended trace before running the query.  Here’s a section of a trace file that demonstrates the point – with no edits beyond the insertion of a few blank lines:


PARSE #140345652175312:c=1000,e=1813,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=115251305,tim=134216063372
EXEC #140345652175312:c=0,e=62,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=115251305,tim=134216063544
WAIT #140345652175312: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=134216063569
WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=131 blocks=1 obj#=133974 tim=134216063627
WAIT #140345652175312: nam='db file scattered read' ela= 315 file#=5 block#=132 blocks=112 obj#=133974 tim=134216064187

WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=515 blocks=1 obj#=133975 tim=134216064438
WAIT #140345652175312: nam='db file scattered read' ela= 115 file#=5 block#=516 blocks=112 obj#=133975 tim=134216064732
WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=628 blocks=12 obj#=133975 tim=134216066046
WAIT #140345652175312: nam='db file scattered read' ela= 123 file#=5 block#=642 blocks=112 obj#=133975 tim=134216066548
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=754 blocks=14 obj#=133975 tim=134216067760
WAIT #140345652175312: nam='db file scattered read' ela= 124 file#=5 block#=770 blocks=112 obj#=133975 tim=134216068153
WAIT #140345652175312: nam='db file scattered read' ela= 8 file#=5 block#=882 blocks=2 obj#=133975 tim=134216069677

WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=244 blocks=12 obj#=133974 tim=134216070049
WAIT #140345652175312: nam='db file scattered read' ela= 127 file#=5 block#=258 blocks=112 obj#=133974 tim=134216070382

WAIT #140345652175312: nam='db file sequential read' ela= 6 file#=5 block#=899 blocks=1 obj#=133976 tim=134216070812
WAIT #140345652175312: nam='db file scattered read' ela= 41 file#=5 block#=900 blocks=112 obj#=133976 tim=134216071028
WAIT #140345652175312: nam='db file scattered read' ela= 42 file#=5 block#=1012 blocks=12 obj#=133976 tim=134216072956
WAIT #140345652175312: nam='db file scattered read' ela= 215 file#=5 block#=1026 blocks=112 obj#=133976 tim=134216073642
WAIT #140345652175312: nam='db file scattered read' ela= 29 file#=5 block#=1138 blocks=14 obj#=133976 tim=134216076334
WAIT #140345652175312: nam='db file scattered read' ela= 154 file#=5 block#=1154 blocks=112 obj#=133976 tim=134216076825
WAIT #140345652175312: nam='db file scattered read' ela= 14 file#=5 block#=1266 blocks=2 obj#=133976 tim=134216081673

WAIT #140345652175312: nam='db file scattered read' ela= 24 file#=5 block#=370 blocks=14 obj#=133974 tim=134216082815
WAIT #140345652175312: nam='db file scattered read' ela= 144 file#=5 block#=386 blocks=112 obj#=133974 tim=134216083232
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=498 blocks=12 obj#=133974 tim=134216084494

FETCH #140345652175312:c=16998,e=21096,p=1105,cr=1111,cu=0,mis=0,r=1,dep=0,og=1,plh=115251305,tim=134216084683

Object 133974 is table t1, object 133975 is table t2, and object 133976 is table t3.

As you can see, we start by scanning t1, then we scan t2 once, then we scan more of t1, then we scan t3 once, then we finish off scanning t1.

We could chase the detail a little further, of course – we could check the session statistics to make sure that we haven’t been using the buffer cache to scan t2 and t3 repeatedly, or we could enable rowsource execution statistics to report the number of times each operation started, or we could check the STAT# lines dumped a little later in the trace files and infer that the t2 and t3 tablescans happened just once each.

At this point I’ll mention my mistake: when I originally read Yasin’s version of the parallel equivalent of this plan my first thought was that the subquery sections of the plan operate from the bottom upwards (i.e. another one of those examples where “first child first” doesn’t apply) whereas Yasin described them as running top down. I was wrong – I was thinking of the case where the optimizer analyses the query from the bottom up, but it still produces a plan that then runs from the top down – but I didn’t realize my mistake until after I’d been poking around with the parallel version of the query and had made a surprising discovery.

Parallel Execution

So what happens when we manipulate this query into parallelism ? To make it a little easier to see some of the details I’ve added hints to each query block to use a different degree of parallelism; and then, because it all happens so fast, I’ve introduced a function call to slow down the progress of the query. Here’s the function I’ve created (you’ll find it elsewhere on blog) and the final query I started testing with:


create or replace function wait_row (
        i_secs  number,
        i_return        number
) return number
parallel_enable
is
begin
        dbms_lock.sleep(i_secs);
        return i_return;
end;
/


select
        /*+ parallel(t1,5) tracking */ 
        distinct t1.n1
from
        t1
where   
        wait_row(0.01,t1.n2) > (select /*+ parallel(t2,3) */ avg(wait_row(0.01,t2.n2)) from t2)
and     wait_row(0.01,t1.n3) > (select /*+ parallel(t3,2) */ avg(wait_row(0.01,t3.n3)) from t3)
and     t1.n1 >= 1000
--      and     wait_row(0.01, t1.n1) >= 500
;

The wait_row() function simply waits for the number of seconds given in the first parameter then returns the (numeric) value of the second parameter. It calls the dbms_lock.sleep() function which has a granularity of 1/100th of a second, and since I’m planning to use it in a parallel query I’ve included the parallel_enable declaration.

Since the function has not been declared deterministic, or assigned to the pl/sql result cache, and since there are 20,000 rows in the t2 table the time to run the t2 subquery will be roughly 20,000 * 0.01 seconds, which (ignoring parallelism) would be about 200 seconds. I’ve actually hinted DOP = 3 for that subquery so I hope it will complete in about 70 seconds, and the subquery against t3 with hinted DOP = 2 should complete in about 100 seconds.

You’ll notice I have two possible predicates against t1 – the one shown uncommented here doesn’t use call the wait_row() function, the other does. Here’s an interesting point, though, about the example that doesn’t call wait_row() – the simple predicate is going to eliminate ALL the data from t1, there are no rows where n1 >= 1000 and that’s an important point to remember.

Here’s the execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |    57 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ30001 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   4 |     PX RECEIVE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   5 |      PX SEND HASH              | :TQ30000 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | P->P | HASH       |
|   6 |       HASH UNIQUE              |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR       |          |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL      | T1       |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWP |            |
|   9 |          SORT AGGREGATE        |          |     1 |     5 |            |          |  Q3,00 | PCWP |            |
|  10 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  11 |            PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|  12 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|  13 |              PX BLOCK ITERATOR |          | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 14 |               TABLE ACCESS FULL| T2       | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  15 |          SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|  16 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  17 |            PX SEND QC (RANDOM) | :TQ20000 |     1 |     5 |            |          |  Q2,00 | P->S | QC (RAND)  |
|  18 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q2,00 | PCWP |            |
|  19 |              PX BLOCK ITERATOR |          | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 20 |               TABLE ACCESS FULL| T3       | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N1">=1000 AND "WAIT_ROW"(.01,"T1"."N2")> AND "WAIT_ROW"(.01,"T1"."N3")>))
  14 - access(:Z>=:Z AND :Z<=:Z) 20 - access(:Z>=:Z AND :Z<=:Z)

The odd “:Z>=:Z AND :Z <=:Z” predicate is an artefact of rowid-range scans in parallel tablescans, and we’ve seen the “disappearing subquery” problem with predicates pulled from memory so that’s ignorable. The interesting point is that the “T1.N1 >= 1000” predicate will eliminate all the data from t1 very quickly – so why did it take roughly 3 minutes and 30 seconds to run the query ?

Quick Answer – Oracle ran the two subqueries before it started the t1 tablescan !

Slow Answer – I need to start by filling in a lot of background information.

Notice how we have multiple DFO Trees in this execution plan. In the TQ column these are captured as (Q1,xx) (Q2,xx) and (Q3,xx).  In previous blogs I’ve explained how you can follow the order of activity by following the table queues (effectively the “xx” – also known as virtual tables, also known as Data Flow Operations) in order, but I haven’t explained how you deal with multiple DFO trees. It would be nice to think that DFO Trees implicitly told you the order of operation – unfortunately that’s not true (at least in current versions of Oracle), and this query demonstrates the point.

Table Queues, DFOs (data flow operations), and DFO Trees

The first of Yasin Baskan’s two articles describes the terminology to use when describing the movement of data through a parallel execution plan, but for the sake of familiarity I’ll walk through the plan above using the appropriate terms.  (Note: once upon a time the term “DFO tree” didn’t exist and the term DFO was used to describe what is now called a DFO tree, and a number of articles I’ve written in the past echo this out-dated usage).

Lines 1 – 8 show a DFO tree consisting of two DFOs – the tree is labelled as Q3. The first DFO in the tree (DFO 0) is described in lines 5 – 8 where one set of parallel execution slaves scans real table t1 and sends a data set to the other set parallel execution slaves by writing into a virtual table (:TQ30000); the second DFO in the tree (DFO 1) is described in lines 1 – 4 where a set of parallel execution slaves reads from virtual table :TQ30000, does some work and sends a data set to the query co-ordinator by writing to virtual table :TQ30001.

Lines 9 – 14 show a second DFO tree consisting of a single DFO – this tree is labelled Q1;

Lines 15 – 20 show us a third DFO tree, labelled Q2, also consisting of a single DFO.

Essentially (as described by Yasin Baskan) the focal point of a single DFO (data flow operation) is a PX SEND, and the top of a DFO tree is a PX Coordinator.

The key to performance problems and interpreting tricky parallel execution plans is the fact that in the extreme case every DFO tree may operate at a different degree of parallelism, and all the DFO trees may be operating concurrently – an unlucky execution plan may acquire far more parallel execution slaves than you expect.

So what’s the order of execution in this query, how do we find out, how many parallel query slaves will we start (and stop) as the query runs, and what’s the maximum number of parallel query slaves that we will be holding concurrently ? There are lots of places you could look:

  • v$pq_tqstat might give you some idea of processes used and data moved – after you’ve run the query, but it’s a bit broken for multiple DFO trees.
  • v$px_sesstat will give you some information about the work done by currently live slave processes if you can query it while your query is still running (“set pause on” may help).
  • v$px_session joined to v$session_event will give you some information about the time lost to waits for currently live slave processes if you can query them while your query is still running (it would be nice if Oracle were to introduce a “v$px_sesevent”)
  • The SQL Monitor screen in OEM (or the dbms_sql_tune.report_sql_monitor() procedure is fantastic (but a little broken) if you’ve paid the licences for the diagnostic and performance packs.

Here’s part of the text output from a call to dbms_sql_tune.report_sql_monitor() for a slightly different query – one that doesn’t include any predicate at all on the t1 table:


SQL Plan Monitoring Details (Plan Hash Value=3828285674)
==================================================================================================================================================
| Id |            Operation             |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                  |          | (Estim) |      | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  1 |   PX COORDINATOR                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  2 |    PX SEND QC (RANDOM)           | :TQ30001 |      49 |   60 |         1 |   +256 |     5 |     1000 |       |          |                 |
|  3 |     HASH UNIQUE                  |          |      49 |   60 |         1 |   +256 |     5 |     1000 |    7M |          |                 |
|  4 |      PX RECEIVE                  |          |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  5 |       PX SEND HASH               | :TQ30000 |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  6 |        HASH UNIQUE               |          |      49 |   60 |        68 |   +189 |     5 |     4506 |    7M |          |                 |
|  7 |         PX BLOCK ITERATOR        |          |      50 |   14 |        68 |   +189 |     5 |    10000 |       |          |                 |
|  8 |          TABLE ACCESS FULL       | T1       |      50 |   14 |        68 |   +189 |    63 |    10000 |       |          |                 |
|  9 |           SORT AGGREGATE         |          |       1 |      |         1 |   +187 |     1 |        1 |       |          |                 |
| 10 |            PX COORDINATOR        |          |         |      |         1 |   +187 |     4 |        3 |       |          |                 |
| 11 |             PX SEND QC (RANDOM)  | :TQ10000 |       1 |      |         3 |   +185 |     3 |        3 |       |          |                 |
| 12 |              SORT AGGREGATE      |          |       1 |      |        74 |   +114 |     3 |        3 |       |          |                 |
| 13 |               PX BLOCK ITERATOR  |          |   20000 |   18 |        74 |   +114 |     3 |    20000 |       |          |                 |
| 14 |                TABLE ACCESS FULL | T2       |   20000 |   18 |        74 |   +114 |    37 |    20000 |       |          |                 |
| 15 |           SORT AGGREGATE         |          |       1 |      |         1 |   +118 |     1 |        1 |       |          |                 |
| 16 |            PX COORDINATOR        |          |         |      |         1 |   +118 |     3 |        2 |       |          |                 |
| 17 |             PX SEND QC (RANDOM)  | :TQ20000 |       1 |      |         2 |   +111 |     2 |        2 |       |          |                 |
| 18 |              SORT AGGREGATE      |          |       1 |      |       111 |     +2 |     2 |        2 |       |          |                 |
| 19 |               PX BLOCK ITERATOR  |          |   20000 |   27 |       111 |     +2 |     2 |    20000 |       |          |                 |
| 20 |                TABLE ACCESS FULL | T3       |   20000 |   27 |       111 |     +2 |    26 |    20000 |       |          |                 |
==================================================================================================================================================

There’s a little inaccuracy in timings – in the order of the odd second or two – probably because of the granularity used, but the columns “Start Active” and “Time Active (s)” are, nevertheless, very revealing. The first one tells you when a line in the plan first did some work, the second one tells you how long that line has been active. As I walk through the plan remember that the timing granularity is fairly crude, so the numbers don’t quite add up as nicely as we might like.

If you check lines 17 – 20 you can see that DFO tree 2 (:TQ2xxxx) was the first thing to go active with its PX slaves starting at time t = 2 seconds and running for 111 seconds to do that scan and initial aggregation; the PX send line started at time t = 111 seconds and ran for 2 seconds to pass the data from the slaves to the coordinator.

There’s a little glitch (or hand waving moment) around lines15 and 16 where the PX Coordinator receives and aggregates data from the PX slaves – the co-ordinator seems to start doing this several seconds later than it should.

In lines 11 – 14 you can see that DFO tree 1 (:TQ1xxxx) was the second thing to go active with its PX slaves starting at time t = 114 (i.e. just after DFO tree 2 completes);  the full scan and initial aggregate ran for 74 seconds (taking us to time t = 188). The PX send started (a little early, this time) at t = 185 and ran for 3 seconds, with the co-ordinator starting at t = 187 and taking one second to receive and do the final aggregation.

In lines 2 – 8 we can see DFO tree 3 (:TQ3xxxx) starting up, but this DFO tree includes two DFOs. One set of slaves scans table t1 applying the predicates to filter the data then does the initial hash distinct before hash distributing the intermediate result to the next set of slaves that finish off the distinct aggregation.

When DFO :TQ30000 starts at t = 189, the previous DFO trees have completed and the PX coordinator has the results of the two subqueries which it passes to the slave set which can now scan and check for: “n1 >= 500 and  n2 > {derived constant} and n3 > {derived constant}”. The scan and initial aggregation takes 68 seconds completing at t = 255, and at t = 256 the results are distributed to the second set of PX slaves. The second set of slaves has virtually nothing to do and reports only 1 second of execution time before passing the data (DFO :TQ30001) to the query coordinator which does the final aggregation and report.

A key point to note is that this timing information tells us that (a) the subqueries both completed before the tablescan of t1 started, and – perhaps more importantly – that we cannot rely on the TQXnnnn numbers to tell us the order of operation of the DFO trees. In this example DFO tree 2 ran first, DFO tree 1 ran second, and DFO tree 3 ran third – some queries might happen to show the names and the order in synch, that would be a coincidence, not a deliberate design feature.

A little reminder, though – even though we cannot rely on the DFO tree names to tell us the ordering in which the DFO trees operate, when we look at the data flow operations inside the trees (i.e. the last few digits of a TQ name) the order of operation withing a given tree matches the sequence indicated by the TQ name.

Counting Slaves.

There’s more in the Monitor report that can help us understand the workload – and in this case I’ve got a lucky coincidence to help, and a little surprise to go with it.  For parallel queries the report includes a summary of parallel execution activity, and this is what I got from my example:

Parallel Execution Details (DOP=5 , Servers Allocated=15)
===========================================================================================================================
|      Name      | Type  | Group# | Server# | Elapsed |   Cpu   | Concurrency | PL/SQL  |  Other   | Buffer | Wait Events |
|                |       |        |         | Time(s) | Time(s) |  Waits(s)   | Time(s) | Waits(s) |  Gets  | (sample #)  |
===========================================================================================================================
| PX Coordinator | QC    |        |         |    0.03 |    0.02 |             |         |     0.02 |     15 |             |
| p00a           | Set 1 |      1 |       1 |    0.27 |    0.27 |             |    0.10 |          |    163 |             |
| p00b           | Set 1 |      1 |       2 |    0.25 |    0.25 |             |    0.09 |          |    156 |             |
| p00c           | Set 1 |      1 |       3 |    0.26 |    0.26 |             |    0.09 |          |    156 |             |

| p00a           | Set 1 |      2 |       1 |    0.43 |    0.43 |             |    0.15 |          |    221 |             |
| p00b           | Set 1 |      2 |       2 |    0.44 |    0.44 |             |    0.15 |          |    221 |             |

| p000           | Set 1 |      3 |       1 |    0.00 |    0.00 |             |         |          |        |             |
| p001           | Set 1 |      3 |       2 |    0.00 |    0.00 |             |         |          |        |             |
| p002           | Set 1 |      3 |       3 |    0.01 |    0.01 |             |         |          |        |             |
| p003           | Set 1 |      3 |       4 |    0.01 |    0.01 |             |         |          |        |             |
| p004           | Set 1 |      3 |       5 |    0.01 |    0.01 |             |         |          |        |             |
| p005           | Set 2 |      3 |       1 |    0.26 |    0.26 |             |    0.09 |          |    113 |             |
| p006           | Set 2 |      3 |       2 |    0.24 |    0.24 |        0.00 |    0.08 |          |    108 |             |
| p007           | Set 2 |      3 |       3 |    0.25 |    0.25 |             |    0.09 |          |    117 |             |
| p008           | Set 2 |      3 |       4 |    0.25 |    0.25 |             |    0.09 |          |    108 |             |
| p009           | Set 2 |      3 |       5 |    0.25 |    0.25 |        0.00 |    0.09 |          |    117 |             |
===========================================================================================================================

I’ve inserted two blank lines breaking the summary down into three separate groups, which you can identify by the heading “Group#”. In this table we see Group 1 has one slave set of three slaves – which corresponds to the tablescan of t2; Group 2 consists of one slave set of two slaves – which corresponds to the tablescan of t3; and Group 1 has two slave sets of 5 slaves each – which correspond to the tablescan and aggregation of t1. The Group numbers appear to align correctly with the DFO tree numbers.

Another detail that stands out from this list is that slaves p00a and p00b are used in the tablescan of t3, and in the tablescan of t2 (where they are joined by slave p00c). In this example Oracle has reused the slaves from one subquery to run the next. However we can also see that the slaves p000p009 that are used for the tablescan of t1 don’t have names that overlap with the slaves used to scan t2 and t3 – which tells use that we have some (in this case two) DFO trees running concurrently.

Another key point in this article is that not only does a DOP of N mean we could run 2*N slaves concurrently in a parallel query, if we have a query that breaks into a plan that uses multiple DFO trees we might 2 * N * {number of DFO trees) allocated and running concurrently. (And, as we see here, it’s even possible that different DFO trees run with different DOPs, leaving Oracle very confused about what to report as the DOP of the query – dbms_xplan actually reported this one as DOP = 4 !)

A final implementation detail that is not necessarily obvious from this table, but which I can infer because I know the history of what was going on: although, as described above, the tablescan of t3 was the first activity that generated a row source, Oracle started up the 10 slaves for the parallel tablescan and aggregate of t1 before it started the two slaves it needed to scan t3. I can infer this because there were no slave processes active when I started running the query, so it is reasonable to assume that the slave names tell me something about the order in which they were allocated. In general you would not be able to notice this in a busy system where slaves were pre-allocated (parallel_min_servers != 0) and constantly being acquired and released. (I was also able to corroborate some of the details above by looking at v$px_stat and v$session_event for the relevant slave processes shortly after the query started.)

Summary

There are a number of key points this example shows us about complex parallel queries:

  • A single query can produce multiple “DFO trees”
  • Each tree can run at a different degree of parallelism
  • The “tree number” (the X in TQXnnnn) may not reveal the order in which the trees are operated
  • The order of operation of the DFOs within a tree is revealed by the nnnn in TQXnnnn
  • It is not easy to see in a basic execution plan which DFO tree might be executing concurrently and which consecutively
  • PX server processes may be acquired by the query co-ordinator a long time before they actually become active

Footnote:

If you happen to have looked closely at the time information in the Parallel Execution Details you’ll note that the times are total rubbish when compared to the monitoring times. It looks as if the time spent in dbms_lock.sleep() has been ignored (perhaps as an “idle” wait).

 

December 10, 2015

Upgrades

Filed under: 12c,Oracle,Upgrades — Jonathan Lewis @ 8:42 am BST Dec 10,2015

One of the questions that came up at the Optimizer Round Table this year was about minimizing the performance-related** hassle of upgrading from 11g to 12c. Dealing with changes in the optimizer is always an an interesting problem but in 12c this is made more challenging because of the automatic dynamic sampling that can introduce a significant amount of extra work at (hard) parse time, then generate SQL Directives, and finally generate extended (column group) statistics the next time you (or the automatic job) collect stats.

Of course one of the traditional strategies for upgrades (especially if you don’t really have a technical need to, but want your Oracle version to be current) is to set the optimizer_features_enable parameter to the older version and hope that this stops most of the new features from appearing, letting other customers identify the problems and raise the SRs that result in the next patch set.

There’s also the option for reading the manuals and white-papers carefully and identifying the new and enhanced optimizer features that might cause you problems if only you can spot them in time in your code. (The enhancement to subquery unnesting I described in my previous post is such an example – do you know of any code in your application that was carefully engineered to maximise the benefit of scalar subquery caching, if so you should check very carefully that 12c isn’t going to transform away your subquery and wreck your performance.)

As a general strategy, though, it’s worth reviewing your code for four generic features – histograms, multi-column joins, predicates that apply a function to a column, and “out-of-range” issues. The first three are features that are likely to make the 12c optimizer decide to do something “dynamic” that may ultimately give you a better execution plan, but may cost you extra resources or result in increased contention at just the wrong moment; the fourth is one that can disrupt the solution to the multi-column problem.

For the purposes of this note I am going to assume that you don’t use partitioned table or that you have already devised a programmatic method for minimising the work while maximising the effectiveness of their stats (and that your programs unlock and relock the stats so that they aren’t re-gathered by accident).

Histograms

I’m assuming by now that you’re using the approximate_ndv method with auto_sample_size to gather object statistics. In 11g this still means that any histograms gathered by Oracle will use a (surprisingly small) sample and even if you are gathering the histograms at the right time they could introduce a lot of instability to execution plans.

In 12c, Oracle can use an approximate method to create a frequency or Top-N frequency histogram while doing the scan that generates all the other stats – so if you have columns where you know the number of distinct values is relatively small (default up to 254, though technically up to 2048) or that almost all the data (except, broadly speaking, one bucket’s worth) comes from a relatively small number of distinct values then, on the upgrade, you should be safe creating a frequency or Top-N histogram once on those columns and setting the method_opt to ‘for all columns size repeat’.

Oracle 12c still uses a small sample size, though, when it detects the need for a “hybrid” histogram – which is the thing that replaces the height-balanced histogram. This means you may still need to write code to generate the histogram rather than allowing Oracle to collect it. It’s the classic compromise problem – a good histogram will be expensive to gather, a cheaply gathered histogram could easily be inaccurate and unstable. In this case you may want to create a Top-N frequency histogram that is a good model of the data, rather than trying to create a correctly structured hybrid. For tables with such columns you need code that can detect changes to the stats and recreates the histogram. In fact, even in cases where Oracle could gather a (frequency) histogram efficiently and accurately, you may still want to detect changes to stats for some columns and write code to create a histogram that describes the data the way it will look when the users are accessing it rather than the way the data looked when Oracle created the histogram (at 2:00 a.m. or whenever the automatic stats collection job hit it).

Broad strategy: set the global preference to method_opt => ‘for all columns size 1’, set a few table preferences to ‘for all columns size repeat’, and have code that checked the last_analyzed date on a few tables and recreates the histogram you want immediately after the stats have been gathered.

Note: as a general guideline, if you know the database needs histograms to produce the most appropriate execution plans this means the front-end code has to co-operate with the database and not conceal useful information in SQL that uses bind variables that hide the significance of special values or ranges – but that’s a topic for another blog note (if I haven’t written anything about it before).

Virtual Columns and Column Groups

In 12c column expressions (where the optimizer guesses) and multi-column predicates (where the optimizer assumes independence) are key triggers to dynamic sampling and SQL Directives. To engage in a pre-emptive strike the ideal is to know the code and know the data. Search for “important” SQL that applies functions (like trunc(date_col)) to columns in predicates or does multi-column (equality) joins or uses filter predicates that reference multiple columns from the same table.

If the execution plans you find for these statements produce estimates which are clearly far from reality then you may need to take pre-emptive action, even (or, perhaps, especially) if the resulting plans look good. The optimizer may (for example) know that it is guessing when it says to itself: “trunc(delivery_date) = trunc(sysdate) will return 1% of the data” and do several things such as create an adaptive execution plan and switch plans in mid-execution, do a load of dynamic sampling to discover the guess was wrong, dump an SQL Plan Directive into the data dictionary that triggers subsequent dynamic sampling and then create a virtual column through the extended stats mechanism.

In cases like this you may want to create and document virtual columns explicitly, and create column groups explicitly before Oracle does its dynamic thing. In the case of the column groups, you only have to worry about equality predicates, a column group cannot be used with range-based predicates. Remember you are only allowed a maximum of 20 column groups per table (although there is, effectively, no limit on the number of virtual columns) so you need to make your choices of columns groups before Oracle goes onto auto-pilot.

Remember, there’s a trap waiting here if you don’t make this effort.  You might run a test suite a couple of times and fix everything that seems to be necessary without realising that in the course of your testing Oracle has created a couple of hundred column groups and virtual columns. Then, when everything is working nicely, you upgrade the production system and all hell breaks loose until the production system has generated the same (we hope) set of virtual columns and column groups. Since there is a limit of 20 column groups per table (or column groups / 10 if you have more than 200 columns) you could get unlucky and find that Oracle wants to create 30 columns groups and the 20 that happen to be created first on production may not be the ones it created on the test system.

Out of Range

When a predicate specifies a value that is above the known high value or below the known low value for a column, the optimizer uses a “linear decay” method for cardinality calculations that is based on how far out of the range the value is (as a percentage of the range). In many cases this makes introduces a slowly increasing error (though for some predicates it can create an immediate catastrophe). There is a special consideration, though,  that means you need to be very careful about time-based or sequence-based columns that can go out of range: if you have a column group that includes the column then the optimizer stops using the column group the moment you go out of range. This is probably a rare condition to check for because you probably use range-based predicates on such columns (e.g. “order_date < trunc(sysdate) – 7”) and column groups are not applicable to range-based predicates anyway, but you do need to be aware that in special cases an execution plan can go bad the moment you overshoot the high value.

For columns like these you need a strategy that allows you to set a high (or low) value that reflects a future high value for the column – even to the extent (for example) of running a scheduler task that calls dbms_stats.set_column_stats() every hour to push the high value forward a little bit.

Update [11th Dec 2015]

Since this note is about preemptive strategies for reducing the number of problems you run into with 12c, it’s worth reminding you about the new “table preference” which you can use to get more realistic values for the clustering_factor on indexes. For example:


begin
        dbms_stats.set_table_prefs(user, 't1', 'table_cached_blocks', 16);
        dbms_stats.gather_index_stats(user, 't1_i1');
end;
/

Setting the table_cached_blocks preference for a table means Oracle will remember a history of recently “visited” table blocks as it walks an index to generate the clustering_factor. Although there is no official word, I think that 16 is a good default value for this setting in single instance Oracle, and 16 * {number of instances} might be appropriate for RAC.

The fragment above is just for demo purposes – you wouldn’t set the value every time you gather stats on an index, it’s a one-off exercise for each table – though you could change it for all existing tables in a schema with a single call to dbms_stats.set_schema_prefs().

The reason why this preference becomes more important in 12c is that the more expensive an execution path is the longer the optimizer will work to find a better path – and bad settings for the clustering_factor make otherwise good paths looks expensive and encourage the optimizer to try all sorts of before deciding on the final path.

Wrap-up

This is just a brief note to pick up a few key points that you can identify as potential threats before you start upgrading. Obviously it will also be of some help after you’ve upgraded (and are still in test mode) to help you explain some of the changes to execution plans and increases in parse times that you are likely to see. The benefit of pre-emptive action, though, is that you may be able to minimise the number of hidden actions (creation of directives and extended stats) that Oracle takes during your testing. It used to be hard enough doing an upgrade in the past, going up to 12c it can be even harder because Oracle may have changed your test database behind your back when you weren’t looking leaving you running a production system that doesn’t match the system you’ve tested.

** Footnote: if you went to Tim Hall’s presentation you’ll know that part of your upgrade process should deal with the move to the multi-tenant architecture, even if you adopt just the single-PDB licence.

 

December 9, 2015

12c Scalar Subquery

Filed under: 12c,Execution plans,Oracle — Jonathan Lewis @ 2:25 pm BST Dec 9,2015

Every version of the optimizer enhances existing mechanisms and introduces new features and 12c has introduced some of the most sophisticated transformation to date; in this note I want to demonstrate an enhancement to subquery unnesting that could give a significant performance boost to a certain query pattern but which might, unfortunately, result in worse performance.

Historically subquery unnesting turned subqueries (correlated or not) in the where clause into joins. In 12c subquery unnesting can also turn scalar subqueries in the select list into joins – we’ll discuss why this could be a good thing but might occasionally be a bad thing later on in the article, but let’s start with a test case.

Sample data.

In my demonstration I’m going to use three tables which, for convenience, are three clones of the same data.

create table t1
as
with generator as (
	select
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum				id,	
	mod(rownum-1,100)		mod100,
	trunc((rownum - 1)/100)		trunc100,
	rownum				n1,
	rownum				n2,
	lpad(rownum,6,'0')		vc1,
	rpad('x',100)			padding
from
	generator
where
	rownum <= 10000
;

create table t2 as select * from t1;
create table t3 as select * from t1;

create index t1_i1 on t1(id);
create index t2_i1 on t2(id,mod100);
create index t3_i1 on t3(id,trunc100);

begin
	dbms_stats.gather_table_stats(user,'t1');
	dbms_stats.gather_table_stats(user,'t2');
	dbms_stats.gather_table_stats(user,'t3');
end;
/

I’ll be examining a query against t1 that includes two correlated scalar subqueries in the select list that reference one each of t2 and t3:


explain plan for
select
	/*+
		qb_name(main)
	*/
	n1, n2,
	(
		select	/*+ qb_name(sq1) */
			max(mod100)
		from	t2
		where	t2.id = t1.id
	) new_n1,
	(
		select	/*+ qb_name(sq2) */
			max(trunc100)
		from	t3
		where	t3.id = t1.id
	) new_n2
from
	t1
where
	t1.id between 101 and 200
;

select * from table(dbms_xplan.display);

11g Plan

This is the execution plan you might expect to see from 11g – in my case, with my system stats etc. and running 11.2.0.4:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |   101 |  1212 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)| T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)| T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID | T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN           | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

As you can see, the operations for the subqueries appear first in the plan (lines 1-3, and 4-6), with the operations for the main query appearing as the last section of the plan (lines 7-8). You might note that the total cost of the plan doesn’t cater for the cost of the subqueries – technically we might expect to see the optimizer producing a cost of something like 408 on the basis that it’s going to run each subquery an estimated 101 times and each subquery has a cost of 2, and the 101 rows are generated from a query with a cost of 4 giving: 4 + 101 * (2 + 2) = 408.

12c Plan

On the upgrade to 12c, the same code produces the following plan:


--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  1 |  HASH JOIN OUTER                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  2 |   HASH JOIN OUTER                     |          |   101 |  3838 |     6   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |   101 |  1212 |     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_I1    |   101 |       |     2   (0)| 00:00:01 |
|   5 |    VIEW                               | VW_SSQ_2 |   101 |  2626 |     2   (0)| 00:00:01 |
|   6 |     HASH GROUP BY                     |          |   101 |   707 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                 | T2_I1    |   101 |   707 |     2   (0)| 00:00:01 |
|   8 |   VIEW                                | VW_SSQ_1 |   101 |  2626 |     2   (0)| 00:00:01 |
|   9 |    HASH GROUP BY                      |          |   101 |   707 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN                  | T3_I1    |   101 |   707 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("ITEM_1"(+)="T1"."ID")
   2 - access("ITEM_2"(+)="T1"."ID")
   4 - access("T1"."ID">=101 AND "T1"."ID"<=200)
   7 - access("T2"."ID">=101 AND "T2"."ID"<=200)
  10 - access("T3"."ID">=101 AND "T3"."ID"<=200)

As you can see, the separate plans for the subqueries have disappeared and the plan is showing a three-table join (with two outer joins, and two of the “tables” being the non-mergeable view vw_ssq_2 and vw_ssq_1).

There are several details to pick up in this plan (apart from the unnesting). The cost is only 8 – but in this case it isn’t the effect of the optimizer “ignoring” the cost of the subqueries, it’s the optimizer correctly working out the cost of the unnested subqueries with joins. The cost happens to be low in this case because the optimizer has used transitive closure to pass the predicate from the driving query into the subqueries – so we need only do a couple of short index range scans to get all the data we need in the unnested subqueries.

The outer joins are needed because it is valid for the original scalar subquery mechanism to return no data for a subquery and still report a row (with nulls) for t1. If the rewrite didn’t introduce the outer join then t1 rows for which there were no matching t2 or t3 rows would disappear from the result set.

Threats and workarounds

In this (lightweight) example it looks as if this transformation is a good idea, but it’s always possible that the optimizer might choose to do this when it’s a bad idea. In fact, a quick check of the optimizer trace (10053) suggests that this is an uncosted transformation that will take place “because it can”. Here are six highly suggestive consecutive lines from the trace file:


SU: Unnesting query blocks in query block MAIN (#1) that are valid to unnest.
Subquery Unnesting on query block MAIN (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block MAIN (#1).
SU:   Unnesting  scalar subquery query block SQ2 (#2)
Registered qb: SEL$2E540226 0x50b1a950 (SUBQ INTO VIEW FOR COMPLEX UNNEST SQ2)

Even if this transformation is cost-based rather than heuristic it’s always possible for the optimizer to make a very poor estimate of cost and do the wrong thing. Fortunately it’s possible to block the unnesting with the “traditional” /*+ no_unnest */ hint:


select
	/*+
		qb_name(main)
		no_unnest(@sq1)
		no_unnest(@sq2)
	*/
	n1, n2, 
...

With these hints in place the execution plan changes back to the 11g form – though there is a notable change in the estimated final cost of the query:


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   101 |  1212 |   206   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)       | T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)       | T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN                  | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

It’s a little surprising that the final cost is (4 + 202) rather than the  (4 + 404) that we calculated earlier, but a few variations on the test suggest that the optimizer is using half the cost of each of the scalar subqueries in the final cost estimate – perhaps as a nod in the direction of scalar subquery caching.

As always it is important to remember that you cannot look at a query like this and know which plan is going to perform better unless you know the data content, the pattern of data distribution, and the probably effect of scalar subquery caching. In some cases it may be that an extremely “lucky” data pattern will mean that a scalar subquery will run a very small number of times while, with the same data arranged in a different order the benefit of scalar subquery caching will disappear and the unnesting approach may be much better.

It is a convenience to be able to recognize that there is a new optimizer feature in 12c that might give you a performance boost but might hit you with an unlucky performance penalty (especially in cases, perhaps, where you have previously engineered your code very carefully to take advantage of scalar subquery caching) and know that there is a workaround available. In your search for pre-empting production performance problems, this code structure is another pattern to scan for in your source code.

December 3, 2015

Five Hints

Filed under: Hints,Oracle,Uncategorized — Jonathan Lewis @ 7:40 am BST Dec 3,2015

This is the content of a “whitepaper” I wrote for my presentation “Five Hints for Optimising SQL” at the recent DOAG conference.

Introduction

Adding hints to production code is a practice to be avoided if possible, though it’s easy to make the case for emergency patching, and hinting is also useful as the basis of a method of generating SQL Plan Baselines. However, notwithstanding (and sometimes because of) the continuing enhancements to the optimizer, there are cases where the only sensible option for dealing with a problem statement is to constrain the broad brush strategy that the optimizer can take in a way that allows it to find a reasonable execution plan in a reasonable time.

This note describes in some detail the use and effects of five of the “classic” hints that I believe are reasonable strategic options to redirect the optimizer when it doesn’t choose a path that you consider to be the most appropriate choice.

The Big Five

At the time of writing, a query against the view v$sql_hint on Oracle 12.1.0.2 reports 332 hints – but there are very few which we should really consider as safe for production code, and it’s best to view even those as nothing more than a medium-term tool to stabilise performance until the optimizer is able to do a better job with our SQL.

The handful of hints that I tend to rely on for solving problems is basically a set of what I call “structural” queries though in recent years it has become appropriate to label them as “query block” hints. These are hints that give the optimizer some idea of the shape of the best plan without trying to enforce every detail of how it should finalize the plan. The hints (with their negatives where appropriate) are:

  • Unnest / no_unnest — Whether or not to unnest subqueries
  • Push_subq / no_push_subq — When to handle a subquery that has not been unnested
  • Merge / no_merge — Whether to use complex view merging
  • Push_pred / no_push_pred — What to do with join predicates to non-merged views
  • Driving_site — Where to execute a distributed query

Inevitably there are a few other hints that can be very helpful, but a key point I want to stress is that for production code I avoid what I call “micro-management” hints (such as use_nl(), index_rs_asc()) – attempts to control the optimizer’s behaviour to the last little detail; it is very easy to produce massive instability in performance once you start down the path of micro-managing your execution plans, so it’s better not to try.

The rest of this document will be devoted to describing and give examples of these hints.

The Optimizer’s Strategy

You can think of the optimizer as working on a “unit of optimization” which consists of nothing more than a simple statement of the form:

select  list of columns
from    list of tables
where   list of simple predicates

To deal with a more complex query the optimizer stitches together a small number (reduced, if it had its way, to just one) of such simple blocks. So one of the first steps taken by the optimizer aims to transform your initial query into a this simple form. Consider this example:


select
        t1.*,v1.*,t4.*
from
        t1,
        (
        select
                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                select
                        null
                from    t5
                where   t5.id = t2.n1
                )
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        t4
where
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1
;

We have an inline view consisting of a two-table join with a subquery correlated to the first table, and from our perspective we have a “simple join” of three objects – t1, v1, and t4. Before it does anything else the optimizer will try to transform this into a straight-line five-table join so that it can join all the tables in order one after the other. As part of that process it will generally attempt to eliminate subqueries in a processing known as unnesting.

Looking at the query as it has been presented author of the code may have been thinking (symbolically) of the underlying problem as:

  • ( ( t1, ( ( t2, subquery t5 ), t3 ) ), t4 )

Take t1, join to it the result of applying the subquery to t2 and joining t3, then join t4.

The optimizer may decide to transform to produce the following:

  • ( ( ( ( t1, t2 ), t3 ), {unnested t5} ), t4 )

Join t2 to t1, join t3 to the result, join the transformed t5 to the result, then join t4 to the result.

If I decide that the original layout demonstrates the appropriate mechanism, my target is to supply the optimizer with just enough hints to lock it into the order and strategy shown, without trying to dictate every little detail of the plan. My hints would look like this:

select
        /*+
            qb_name(main) push_pred(v1@main)
            no_merge(@inline)
            no_unnest(@subq1) push_subq(@subq1)
        */
        t1.*,v1.*,t4.*
from
        t1,
        (
        select  /*+ qb_name(inline) */
                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                select  /*+ qb_name(subq1) */
                        null
                from    t5
                where   t5.id = t2.n1
                )
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        t4
where
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1
;

I’ve labelled the three separate select clauses with a query block name (qb_name() hint), told the optimizer that the query block named “inline” should be considered as a separately optimized block (no_merge(@inline)), and the subquery inside that block called “subq1” should be treated as a filter subquery (no_unnest(@subq1)) and applied as early as possible (push_subq(@subq1)).

In some circumstances I might use one more hint to tell the optimizer to consider a single join order: t1, v1, t4 using the hint /*+ leading(t1 v1 t4) */; but in this case I’ve told the optimizer to push the join predicate v1.n1 = t1.n1 inside the view (push_pred(@inline)) – which will make the optimizer do a nested loop from table t1 to view v1, resolving the view for each row it selects from t1.

Having captured 4 of the “big 5” hints in one sample statement, I’ll now comments on each of them (and the final driving_site() hint separately).

Merge / No_merge

This pair of hints apply particularly to “complex view merging”, but can be used to “isolate” sections of a query, forcing the optimizer to break one large query into a number of smaller (hence easier) sections. I see two main uses for the hints (and particularly the no_merge option) – one is to help the optimizer get started when handling a query with a large number of table, the other is simply to block a strategy that the optimizer sometimes chooses when it is a bad move.

Consider, in the first case, a query involving 20 tables, with several subqueries. With such a long list it is very easy for the optimizer to pick a very bad starting join order and never reach a good join order; moreover, because of the multiplicative way in which the optimizer estimates selectivity it’s very easy for the optimizer to decide after a few tables that the cardinality of the join so far is so small that it doesn’t really matter which table to access next. In cases like this we might start by writing a simpler query joining the first four of five tables that we know to be the key to the whole query – once we have got the core of the query working efficiently we can “wrap” it into an inline view with a no_merge hint, and then join the rest of the tables to it, with some confidence that the optimizer will start well and that it can’t go far wrong with the remainder of the tables so, for example

select  ...
from    t1, t2, t3, ..., t20
where   {various predicates}
and     exists {correlated subquery1}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}

Might become

with v1 as (
        select  /*+ no_merge cardinality(2000) */ ...
        from    t1, t2, t3, t4, t5
        where   {various predicates{
        and     exists {correlated subquery1}
)
select  ...
from    v1, t6, t7, ..., t20
where   {join conditions to v1}
and     {other join conditions}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}
;

I’ve written the example up using subquery factoring; in earlier versions of Oracle the relevant piece of code would have been written as an inline view, but the “with” clause can help to tidy the SQL up and make it easier to see the logic of what’s being done – provided the practice isn’t taken to such extremes that the final query consists of large number of very small factored subqueries.

I’ve included a cardinality() hint in the factored subquery – it’s not fully documented, and it’s not commonly realised that it can be applied to a query block rather than to a table or list of tables. This query block usage is probably the safest example of using the hint – the table-related usage is badly understood and prone to mis-use.

As an example of blocking a badly selected transformation, consider the following query (where I’ve already included qb_name() hints to name the two separate query blocks):

select  /*+ qb_name(main) */
        t1.vc1, avg_val_t1
from    t1,
        (
        Select  /*+ qb_name(inline) */
                id_parent, avg(val) avg_val_t1 
        from	t2
        group by
                id_parent
        ) v1
where	
        t1.vc2 = 'XYZ'
and     v1.id_parent = t1.id_parent
;

There are two basic strategies the optimizer could use to optimize this query, and the choice would depend on its estimate of how much data it had to handle . Whichever choice it makes we might, depending on our better understanding of the data, want it to choose the alternative (without rewriting the query beyond hinting it).

One option is for Oracle to execute the inline view to generate the aggregate data v1 then join the result to t1; the other is to join t2 (the view’s underlying table) to t1 and then work out an aggregation of the join that would give the same result.

If I want to “join then aggregate” I would use the merge hint, if I wanted to “aggregate then join” I would use the no_merge hint. There are three different ways in which I could introduce the hint:

  • In the inline view itself I could simply add the hint “merge”
  • In the main query I could reference the view by view name “no_merge(v1)”
  • In the main query I could reference the inline query block name “no_merge(@inline)”

Note particularly the “@” symbol that I use to point a hint at a query block; and note that this was not needed when I reference the view name. (The reference by query block name is the more modern, preferred strategy.)

Push_pred / No_push_pred

Once we start dealing with non-mergeable views and have to join to them there are two strategies that we could use for the join; the first is (nominally) to create the entire data set for the view and then use that in a merge join or hash join based on the join predicate, or we could “push a join predicate” into the view definition – in other words for each join value we could add a simple filter predicate to the view definition and derive the view result based on that predicate. For example, if we create a database view called avg_val_view with a definition matching the inline view we used in the previous example, we might see one of two possible execution plans for the following query:

select  t1.vc1, avg_val_t1
from    t1, avg_val_view
where   t1.vc2 = 'XYZ'
and     avg_val_view.id_parent = t1.id_parent
;

First – if the view is non-mergeable and we don’t push the predicate, we can see the join predicate appearing at operation 1, as we do a hash join between table t1 and the entire result set from aggregating t2. This may be sensible, but it may be very expensive to create the entire aggregate:

-------------------------------------------------------------------
| Id | Operation            | Name         | Rows  | Bytes | Cost  |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT     |              |     1 |    95 |    27 |
|* 1 |  HASH JOIN           |              |     1 |    95 |    27 |
|* 2 |   TABLE ACCESS FULL  | T1           |     1 |    69 |     2 |
|  3 |   VIEW               | AVG_VAL_VIEW |    32 |   832 |    24 |
|  4 |    HASH GROUP BY     |              |    32 |   224 |    24 |
|  5 |     TABLE ACCESS FULL| T2           |  1024 |  7168 |     5 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("AVG_VAL_VIEW"."ID_PARENT"="T1"."ID_PARENT")
   2 - filter("T1"."VC2"='XYZ')

So we may decide to add the hint /*+ push_pred(avg_val_view) */ to the query – we have to use the view-name method since we don’t have a query block containing the view; if we were using the inline view from the previous query we could have used the “query block” format /*+ push_pred(@inline) */. The plan from pushing predicates is:

--------------------------------------------------------------------
| Id | Operation               | Name        | Rows | Bytes | Cost |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT        |             |    1 |    82 |    7 |
|  1 |  NESTED LOOPS           |             |    1 |    82 |    7 |
|* 2 |   TABLE ACCESS FULL     | T1          |    1 |    69 |    2 |
|  3 |   VIEW PUSHED PREDICATE | AVG_VAL_VIEW|    1 |    13 |    5 |
|* 4 |    FILTER               |             |      |       |      |
|  5 |     SORT AGGREGATE      |             |    1 |     7 |      |
|* 6 |      TABLE ACCESS FULL  | T2          |   32 |   224 |    5 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."VC2"='XYZ')
   4 - filter(COUNT(*)>0)
   6 - filter("ID_PARENT"="T1"."ID_PARENT")

It would actually be a bad idea in this particular case, but if we could access the rows for a given id_parent in t2 efficiently this query could be much more efficient than the previous plan because it would only aggregate the small number of rows that it was going to need at each point, with the smallest row size.

You might note that Oracle has cleverly introduced a filter as operation 4 to eliminate t1 rows where the aggregate would return a row with a zero when there was no matching data. It’s details like this that typical programmers tend to forget when trying to transform SQL by hand.

Unnest / No_unnest

The optimizer prefers joins to subqueries, and will generally try to transform a query to turn a subquery into a join – which often means a semi-join for existence/in, or an anti-join for not exists/not in). As the optimizer has improved with version many such transformations (or decisions to not transform) changed from being driven by rules to being driven by cost – and sometimes we want to override the optimizer because we know its cost calculation is bad. Most commonly we might want to write a query with a subquery – to show our intentions – but tell the optimizer to unnest the subquery: it’s much safer to take this approach rather than to rewrite the query in unnested form ourselves – I’ve seen people do the rewrite incorrectly too many times to trust a user-created rewrite. For example:

select
        /*+ qb_name(main) unnest(@subq) */
        outer.* 
from 
        emp outer
where   outer.sal > (
                select 
                        /*+ qb_name(subq) unnest */
                        avg(inner.sal) 
                from    emp inner 
                where 
                inner.dept_no = outer.dept_no
        )
;

I’ve show the unnest hint here, and demonstrated the two possible forms – you can either use it in the main query block hint to point it at a give query block name (@subq), or you can use it without a “parameter” in the query block you want unnested. In effect the unnest hint causes Oracle to rewrite the query as:

select
        outer.* 
from
        (
        select 
                dept_no, avg(sal) av_sal
        from    emp 
        group by 
                dept_no
        )               inner,
        emp             outer
where 
        outer.dept_no = inner.dept_no
and     outer.sal > inner.av_sal
;

You’ll notice that this gives us an in-line aggregate view, so the optimizer could take (or be pushed) one more step into doing complex view merging as well, joining emp to itself before aggregating on a very messy set of columns.

Here’s the plan if we unnest:

----------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost  |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |  1000 | 98000 |   114 |
|*  1 |  HASH JOIN           |         |  1000 | 98000 |   114 |
|   2 |   VIEW               | VW_SQ_1 |     6 |   156 |    77 |
|   3 |    HASH GROUP BY     |         |     6 |    48 |    77 |
|   4 |     TABLE ACCESS FULL| EMP     | 20000 |   156K|    36 |
|   5 |   TABLE ACCESS FULL  | EMP     | 20000 |  1406K|    36 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("ITEM_1"="OUTER"."DEPT_NO")
       filter("OUTER"."SAL">"AVG(INNER.SAL)")

Notice the appearance at operation 2 of a “view” names VW_SQ_1: there are a number of internal view names that appear in Oracle as it transforms queries – the fact that a view name starts with VW_ is a good clue that it’s an internal one. Note, in this particular case that the main work done in the query is the two tablescans of EMP.

Here’s the plan if we don’t unnest:

------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost  |
------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   167 | 12024 |   252 |
|*  1 |  FILTER             |      |       |       |       |
|   2 |   TABLE ACCESS FULL | EMP  | 20000 |  1406K|    36 |
|   3 |   SORT AGGREGATE    |      |     1 |     8 |       |
|*  4 |    TABLE ACCESS FULL| EMP  |  3333 | 26664 |    36 |
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OUTER"."SAL"> (SELECT /*+ NO_UNNEST */
              AVG("INNER"."SAL") FROM "EMP" "INNER" 
              WHERE "INNER"."DEPT_NO"=:B1))
   4 - filter("INNER"."DEPT_NO"=:B1)

The FILTER at operation 1 tells us that nominally the optimizer will run the subquery once for every row in the emp table, but the optimizer costing (252) tells us that it thinks that really it will execute the table scan only 7 times in total (7 * 36 = 252): once for the driving scan and six more times because there are only six departments in my emp table.  (This “how many executions” type of estimate appeared in the costing calculations relatively recently.)

Push_subq / No_push_subq

Once we can control whether or not Oracle will unnest a subquery or run it as a filter we can then choose whether the subquery should run early or late. Historically the optimizer would always leave subqueries to the very end of query operation – but recently the choice of timing acquired a costing component. “Pushing” a subquery means pushing it down the execution tree – i.e. running it earlier in the plan. To demonstrate this we need a minimum of a two-table join with subquery:

select
        /*+ leading(t1 t2) push_subq(@subq) */
        t1.v1
from    t1, t3
where   t1.n2 = 15
and     exists (
                select  --+ qb_name(subq) no_unnest push_subq 
                        null
                from    t2
                where   t2.n1 = 15
                and     t2.id = t1.id
        )
and     t3.n1 = t1.n1
and     t3.n2 = 15
;

In this query I have a subquery where I’ve blocked unnesting, so it has to run as a filter subquery (in passing, I’ve use the alternative, less commonly known, format for hinting: the single-line hint/comment that starts with – – for a comment and – – + for a hint).

I’ve shown the push_subq hint (run the subquery early) in two different ways – first at the top of the query referencing the query block that I want pushed, and then in the subquery itself where it doesn’t need a parameter.

As you can see, the subquery is correlated to table t1 and I’ve told Oracle to examine only the join order t1 -> t3. The effect of the push_subq hint, therefore, is to tell Oracle to run the subquery for each row of t1 that it examines and join any survivors to t3. The alternative is for Oracle to join t1 to t3 and then run the subquery for every row in the result. Depending on the data and indexes available either option might be the more efficient.

Here are the two plans – first if I don’t push the subquery (note the FILTER operation):

--------------------------------------------------------------------
| Id | Operation                    | Name  | Rows | Bytes | Cost  |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT             |       |    1 |    28 |   289 |
|* 1 |  FILTER                      |       |      |       |       |
|* 2 |   HASH JOIN                  |       |  173 |  4844 |   116 |
|* 3 |    TABLE ACCESS FULL         | T1    |  157 |  3140 |    57 |
|* 4 |    TABLE ACCESS FULL         | T3    |  157 |  1256 |    57 |
|* 5 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 6 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ QB_NAME ("SUBQ2") NO_UNNEST */ 0
              FROM "T2" "T2" WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   2 - access("T3"."N1"="T1"."N1")
   3 - filter("T1"."N2"=15)
   4 - filter("T3"."N2"=15)
   5 - filter("T2"."N1"=15)
   6 - access("T2"."ID"=:B1)

Then if I push the subquery

--------------------------------------------------------------------
| Id |Operation                     | Name  | Rows | Bytes | Cost  |
--------------------------------------------------------------------
|  0 |SELECT STATEMENT              |       |    9 |   252 |   117 |
|* 1 | HASH JOIN                    |       |    9 |   252 |   115 |
|* 2 |  TABLE ACCESS FULL           | T1    |    8 |   160 |    57 |
|* 3 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 4 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |
|* 5 |  TABLE ACCESS FULL           | T3    |  157 |  1256 |    57 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T3"."N1"="T1"."N1")
   2 - filter("T1"."N2"=15 AND  EXISTS (SELECT /*+ QB_NAME ("SUBQ2")
              PUSH_SUBQ NO_UNNEST */ 0 FROM "T2" "T2" 
              WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   3 - filter("T2"."N1"=15)
   4 - access("T2"."ID"=:B1)
   5 - filter("T3"."N2"=15)

Notice how the access to t2 has squeezed itself between t1 and t3 and is also indented one place as a clue that it is a subordinate action on t1, but the FILTER operation visible in the previous plan has disappeared. This plan is an example of a plan that doesn’t follow the well-known “first child first / recursive descent” guideline – Oracle has hidden the FILTER operation and twisted the plan slightly out of its “tradiational” shape as a consequence.

Driving_site

The final hint is for distributed queries, and has no “negative” version. Sometimes the only way you can “tune” a distributed query is to minimise the time spent on network traffic, and this means dictating WHERE the query executes. The driving_site hint lets you make that choice. (Sometimes, having made that choice you also have to include a leading() hint to tell Oracle about the single join order you want it to consider – it’s possible for the optimizer to do some very strange things with distributed queries, especially if the instances have different NLS settings).

Consider the following query (I’ll fill in the XXXX in the hint shortly):

select  /*+ driving_site (XXXX) */
        dh.small_vc,
        da.large_vc
from
        dist_home               dh,
        dist_away@remote_db     da
where
        dh.small_vc like '1%'
and     da.id = dh.id;

This query extracts a small amount of data from a table called DIST_HOME in the local database, and joins it to some data in a table called DIST_AWAY in a remote database, producing a reasonably large number of medium-sized rows. There are basically two obvious plans:

  • nested loop – for each row in dist_home, query dist_away for matching data
  • hash join – create an in-memory hash table from the dist_home data, and then probe it with data from all the rows in dist_away.

The first plan will produce a large number of network round trips – so that’s not very good; the second plan will pull a very large amount of data from the remote database if the query operates at the local database (it’s only the columns we need, but it will be ALL the rows from the remote database).

Choosing the second plan but executing it at the remote database means we’ll send a small parcel of data to the remote database, do the join there to produce (we hope) a reasonable result set, then send it back to the local database. The network traffic will be minimised without causing an undesirable increase in other resource usage. To make this plan happen all I needed to do in the query was change the XXXX in the driving_site() hint to reference a table alias from a table in the remote database, in this case driving_site(da).

Here’s the execution plan:

-----------------------------------------------------------------------
| Id | Operation              | Name     | Rows | Bytes | Inst |IN-OUT|
-----------------------------------------------------------------------
|  0 | SELECT STATEMENT REMOTE|          |  216 | 48600 |      |      |
|* 1 |  HASH JOIN             |          |  216 | 48600 |      |      |
|  2 |   REMOTE               | DIST_HOME|  216 |  4320 |    ! | R->S |
|  3 |   TABLE ACCESS FULL    | DIST_AWAY| 2000 |   400K| TEST |      |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  1 - access("A1"."ID"="A2"."ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
  2 - SELECT "ID","SMALL_VC" FROM "DIST_HOME" "A2" WHERE "SMALL_VC" 
      LIKE '1%' (accessing '!')

Notice how the top line (id 0) includes the keyword REMOTE – this tells you that this is the plan from the viewpoint of the remote database/instance that will be executing it. Remember that from its viewpoint the database that we think is the local database is one that it thinks is remote – hence the REMOTE operation 2 which is addressing (our) local table DIST_HOME.

Other key points to note are the appearance of the Inst (instance) and IN-OUT columns. These tell you where each table is located – when a query executes remotely “our” database is tagged only by the name “!”.

A nice feature of the execution plan for a distributed query is that you can see how the query has been decomposed for execution at the remote site. In this case the other database will be sending our database the query at operation 2 to pull the rows it wants from small_vc so that it can do the join at its site and send the result back to us.

The thing you generally don’t want to see in more complex distributed queries is a separate query being generated for each remote table involved in the join – tables that live remotely should be joined remotely with just the join result being pulled back to the local database.

There is a special warning that goes with this hint – it isn’t valid for the select statements in “create as select” and “insert as select”. There seems to be no good reason for this limitation, but for CTAS and “insert as select” the query has to operate at the site of the table that is receiving the data. This means that you may be able to tune a naked SELECT to perform very well and then find that you can’t get the CTAS to use the same execution plan. A typical workaround to this problem is to wrap the select statement into a pipelined function and do a select from table(pipelined_function).

Conclusion

There are literally hundreds of hints available but, as a general guideline, there are only a few that are particularly useful and strategically sound. In this article I’ve listed the five hints that I’ve long considered to be the ones that are of most help and least risk. I have mentioned a couple of other hints in passing, and know that there are a couple of hints in the newer versions of Oracle that should eventually be added to the list; but the five I’ve mentioned give a sound basis to work from in understanding the benefits of using hints that shape the optimizer’s strategy for a query without trying to micro-manage it.

November 30, 2015

Trouble-shooting

Filed under: Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 6:03 am BST Nov 30,2015

This is the text of the “whitepaper” I submitted to DOAG for my presentation on “Core Strategies for Troubleshooting”.

Introduction

In an ideal world, everyone who had to handle performance problems would have access to ASH and the AWR through a graphic interface – but even with these tools you still have to pick the right approach, recognise the correct targets, and acquire information at the boundary that tells you why you have a performance problem and the ways in which you should be addressing it.

There are only three ways you can waste resources on an Oracle system, and only three different types of activity that need to be investigated. If you don’t appreciate that this is the case then you can waste a lot of time following the wrong strategy and attempting to solve the wrong problems. Once you have a clear statement of what you are trying to achieve it becomes much easier to achieve those aims.

Three ways to cause problems

There really are only three symptoms you need to look out for in the database

  • You’re doing it the hard way
  • You’re doing it too often
  • You’re queueing / locking

Inevitably you can see that there is going to be some overlap between the three (and it would be easy to argue that the third is simply a side effect of the other two). If you are executing a single statement “the hard way” you’re likely to be doing single block reads or buffer gets too often. If you’re executing a very lightweight statement too often it’s probably a step in a process that is attempting get a job done “the hard way”, and it’s probably causing (and getting stuck in) queues for latches and mutexes. I’ve included queueing as the third option because the simple action of locking data (deliberately, or by accident) doesn’t fall neatly into the other two.

Another way of looking at this list is to reduce it to just two items with a slightly different flavour: when you spend too much time on a task it’s because you’re either doing too much work or you’re not being allowed to work.

Three classes of problems to solve

An important aspect of performance problems is the high-level classification. Labelling the class of problem properly points you to the correct strategy for investigating the problem. Again there are only three possibilities in the list, which I’ll express as typical complaints:

  • My report is taking too long to run / screen is taking to long to refresh
  • The batch job over-ran the SLA last night
  • The “system” is slow

What’s the most significant difference between three classes of complaint ?

  • “My XXX takes too long”: XXX is a repeatable event that can be isolated and watched – just do it again, and again, and again, and again while I watch every step of what’s going on.
  • The batch job over-ran last night: It’s not a repeatable process, so you’ve got to infer what the problem was from historical evidence; it’s (usually) made up of a number of concurrent processes, which may interfere with each other to varying degrees depending on when their relative start and finish times were.
  • The “system” is slow: possibly all the time, possibly intermittently – if there’s no specific complaint then the only option is to keep an eye open for resource-intensive activity to see if you can reduce the resource usage of specific individual tasks (for which read SQL or PL/SQL statements) or reduce the number of times that those tasks are executed.

The common source

Despite the fact that we can classify performance problems in three ways, it’s worth remembering that ALL the information we might use to drive our diagnosis comes from one place – the Oracle database. At any moment we have sessions that are active, operating statements (cursors), and using a resource such as a file, a latch, a buffer, and so on. We could almost represent each moment by a cube with sessions along one side, cursors along another, and resources along the third – the cube changes moment by moment, allowing us to visualise time as the fourth dimension in a hypercube of activity.

Instant by instant the Oracle kernel code knows which session is using what resource to operate which cursor and, although the total volume of all that information is far more than could reasonably be recorded, Oracle has many different ways of slicing, dicing and capturing parts of that hypercube – running totals, rolling aggregates, snapshots by session, by cursor, by resource and so on – that are made visible as the dynamic performance views (v$ objects). Trouble-shooting is largely a case of deciding which dynamic performance views are the most appropriate to use for our three classes of task.

Active Session History

Before reviewing the three classes, it’s worth breaking off for a moment to say a few things about one of the most important and useful views that we have of the wealth of information available; this is the active session history (v$active_session_history / ASH) which Oracle uses to capture a snapshot once every second of what each currently active session is doing; every 10th snapshot is then echoed down into the Automatic Workload Repository (AWR) by a process that runs every hour to copy the dynamic view to a table that can be accessed through the database view dba_hist_active_sess_history.

The capture rates can all be adjusted: I have never seen anyone change from one snapshot per second, or every 10th snapshot in a production system, but I’ve often seen the dump to the AWR taken every 30 minutes, occasionally 20 or even 15 minutes. On occasion I have asked clients to do a CTAS (create table as select – nologging) to capture the whole of the v$active_session_history to a table that can be exported and reviewed at a later date. Oracle tries to keep a minimum of at least an hour’s ASH in memory but, with a large enough SGA, you may find that this will stretch out to 3 or 4 hours.

Unfortunately although (or perhaps because) ASH and its AWR history are extremely helpful, you have to pay extra licence fees to use the information, and the technology can only be licensed with the Enterprise Edition of Oracle. (Which is why there are some ASH emulators on the Internet)

My report is slow

The special feature of someone complaining about a specific task is that it’s likely to be repeatable – so we can run it again and again and watch every single detail to see where the time goes. Our slice through the hypercube could take a single session over a period of time and report every action along that path. This, of course, is the 10046 – a.k.a extended SQL trace event. We can enable it in many ways, perhaps through a logon trigger, perhaps through a call to dbms_monitor:


begin
        dbms_monitor.session_trace_enable( 
                session_id => &m_sid, 
                serial_num => &m_serial,
                waits      => true, 
                bind       => true, 
                plan_stat  => 'all_executions'
        );
end; 
/

In this example I’ve request all wait states and bind variable to be dumped into the trace file, I’ve also requested that the execution plan (with rowsource execution stats) be dumped for every single execution of every single statement. Sometimes a problem arises because a particular set of bind variables represents a special case that causes a “reasonable” plan to behave very badly. If we’re going to look closely we may as well get as much detail as possible.

The entire “trace” interface was upgraded dramatically in 11g, and one of the useful variants on this theme is particularly relevant to a commonly used Web-based implementation. If you know that a specific screen task corresponds to a particular PL/SQL package you can enable tracing of a cursor (across the system, if necessary) by SQL_ID. So, for example, you might issue the following two commands, with a couple of minutes gap between the two:


alter system
        set events '
                sql_trace[SQL:1wthpj7as7urp]
                plan_stat=all_executions,
                wait=true, 
                bind=true
        '
;

-- wait a few minutes

alter system
        set events 'sql_trace[SQL:1wthpj7as7urp] off'
;

Every time the statement with SQL_ID =‘1wthpj7as7urp’ is executed, the session executing it will start adding information to the session trace file, and when the statement ends the tracing will end. This is particularly nice if the “statement” is a top-level call to a PL/SQL procedure because all the SQL inside the procedure will be traced as the package executes.

For a highly focused, highly repeatable task, the 10046 trace event is almost always all you need to do.

The batch over-ran

The big difference between this case and the previous one is that “the batch” is not something you can simply repeat and watch. Moreover, “the batch” is likely to be a large number of separate sections of code that are scheduled to run with a fairly fluid timetable that can result in changes from day to day (or, more likely, night to night) in the set of jobs that might be running concurrently. This means that even if you could re-run the batch job (perhaps on the previous night’s backup) you might not see the same problem appear because a small change in timing could result in a large change in contention).

One of the most important steps of dealing with the batch is pre-emptive: instrument your code and make it possible to compare the run on one night with the run on another. At the very least you need to have something capturing the start and end times of each “significant component” of the batch so you can quickly answer questions like: “which jobs took much longer than usual”, “which job was the first job that took longer than usual”, “which jobs were running concurrently with job X last night when they never usually overlap?”

Ideally you should have much more information than this about each job – basically a report from Oracle which says “how much work did I do, how much time did I spend”: for a session this is simply a report of v$mystat or v$sesstat (joined to v$statname) and v$session_event for the session (v$mystat is a less well- known view that is v$sesstat restricted to “my” session) ; if you classify each job as “connect to the database, do something, disconnect” then this critical log is simply a pair of select statements spooled out somewhere convenient, or written to the database; if you want to break a single connection into significant pieces then a simple pl/sql procedure could read the statistics into a pl/sql array as the piece starts, then re-read the stats and calculate the differences as the piece ends.

Knowing where the time went, and knowing how tasks have behaved differently from previous runs is a big step forward to identifying the problem.

If you don’t have the instrumentation you need then the AWR (if you’re licensed) or Statspack (if you’re not licensed) is a step in the right direction. Apart from the typical hourly snapshots and reports you can take a snapshot as the first and last steps of the batch so that you’ve got “the whole batch” in a single AWR/Statspack report. If you’ve got that you can then do comparisons for things like:

  • Which event consumed much more time than usual
  • Which SQL took much more time than usual
  • Which segment(s) saw much more activity than usual
  • Was there some unusual enqueue activity
  • Can we see some unusual outliers in the event histograms
  • Can we see some unusual memory demands in the pga histogram

Although system-wide summaries rarely point us at exact causes, they can often give us strong clues of areas (and times) where problem originated.

In this respect the “Top Activity” screen from Enterprise Manager (Grid Control / Cloud Control) can be very helpful as it produces a nice graphical presentation of “working”. Where, in the picture of last night’s activity, does the graph start to ramp up, and what colour is the bit that’s growing, and how does that picture compare to the same picture the previous night. (Having two windows open with two different nights makes it wonderfully easy to switch between displays and spot the differences.) Since the “historical” top activity screen is created from the dba_hist_active_sess_history, which contains about 100 different details per session of each captured moment, it’s very easy to drill though the spikes to answer questions like: “which object”, “what event”, “which SQL”, “what was the execution plan”, “how much work did that take”, to follow the chain of time back to the cause.

The system is slow

If no-one is going to tell you about specific tasks, and if you don’t have any sort of boundary that allows you to focus on tasks or time-ranges, then the simplest thing to do is look for anything expensive (i.e. time-consuming) and see if you can make it cheaper.

Again, the graphic “Top Activity” screen is very helpful, and I often tell people to arrange to have a system that shows the top activity screens for the most important 2 or 3 databases on a large screen on the wall where any passing DBA might notice a brief spike in workload. There are systems that can be improved by constant monitoring – so long as the monitoring doesn’t take out 100% of an individual’s time but is driven as an informal glance at a picture.

If you’re not licensed to take advantage of the AWR then Statspack can help – but with the smallest time interval (though 15 minutes is as low as I’ve ever gone) between snapshots so that “anomlies” that are short-lived don’t fall out of memory before they can be captured.

An important feature of reading Statspack is that you need to check for missing information – if the headline figure for physical reads is 25M but the “SQL ordered by reads” is 12M then you know that there must be 13M reads that didn’t get captured in the report and that might be the 13M that is causing the problem. Similarly if the “Segments by physical reads” reports 16M reads that’s 4M more than the SQL – but is the 12M a subset of the 16M, or is there only a 3M overlap between the two figures so that between them the 12M and 16M cover the entire 25M. There’s more information in the Statspack report than immediately meets the eye, and a careful drilldown into other areas of the report (typically the Instance Activity) may be needed to clarify the meaning of what you’re seeing as a headline figure.

The other thing you can do with “the slow system” when you don’t have ASH to help is take snapshots (or get some freeware to do the same sort of thing). If the system is slow “right now” you could, for example, take a snapshot of v$sess_io (session I/O), wait 30 seconds then take another snapshot, find the difference and see who is doing most of the I/O work – then chase that session; or take snapshots of v$sesstat limited to (say) statistics like “%redo%” and find out who is generating lots of redo.

Oracle allows you to take this approach back into recent history – there are a number of “metric” views which give you thing like the rolling average, min, and max I/O volumes for the last few intervals of 5 minutes or 1 minute each – telling you, in other words, whether there were any interesting bursts of extreme activity in the recent past. For example, a query against v$sysmetric_summary might give you an output like the following:

METRIC_NAME                     MAXVAL     AVERAGE STANDARD_DEV METRIC_UNIT
------------------------ ------------- ----------- ------------ -----------------------
Physical Reads Per Sec        1,618.95      105.92       358.16 Reads Per Second
Physical Reads Per Txn       97,202.00    5,539.19    20,811.56 Reads Per Txn
Redo Generated Per Sec    6,773,108.94  218,132.86 1,023,458.57 Bytes Per Second
User Calls Per Txn              395.00       43.39        79.85 Calls Per Txn
Total Parse Count Per Sec        31.14        1.88         4.25 Parses Per Second
Host CPU Utilization (%)         64.51        3.93         9.07 % Busy/(Idle+Busy)
Database Time Per Sec            82.96        6.65        15.37 CentiSeconds Per Second
I/O Megabytes per Second         35.58        2.62         5.73 Megabtyes per Second

This summarises the last 12 intervals of 5 minutes. If we look at “Physical Reads per Txn” we can see that there were some extreme swings in activity over that period, so we could drill down into v$sysmetric_history for “Physical Reads per txn”, looking at the 1 minute granularity and see:

METRIC_UNIT                    BEGIN_TIME                VALUE
Physical Reads Per Txn         05-feb 12:45:55          421.00
                               05-feb 12:44:55          477.00
                               05-feb 12:43:55          351.00
                               05-feb 12:42:55          406.84
                               05-feb 12:41:55        1,550.00
                               05-feb 12:40:55       93,984.00
                               05-feb 12:39:55       97,202.00
                               05-feb 12:38:55       93,323.00
                               05-feb 12:37:55          391.00
                               05-feb 12:36:55          504.00
                               05-feb 12:35:55          504.00
                               05-feb 12:34:55          252.00

Yes, a few minutes ago something exploded onto the system doing a huge amount of I/O for about 3 minutes. If we’re lucky we might now drill into the v$sesstat, or v$sess_io, or v$session_event to see if we can find a session that is responsible for a large amount of I/O; and then check v$open_cursor to see if it still has some open cursors that might (if we check v$sql) show us what caused the I/O.

When there are no specific complaints, we just keep an eye open for spikes in activity and try track them down as quickly and cheaply as possible to see if they’re worth addressing.

Summary

Oracle gives you a huge amount of information about the work that’s going on and the time that’s being used in the database. Unfortunately the most useful repository of that information is in a dynamic performance view that can only be viewed in the Enterprise Edition after purchasing additional licences. However, the information is summarised, in many different ways in literally hundreds of other dynamic performance views, and it’s easy to pick out helpful information from those views in a variety of ways.

Key to making the best use of those views, though, is recognising that different classes of performance problems require different strategies – and there are only three different classes of problems to worry about.

November 24, 2015

Table Expansion

Filed under: 12c,Bugs,Oracle,Partitioning — Jonathan Lewis @ 12:13 pm BST Nov 24,2015

I’ve often found that while I’m investigating one Oracle feature I get waylaid by noticing anomalies in other parts of the code. I was caught by one of these events a little while ago while experimenting with the new (12.1.0.2) Inmemory Columnar Store.  After reading a posting by Martin Bach I asked the question:

“If you have a partitioned table with a local index and one of the table partitions has been declared INMEMORY, would a query that could use that index be able to apply table expansion to produce a plan that did a tablescan on the in-memory partition and an indexed access path on the partitions that weren’t in-memory?”

The answer was yes, with one important caveat – the first test I built to investigate the question did something very silly and produced the wrong results. In fact the error turned out to have nothing to do with the inmemory columnar store, it also appeared when I disabled the inmemory feature off and nudged the optimizer into table expansion by making one of the index partitions unusable. Here’s the demo, followed by a few notes, if you want to test it for yourselves:


create table t1 (
        id,
        n1,
        padding
)
partition by range (id) interval (250000) (
        partition p_start values less than (250000)
        segment creation immediate
        inmemory priority critical
)
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum - 1              id,
        mod(rownum - 1,1250)    n1,             -- 200 rows per segment
        rpad('x',400,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

begin dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(n1) local nologging;

I’ve created an interval-partitioned table holding a million rows with an interval of 250,000, and I’ve defined the first partition with an upper bound of 250,000 and the inmemory attribute, leaving Oracle to add the other three partitions which won’t have the inmemory attribute set.

I’ve created a local index on the n1 column, with 200 rows (1,250 distinct values) per partition. Because of the way I’ve defined n1 I can easily recreate the table to adjust the relative code of the index accessed path and the tablescan path by adjusting the value I use in the mod() function. The use of the mod() function also means that every partition holds the same volume of data (with exactly the same degree of scattering) for any specific value of n1.

To test the option for table expansion I’ve then checked the execution plan for (and run) four queries:


select
        /*+ full(t1) */
        id
from
        t1
where
        n1 = 1
and     id < 250000
;

select
        /*+ index(t1) */
        id
from
        t1
where
        n1 = 1
and     id < 250000 
;
 
select
        /*+ full(t1) */
        id 
from    t1
where   n1 = 1
and     id >= 250000
and     id <  500000
;

select
        id
from
        t1
where
        n1 = 1
;

The first two queries give me the cost of accessing the inmemory partition by tablescan compared to the cost of accessing it by index. The third query is there to demonstrate that the non-inmemory tablescan is a LOT more expensive than the inmemory one. The final query accesses all four partitions to see if the inmemory partition is treated differently from the other three. Here are the four plans:

Default tablescan when limited to the in-memory partition
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |   200 |  1800 |   184   (9)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE     |      |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|*  2 |   TABLE ACCESS INMEMORY FULL| T1   |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - inmemory("N1"=1 AND "ID"<250000)
       filter("N1"=1 AND "ID"<250000)



Index access is more expensive than the inmemory tablescan
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |   200 |  1800 |   201   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE                    |       |   200 |  1800 |   201   (0)| 00:00:01 |     1 |     1 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1    |   200 |  1800 |   201   (0)| 00:00:01 |     1 |     1 |
|*  3 |    INDEX RANGE SCAN                        | T1_I1 |   200 |       |     1   (0)| 00:00:01 |     1 |     1 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<250000)
   3 - access("N1"=1)



Tablescan on a non-inmemory partition is much higher than inmemory (and indexed access)
-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |   200 |  1800 |   1891  (2)| 00:00:01 |       |       |
|   1 | PARTITION RANGE SINGLE|       |   200 |  1800 |   1891  (2)| 00:00:01 |     2 |     2 |
|*  2 | TABLE ACCESS FULL     | T1    |   200 |  1800 |   1891  (2)| 00:00:01 |     2 |     2 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=1 AND "ID">=250000 AND "ID"<500000)



Access all four partitions - no table expansion. Cost = 4 x cost of single partition indexed access path.
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |   800 |  7200 |   807   (1)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL                       |       |   800 |  7200 |   807   (1)| 00:00:01 |     1 |1048575|
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1    |   800 |  7200 |   807   (1)| 00:00:01 |     1 |1048575|
|*  3 |    INDEX RANGE SCAN                        | T1_I1 |   800 |       |     6   (0)| 00:00:01 |     1 |1048575|
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=1)

So we can see that the inmemory tablescan of a single segment is cheaper than the indexed access path, and much cheaper than the tablescan of a single non-inmemory segment. When we execute the query that accesses all four segments, though, the optimizer chooses to use the same access path for all four partitions rather then splitting the query into one tablescan and three indexed accesses.

This is just a starting point for testing though – there are transformations where the optimizer will only use a particular transformation if the transformed query has a cost that is “X%” lower than it would be without the transformation (and some of these transformations have associated parameters – usually hidden – that can be set to adjust the percentage). Perhaps if I adjusted the data so that the relative benefit of a single inmemory scan was larger; or if I created a table with more partitions and had two inmemory and three not inmemory, or 3 inmemory and 4 not inmemory; or if I had more inmemory partitions than not then maybe the optimizer would spontaneously do table expansion.

Rather than fiddle about with the data my next step was to hint the final query with /*+ expand_table(t1) */. Here’s the resulting execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |       |       |  1797 (100)|          |       |       |
|   1 |  VIEW                                        | VW_TE_1 |  1600 | 20800 |  1797   (1)| 00:00:01 |       |       |
|   2 |   UNION-ALL                                  |         |       |       |            |          |       |       |
|   3 |    PARTITION RANGE SINGLE                    |         |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|*  4 |     TABLE ACCESS INMEMORY FULL               | T1      |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|   5 |    PARTITION RANGE ITERATOR                  |         |   600 |  5400 |   806   (1)| 00:00:01 |     2 |     4 |
|*  6 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |   600 |  5400 |   806   (1)| 00:00:01 |     2 |     4 |
|*  7 |      INDEX RANGE SCAN                        | T1_I1   |   800 |       |     5   (0)| 00:00:01 |     2 |     4 |
|   8 |    PARTITION RANGE INLIST                    |         |   800 | 16800 |   807   (1)| 00:00:01 |       |       |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |   800 | 16800 |   807   (1)| 00:00:01 |       |       |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |   800 |       |     6   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - inmemory(("N1"=1 AND "T1"."ID"<250000))
       filter(("N1"=1 AND "T1"."ID"<250000)) 6 - filter(("T1"."ID">=250000 AND "T1"."ID"<1000000))
   7 - access("N1"=1)
  10 - access("N1"=1)

Points to notice:
Table expansion has occurred – we did an inmemory full tablescan on the inmemory partition at a cost of 184, and index range scans on the other three partitions at a cost of 805 (which looks more like the cost of visiting 4 partitions), expecting 800 rowids from the local index (a failure of partition elimination) and 600 rows from the table. The partition start/stop columns do show that the optimizer expects to visit just the three correct partitions, although some of the cost and cardinality numbers seem to be appropriate to 4 partitions.

Worse still we see a third branch to the UNION ALL / table expansion – operations 8 to 10 – which don’t report a partition start and stop. What’s that doing ? Sometimes, of course, we see sections of plan that don’t actually run – usually preceded by a FILTER operation that can evaluate to FALSE – so maybe that’s what’s happening here. Re-running the query with rowsource execution stats enabled it turned out that the PARTITION RANGE INLIST started once, and operations 9 and 10 didn’t operate at all – so that’s okay.

But what happens if I execute a query that should only access the first two partitions ? Here’s the run-time plan to answer that question:


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  5dgp982ffsph8, child number 0
-------------------------------------
select  /*+ expand_table(t1) gather_plan_statistics */         id from
       t1 where  n1 = 1 and id < 500000

Plan hash value: 2876620067

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |      1 |        |    800 |00:00:00.05 |     649 |
|   1 |  VIEW                                        | VW_TE_1 |      1 |   1200 |    800 |00:00:00.05 |     649 |
|   2 |   UNION-ALL                                  |         |      1 |        |    800 |00:00:00.01 |     649 |
|   3 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |       3 |
|*  4 |     TABLE ACCESS INMEMORY FULL               | T1      |      1 |    200 |    200 |00:00:00.01 |       3 |
|   5 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |     215 |
|*  6 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      1 |    200 |    200 |00:00:00.01 |     215 |
|*  7 |      INDEX RANGE SCAN                        | T1_I1   |      1 |    200 |    200 |00:00:00.01 |      15 |
|   8 |    PARTITION RANGE INLIST                    |         |      1 |    800 |    400 |00:00:00.01 |     431 |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      2 |    800 |    400 |00:00:00.01 |     431 |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |      2 |    800 |    400 |00:00:00.01 |      31 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - inmemory(("N1"=1 AND "T1"."ID"<250000))
       filter(("N1"=1 AND "T1"."ID"<250000))
   6 - filter(("T1"."ID"<500000 AND "T1"."ID">=250000))
   7 - access("N1"=1)
  10 - access("N1"=1)

My query should have returned 400 rows, accessing only the first two partitions. It returned 800 rows, accessing all 4 partitions. (Notice how there are no predicates on ID for operations 9 and 10).

Since I had started my investigation with the inmemory option I assumed at this point that the problem was somehow related to the that feature; which shows how easy it is to get too focused and jump to conclusions. After raising the problem with Oracle I got a reply that the problem wasn’t about the inmemory columnar store – and here’s the next little change to test to demonstrate that point:


alter index t1_i1 indexing partial;

alter table t1 modify partition p_start indexing off;
alter table t1 modify partition p_start no inmemory;

select  partition_name, indexing, inmemory 
from    user_tab_partitions 
where   table_name = 'T1';

select  partition_name, segment_created    
from    user_ind_partitions 
where   index_name = 'T1_I1';

Enable partial indexing for the index, switch off the index on the first partition then disable the inmemory option for the partition. This is the plan I got from re-running the two-partition query:


---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |      1 |        |    800 |00:00:00.08 |   15370 |  14706 |
|   1 |  VIEW                                        | VW_TE_1 |      1 |   1200 |    800 |00:00:00.08 |   15370 |  14706 |
|   2 |   UNION-ALL                                  |         |      1 |        |    800 |00:00:00.07 |   15370 |  14706 |
|   3 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |     216 |      0 |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      1 |    200 |    200 |00:00:00.01 |     216 |      0 |
|*  5 |      INDEX RANGE SCAN                        | T1_I1   |      1 |    200 |    200 |00:00:00.01 |      16 |      0 |
|   6 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.04 |   14723 |  14706 |
|*  7 |     TABLE ACCESS FULL                        | T1      |      1 |    200 |    200 |00:00:00.04 |   14723 |  14706 |
|   8 |    PARTITION RANGE INLIST                    |         |      1 |    800 |    400 |00:00:00.01 |     431 |      0 |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      2 |    800 |    400 |00:00:00.01 |     431 |      0 |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |      2 |    800 |    400 |00:00:00.01 |      31 |      0 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("T1"."ID"<500000 AND "T1"."ID">=250000))
   5 - access("N1"=1)
   7 - filter(("N1"=1 AND "T1"."ID"<250000))
  10 - access("N1"=1)


Again I accessed all 4 partitions and returned 800 rows. As an odd little detail the plan reversed the order of access of partitions 1 and 2. The problem isn’t about the inmemory option, it’s a problem with generic table expansion.

Continuing Tests
I won’t go into all the details of what I did next – once you get started it’s hard to stop, and easy to lose track of what you’ve done. So here’s a quick summary.

Given the problem appeared without tthe inmemory columnar store enabled, I switched back to 11.2.0.4 (where table expansion is also possible) and emulated the problem by setting the first index partition unusable (since partial indexing is a 12c feature). Table expansion did not occur even when hinted: the plan was a full tablescan on both partitions.

So I switched the table from being interval partitioned to simple range partitioned, creating all 4 partitions as I created the data. In 11g I got table expansion and the correct answer without the extra branch to the union all; so I went back to 12c and did the same – pure range partitioning, partial indexing, and got table expansion with the correct result and no spare branches to the plan.

Tentative Conclusion

Clearly my testing is by no means exhaustive – but 12c seems to allow table expansion for interval partitioning in cases where 11g does not; unfortunately it is possible for the table expansion code to go wrong for interval partitioning in cases where simple range partitioning does not. In my case this led to wrong results.

 

Reference Script: in_memory_table_expansion_*.sql

 

 

November 9, 2015

Wrong Results

Filed under: 12c,Bugs,Oracle — Jonathan Lewis @ 6:23 am BST Nov 9,2015

Here’s a little gem in 12c that arrived in my email a few days ago: a query where the result depends on the SQL*Plus arraysize!

The email had a short description, and a script to create a small data set that would demonstrate the problem. I’m not going to show you the query, or the result set, but here’s a sample of the output from an SQL*Plus session after creating the data. This is, by the way, on a “single-user” system – there is no way that some other session is changing the data – especially after the opening “set transaction”:

SQL> set transaction read only;

Transaction set.

SQL> set arraysize 1
SQL> select ...

...

541 rows selected.

SQL> set arraysize 4
SQL> select ...

...

599 rows selected.

SQL> set arraysize 10
SQL> select ...

...

620 rows selected.

SQL> set arraysize 32
SQL> select ...

...

616 rows selected.

The correct result set should have had the 616 rows reported when the arraysize was set to 32 (of, course, it’s possible with an arraysize of 32 the 616 rows returned weren’t the correct 616 rows – rows seemed to get multiplied or deleted fairly arbitrarily as the arraysize changed).

The execution plan was a little unusual in that it forced a nested loop join with a tablescan on the inner table; and when I ran the query with rowsource execution statistics enabled the number of starts of the inner tablescan was 1,597 but the number of rows actually returned varied. My first thought was that some new mechanical optimisation of the tablescan code was losing track of where it had got to in the repeated tablescans – but it turned out I was wrong.

Here’s the execution plan (with camouflage) – the key detail is in a section I didn’t look at intially, the column projection:


select * from table(dbms_xplan.display_cursor('0dh0kh9qa88mz',1,'-note +projection'));

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       | 14118 (100)|          |
|   1 |  NESTED LOOPS       |      |    29 |  2958 | 14118   (2)| 00:00:01 |
|*  2 |   HASH JOIN         |      |   892 | 57088 |    35   (3)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ABC  |   549 | 21411 |    17   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| DEF  |   892 | 22300 |    17   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | XYZ  |     1 |    38 |    16   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEF"."ABC_FK"="ABC"."ABC_ID")
   5 - filter(("DEF"."COL0"="XYZ"."COL0" AND "XYZ"."COL1"="ABC"."COL1"
              AND "XYZ"."COL2"="ABC"."COL2"))

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "ABC"."ABC_ID"[NUMBER,22], "DEF"."ABC_FK"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "DEF"."COL0"[VARCHAR2,20], "XYZ"."COL1"[NUMBER,22],
       "XYZ"."COL2"[NUMBER,22], "XYZ"."COL0"[VARCHAR2,20]
   2 - (#keys=1) "ABC"."ABC_ID"[NUMBER,22], "DEF"."ABC_FK"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "DEF"."COL0"[VARCHAR2,20]
   3 - (rowset=200) "ABC"."ABC_ID"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22]
   4 - (rowset=200) "DEF"."ABC_FK"[NUMBER,22], "DEF"."COL0"[VARCHAR2,20]
   5 - "XYZ"."COL1"[NUMBER,22], "XYZ"."COL2"[NUMBER,22],
       "XYZ"."COL0"[VARCHAR2,20]

The predicate section is irrelevant in this case, and I’ve camouflaged the names of the tables and columns – the only interesting bit is the appearance of the (rowset=200) in the projection information. This is reporting a feature new in 12c (and not to be confused with Oracle Java Rowsets) that should improve the performance of some queries.

I didn’t actually look at the projection information until after I’d asked the Oak Table members if they had ever seen this type of anomaly before – and Stefan Koehler emailed back to suggest that the problem might be related to rowsets (there are a couple of similar bugs on MoS, e.g: 17016479 and 20960570) – so I checked the projection, then repeated my tests after disabling the feature with a call to: ‘alter session set “_rowsets_enabled”=false;’

Problem solved – although I’ve told the person who emailed me to report this discovery and workaround to Oracle support and see what they supply as the approved solution.

It is possible to affect the feature through event 10055 – different levels disable it at different locations in the code; the list of options is given in the $ORACLE_HOME/rdbms/mesg/oraus.msg file (if you’re not running Windows):


//            Level:
//            0x00000001 - turn off for table scan
//            0x00000002 - turn off for hash join consume
//            0x00000004 - turn off for hash join produce
//            0x00000008 - turn off for group by
//            0x00000010 - turn off for sort
//            0x00000020 - turn off for table-queue out
//            0x00000040 - turn off for table-queue in
//            0x00000080 - turn off for identity
//            0x00000100 - turn off for granule iterator
//            0x00000200 - turn off for EVA functions
//            0x00000400 - turn off for PL/SQL
//            0x00000800 - turn off for upgrade
//            0x00001000 - turn off for database startup
//            0x00002000 - turn off for blobs and clobs
//            0x00004000 - turn off for tracing row source
//            0x00008000 - turn off rowset information in explain plan
//            0x00010000 - disable hash join rowsets fast path
//            0x00020000 - turn off for bloom create
//            0x00040000 - turn off for bloom use
//            0x00080000 - disable prefetch for hash join
//            0x00100000 - disable prefetch for bloom
//            0x00200000 - disable semi blocking hash join
//            0x00400000 - turn off rowset for fixed table

I tried the first few levels and found that both levels 1 and 2 eliminated the problem (and eliminated the appearance of the (rowset=200) entry in the projection information). Given the shape of the plan I had thought that just one of 1,2 or 4 might have been relevant so I was a little surprised to find that both 1 and 2 were effective – but that’s probably just a question of interpretation of the brief descriptions.

Update

I’ve asked the owner of the problem if it’s okay to post the script to create the tables and data – and the answer was yes: the content was already heavily camouflaged anyway. So here’s a file you can download if you want to test other environments: insert_script_12c_bug

It’s declared as a “.doc” file to get past the upload process, but it’s really a flat text file.

Update 16th Nov

The official “minimum impact” workaround is to set event 10055 at level 2097152 (disable semi blocking hash join). Alternatively there is also a patch available. See Mike Dietriech’s blog for details: https://blogs.oracle.com/UPGRADE/entry/update_for_switch_off_rowsets

Update 7th Dec

Mike Dietriech has updated his second posting to add details of a patch for this bug.

 

November 6, 2015

Filter Hash

Filed under: Execution plans,Hints,Oracle — Jonathan Lewis @ 6:43 am BST Nov 6,2015

One of the most irritating features of solving problems for clients is that the models I build to confirm my diagnosis and test my solutions often highlight further anomalies, or make me ask questions that might produce some useful answers to future problems.

Recently I had cause to ask myself if Oracle would push a filter subquery into the second tablescan of a hash join – changing a plan from this:

filter
	hash join
		table access full t1
		table access full t2
	table access by rowid t3
		index range scan t3_i1

to this:

hash join
	table access full t1
	filter
		table access full t2
		table access by rowid t3
			index range scan t3_i1

or, perhaps more likely, to this:

hash join
	table access full t1
	table access full t2
		table access by rowid t3
			index range scan t3_i1

The final variation here is an example where the FILTER operation itself is swallowed up in line 3 of the plan, twisting the body of the plan in a way that makes the “first child first” rule of thumb lead to an incorrect interpretation. I’ve discussed this pattern of behaviour before, but in the earlier cases the “missing filter” has either applied to an index or to the first table of the hash join.

The type of query where the the strategy for pushing a filter subquery into the second table of a hash join might be appropriate would be something like the following (although in this simple case we’d probably expect Oracle to unnest the subquery and turn it into a semi-join):

select
        t1.n1,
        t2.n1
from
        t1, t2
where
        mod(t1.n1,100) = 0
and     t2.id = t1.id           -- join condition with a possible order t1 -> t2
and     exists (
                select          -- subquery that could be pushed against t2
                        null
                from    t3
                where   t3.id = t2.n1
        ) 
;

The benefit of using a filter subquery and pushing it would only appear in specific circumstances – you would would need the number of executions of the subquery to be significantly larger AFTER the hash join than BEFORE in order for the early subquery filter to be a good idea.

Since there are always special cases that can be improved by carefully selected optimisation strategies I created three tables to find out what plans I could produce by blocking unnesting and trying to push the filter subquery. Here’s the code I used for the tables:


create table t1 nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        rownum                  n1,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5
;

create table t2 nologging as
select * from t1;

create table t3 nologging as
select * from t1;

create index t3_i1 on t3(id);

-- gather stats if needed (version dependent) with no histograms

With this data in place I can experiment with hinting the path I want to see; there are two basically two parts to the hints I need, the first in the main query to control the join: /*+ leading (t1 t2) use_hash(t2) no_swap_join_inputs(t2) */, the second in the subquery /*+ no_unnest push_subq */. So here are a couple of plans – first without the push_subq hint:


Plan hash value: 2281699686

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |      1 |        |   926 (100)|   1000 |00:00:00.94 |    5409 |       |       |          |
|*  1 |  FILTER             |       |      1 |        |            |   1000 |00:00:00.94 |    5409 |       |       |          |
|*  2 |   HASH JOIN         |       |      1 |   1000 |   425   (5)|   1000 |00:00:00.91 |    3295 |  1888K|  1888K| 1502K (0)|
|*  3 |    TABLE ACCESS FULL| T1    |      1 |   1000 |   214   (6)|   1000 |00:00:00.03 |    1614 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2    |      1 |    100K|   209   (3)|    100K|00:00:00.23 |    1681 |       |       |          |
|*  5 |   INDEX RANGE SCAN  | T3_I1 |   1000 |      1 |     1   (0)|   1000 |00:00:00.02 |    2114 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( IS NOT NULL)
   2 - access("T2"."ID"="T1"."ID")
   3 - filter(MOD("T1"."N1",100)=0)
   5 - access("T3"."ID"=:B1)


In the absence of the push_subq hint the optimizer has taken the hash join (operations 2 – 4) and filtered late (operations 1 and 5).

When I included the push_subq hint this is what I got in 11.2.0.4:


Plan hash value: 2281699686

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |       |      1 |        |   424 (100)|   1000 |00:00:00.94 |    5409 |       |       |          |
|*  1 |  FILTER             |       |      1 |        |            |   1000 |00:00:00.94 |    5409 |       |       |          |
|*  2 |   HASH JOIN         |       |      1 |   1000 |   423   (5)|   1000 |00:00:00.91 |    3295 |  1888K|  1888K| 1535K (0)|
|*  3 |    TABLE ACCESS FULL| T1    |      1 |   1000 |   214   (6)|   1000 |00:00:00.03 |    1614 |       |       |          |
|   4 |    TABLE ACCESS FULL| T2    |      1 |   5000 |   209   (3)|    100K|00:00:00.23 |    1681 |       |       |          |
|*  5 |   INDEX RANGE SCAN  | T3_I1 |   1000 |      1 |     1   (0)|   1000 |00:00:00.02 |    2114 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( IS NOT NULL)
   2 - access("T2"."ID"="T1"."ID")
   3 - filter(MOD("T1"."N1",100)=0)
   5 - access("T3"."ID"=:B1)

The plan hasn’t changed!

Clearly the shape of the plan hasn’t changed, the numbers for Starts and A-rows haven’t changed, the Buffers haven’t changed, the Time hasn’t changed – in fact the session stats for the two queries were virtually identical. Subquery pushing has clearly NOT taken place. But take a look at the E-rows and Cost: operation 4 in the “pushed” plan reports E-Rows = 5,000 which is the classic 5% for an existence subquery when compared with the E-rows = 100K in the first plan; the cost of the hash join is slightly smaller, and the cost of the whole query has halved – but the run-time engine is doing the same amount of work and following the same plan. The optimizer seems to have pushed the arithmetic, without pushing the subquery!

I could force subquery pushing to take place if I reversed the join order – and all I have to do is change the main hint to /*+ leading (t2 t1) use_hash(t1) no_swap_join_inputs(t1) */ to see this happen; here’s the resulting plan:


------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |   424 (100)|   1000 |00:00:02.02 |     104K|       |       |          |
|*  1 |  HASH JOIN         |       |      1 |   1000 |   423   (5)|   1000 |00:00:02.02 |     104K|  5984K|  2337K| 5601K (0)|
|*  2 |   TABLE ACCESS FULL| T2    |      1 |   5000 |   209   (3)|    100K|00:00:01.31 |     102K|       |       |          |
|*  3 |    INDEX RANGE SCAN| T3_I1 |    100K|      1 |     1   (0)|    100K|00:00:00.58 |     101K|       |       |          |
|*  4 |   TABLE ACCESS FULL| T1    |      1 |   1000 |   214   (6)|   1000 |00:00:00.03 |    1681 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter( IS NOT NULL)
   3 - access("T3"."ID"=:B1)
   4 - filter(MOD("T1"."N1",100)=0)

You can see (as I implied earlier on) that it was a bad idea to push the subquery with this data set; the subquery has now run 100,000 times adding an extra 1.08 seconds of CPU to the run-time activity; but I’m only trying to establish a principle, so I’m not worried about that. Perhaps, having got subquery pushing in this plan, I could change that no_swap_join_inputs(t1) hint to a swap_join_inputs(t1) to see the plan I want with lines 2 and 3 below line 4 – and here’s what I get when I do:


------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |   424 (100)|   1000 |00:00:01.97 |     104K|       |       |          |
|*  1 |  HASH JOIN         |       |      1 |   1000 |   423   (5)|   1000 |00:00:01.97 |     104K|  1888K|  1888K| 1499K (0)|
|*  2 |   TABLE ACCESS FULL| T1    |      1 |   1000 |   214   (6)|   1000 |00:00:00.02 |    1614 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T2    |      1 |   5000 |   209   (3)|    100K|00:00:01.28 |     103K|       |       |          |
|*  4 |    INDEX RANGE SCAN| T3_I1 |    100K|      1 |     1   (0)|    100K|00:00:00.56 |     101K|       |       |          |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter(MOD("T1"."N1",100)=0)
   3 - filter( IS NOT NULL)
   4 - access("T3"."ID"=:B1)

So we can get where we want to be by starting backwards and reversing the join order! You might notice, by the way, that in the last two plans the optimizer “thinks” it will have to run the subquery 5,000 (or possibly 100,000) times, but the cost of the query is still less than the initial case where the optimizer thought it would have to run the subquery just 1,000 times. (You can see these numbers by looking at the E-rows that feed the filter operation.)

Summary

In this particular case it doesn’t make sense to force the plan I’ve managed to achieve – when filter subqueries are involved the patterns in the data can make a huge difference to performance – but in demonstrating that I can get to a plan that I want I’ve had to work through the option of starting with the wrong join order and then swapping sides on the hash join, and I’ve demonstrated in passing that there is a curious costing anomaly that could affect the optimizer’s choice in more complex executions plans.

Reference script: filter_hash.sql

November 5, 2015

Column Groups

Filed under: CBO,extended stats,Oracle,Statistics — Jonathan Lewis @ 6:48 am BST Nov 5,2015

I think the “column group” variant of extended stats can be amazingly useful in helping the optimizer to generate good execution plans because of the way they supply better details about cardinality; unfortunately we’ve already seen a few cases (don’t forget to check the updates and comments) where the feature is disabled, and another example of this appeared on OTN very recently.

Modifying the example from OTN to make a more convincing demonstration of the issue, here’s some SQL to prepare a demonstration:


create table t1 ( col1 number, col2 number, col3 date);

insert  into t1
select 1 ,1 ,to_date('03-Nov-2015') from dual
union all
select 1, 2, to_date('03-Nov-2015')  from dual
union all
select 1, 1, to_date('03-Nov-2015')  from dual
union all
select 2, 2, to_date('03-Nov-2015')  from dual
union all   
select 1 ,1 ,null  from dual
union all  
select 1, 1, null  from dual
union all  
select 1, 1, null  from dual
union all
select 1 ,1 ,to_date('04-Nov-2015')  from dual
union all  
select 1, 1, to_date('04-Nov-2015')  from dual
union all  
select 1, 1, to_date('04-Nov-2015')  from dual
;

begin
        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'T1',
                method_opt      => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'T1',
                method_opt      => 'for columns (col1, col2, col3)'
        );
end;
/

I’ve collected stats in a slightly unusual fashion because I want to make it clear that I’ve got “ordinary” stats on the table, with a histogram on the column group (col1, col2, col3). You’ll notice that this combination is a bit special – of the 10 rows in the table there are three with the values (1,1,null) and three with the values (1,1,’04-Nov-2015′), so some very clear skew to the data which results in Oracle gathering a frequency histogram on the table.

These two combinations are remarkably similar, so what happens when we execute a query to find them – since there are no indexes the plan will be a tablescan, but what will we see as the cardinality estimate ? Surely it will be the same for both combinations:


select  count(*)
from    t1
where
        col1 = 1
and     col2 = 1
and     col3 = '04-Nov-2015'
;

select  count(*)
from    t1
where
        col1 = 1
and     col2 = 1
and     col3 is null

Brief pause for thought …

and here are the execution plans, including predicate section – in the same order (from 11.2.0.4 and 12.1.0.2):


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    12 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |    36 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("COL1"=1 AND "COL2"=1 AND "COL3"=TO_DATE(' 2015-11-04
              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    12 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     1 |    12 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("COL3" IS NULL AND "COL1"=1 AND "COL2"=1)

The predictions are different – the optimizer has used the histogram on the column group for the query with “col3 = to_date()”, but not for the query with “col3 is null”. That’s a bit of a shame really because there are bound to be cases where some queries would benefit enormously from having a column group used even when some of its columns are subject to “is null” tests.

Analysis

The demonstration above isn’t sufficient to prove the point, of course; it merely shows an example of a suspiciously bad estimate. Here are a few supporting details – first we show that both the NULL and the ’04-Nov-2015′ combinations do appear in the histogram. We do this by checking the column stats, the histogram stats, and the values that would be produced by the hashing function for the critical combinations:


set null "n/a"

select distinct
        col3,
        mod(sys_op_combined_hash(col1, col2, col3), 9999999999)
from    t1
where
        col3 is null
or      col3 = to_date('04-Nov-2015')
order by
        2
;

column endpoint_actual_value format a40
column column_name           format a32
column num_buckets           heading "Buckets"

select
        column_name,
        num_nulls, num_distinct, density,
        histogram, num_buckets
from
        user_tab_cols
where
        table_name = 'T1'

break on column_name skip 1

select
        column_name,
        endpoint_number, endpoint_value,
        endpoint_actual_value -- , endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
and     column_name not like 'COL%'
order by
        table_name, column_name, endpoint_number
;

(For an explanation of the sys_op_combined_hash() function see this URL).

Here’s the output from the three queries:


COL3      MOD(SYS_OP_COMBINED_HASH(COL1,COL2,COL3),9999999999)
--------- ----------------------------------------------------
04-NOV-15                                           5347969765
n/a                                                 9928298503

COLUMN_NAME                       NUM_NULLS NUM_DISTINCT    DENSITY HISTOGRAM          Buckets
-------------------------------- ---------- ------------ ---------- --------------- ----------
COL1                                      0            2         .5 NONE                     1
COL2                                      0            2         .5 NONE                     1
COL3                                      3            2         .5 NONE                     1
SYS_STU2IZIKAO#T0YCS1GYYTTOGYE            0            5        .05 FREQUENCY                5


COLUMN_NAME                      ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
-------------------------------- --------------- -------------- ----------------------------------------
SYS_STU2IZIKAO#T0YCS1GYYTTOGYE                 1      465354344
                                               4     5347969765
                                               6     6892803587
                                               7     9853220028
                                              10     9928298503

As you can see, there’s a histogram only on the combination and Oracle has found 5 distinct values for the combination. At endpoint 4 you can see the combination that includes 4th Nov 2015 (with the interval 1 – 4 indicating a frequency of 3 rows) and at endpoint 10 you can see the combination that includes the null (again with an interval indicating 3 rows). The stats are perfect to get the job done, and yet the optimizer doesn’t seem to use them.

If we examine the optimizer trace file (event 10053) we can see concrete proof that this is the case when we examine the “Single Table Access Path” sections for the two queries – here’s a very short extract from each trace file, the first for the query with “col3 = to_date()”, the second for “col3 is null”:


ColGroup (#1, VC) SYS_STU2IZIKAO#T0YCS1GYYTTOGYE
  Col#: 1 2 3    CorStregth: 1.60
ColGroup Usage:: PredCnt: 3  Matches Full: #1  Partial:  Sel: 0.3000


ColGroup (#1, VC) SYS_STU2IZIKAO#T0YCS1GYYTTOGYE
  Col#: 1 2 3    CorStregth: 1.60
ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:

Apparently “col3 is null” is not a predicate!

The column group can be used only if you have equality predicates on all the columns. This is a little sad – the only time that the sys_op_combined_hash() will return a null is (I think) when all its input are null, so there is one very special case for null handling with column groups – and even then the num_nulls for the column group would tell the optimizer what it needed to know. As it is, we have exactly the information we need to get a good cardinality estimate for the second query, but the optimizer is not going to use it.

Summary

If you create a column group to help the optimizer with cardinality calculations it will not be used for queries where any of the underlying columns is used in an “is null” predicate. This is coded into the optimizer, it doesn’t appear to be an accident.

Reference script: column_group_null.sql

November 3, 2015

Nul points

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 8:16 am BST Nov 3,2015

(To understand the title, see this Wikipedia entry)

The title could also be: “Do as I say, don’t do as I do”, because I want to remind you of an error that I regularly commit in my demonstrations. Here’s an example:

 
SQL> create table t (n number); 

Table created 

Have you spotted the error yet ? Perhaps this will help:

SQL> insert into t select 1 - 1/3 * 3 from dual; 

1 row created. 

SQL> insert into t select 1 - 3 * 1/3 from dual; 

1 row created. 

SQL> column n format 9.99999999999999999999999999999999999999999 
SQL> select * from t; 

                                           N
--------------------------------------------
  .00000000000000000000000000000000000000010
  .00000000000000000000000000000000000000000

2 rows selected. 

Spotted the error yet ? If not then perhaps this will help:

SQL> select * from dual where 3 * 1/3 = 1/3 * 3;

no rows selected 

SQL> select * from dual where 3 * (1/3) = (1/3) * 3; 

D
-
X

1 row selected. 

Computers work in binary, people (tend to) work in decimal. 10 = 2 * 5, and 5 (more precisely, dividing by 5) is something that a computer cannot do accurately. So when you do arbitrary arithmetic you should use some method to deal with tiny rounding errors.

In Oracle this means you ought to define all numbers with a precision and scale. Look on it as another form of constraint that helps to ensure the correctness of your data as well as improving performance and reducing wasted storage space.

November 2, 2015

Clustering_factor

Filed under: Indexing,Oracle,RAC,Statistics — Jonathan Lewis @ 10:27 am BST Nov 2,2015

I had a recent conversation at Oracle OpenWorld 2015 about a locking anomaly in a 3-node RAC system which was causing unexpected deadlocks. Coincidentally, this conversation came about shortly after I had been listening to Martin Widlake talking about using the procedure dbms_stats.set_table_prefs() to adjust the way that Oracle calculates the clustering_factor for indexes. The juxtaposition of these two topics made me realise that the advice I had given in “Cost Based Oracle – Fundamentals” 10 years ago was (probably) incomplete, and needed some verification. The sticking point was RAC.

In my original comments about setting the “table_cached_blocks” preference (as it is now known) I has pointed out that the effect of ASSM (with its bitmap space management blocks) was to introduce a small amount of random scattering as rows were inserted by concurrent sessions and this would adversely affect the clustering_factor of any indexes on the table, so a reasonable default value for the table_cached_blocks parameter would be 16.

I had overlooked the fact that in RAC each instance tries to acquire ownership of its own level 1 (L1) bitmap block in an attempt to minimise the amount of global cache contention.  If each instance uses a different L1 bitmap block to allocate data blocks then (for tables and their partitions) they won’t be using the same data blocks for inserts, and they won’t even have to pass the bitmap blocks between instances when searching for free space. The consequence of this, though, is that if N separate instances are inserting data into a single table there are typically 16 * N different blocks into which sessions could be inserting concurrently, so the “most recent” data could be scattered across 16N blocks, which means the appropriate value table_cached_blocks is 16N.

To demonstrate the effect of RAC and multiple L1 blocks, here’s a little demonstration code from a 12c RAC database with 3 active instances.


create tablespace test_8k_assm_auto
datafile size 67108864
logging online permanent
blocksize 8192
extent management local autoallocate default
nocompress segment space management auto
;

create table t1 (n1 number, c1 char(1000)) storage (initial 8M next 8M);

The code above simply creates a tablespace using locally managed extents with system allocated extent sizes, then creates a table in that tablespace with a starting requirement of 8MB. Without this specification of initial the first few extents for the table would have been 64KB thanks to the system allocation algorithm, and that would have spoiled the demonstration because the table would have started by allocating a single extent of 64KB, with just one L1 bitmap block; slightly different effects would also have appeared with an extent size of 1MB – with 2 L1 bitmap blocks – which is the second possible extent size for system allocation.

Having created the table I connected one session to each of the three instances and inserted one row, with commit, from each instance. Then I ran a simple SQL statement to show me the file and block numbers of the rows inserted:


select
        dbms_rowid.rowid_relative_fno(rowid)    file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_in_block
from
        t1
group by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
order by
        dbms_rowid.rowid_relative_fno(rowid),
        dbms_rowid.rowid_block_number(rowid)
;


   FILE_NO   BLOCK_NO ROWS_IN_BLOCK
---------- ---------- -------------
        19        518             1
        19        745             1
        19       2157             1

As you can see, each row has gone into a separate block – more significantly, though, those blocks are a long way apart from each other – they are in completely different sets of 16 block – each instance is working with its own L1 block (there are 16 of them to choose from in an 8MB extent), and has formatted 16 blocks associated with that L1 for its own use.

In fact this simple test highlighted an anomaly that I need to investigate further. In my first test, after inserting just 3 rows into the table I found that Oracle had formatted 288 blocks (18 groups of 16) across 2 extents, far more than seems reasonable. The effect looks hugely wasteful, but that’s mainly because I’ve implied that I have a “large” table into which I’ve then inserted very little data – nevertheless something a little odd has happened. In my second test it got worse because Oracle formatted 16 blocks on the first insert,  took that up to 288 blocks on the second insert, then went up to 816 blocks (using a third extent) on the third insert; then in my third test Oracle behaved as I had assumed it ought to, formatting 3 chunks of 16 blocks each in a single extent – but that might have been because I did a truncate rather than a drop and recreate.

Summary

Whatever else is going on, the key point of this note is that if you’re trying to get Oracle to give you a better estimate for the clustering_factor in a RAC system then “16 * instance-count” is probably a good starting point for setting the table preference known as table_cached_blocks.

The anomaly of data being scattered extremely widely with more extents being allocated than you might expect is probably a boundary condition that you don’t have to worry about – until I’ve had time to look at it a little more closely.

 

October 23, 2015

Histogram Limit

Filed under: Bugs,Histograms,Oracle,Statistics — Jonathan Lewis @ 8:03 pm BST Oct 23,2015

A surprising question came up on OTN a couple of days ago:

Why does a query for “column = 999999999999999999” run slower than a query for “column > 999999999999999998” (that’s 18 digit numbers, if you don’t want to count them). With the equality predicate the query is very slow, with the range-based predicate perfomance is good.

In the absence of further information there are various reasons why this is possible – but the example in question was about a “versioning” table where the single very large value was used as the “not yet ended” value for the history of an id so, at a minimum, the table held columns (id, nstart, nend, other), and each id could appear many times with pairs of start and end values that supplied non-overlapping, covering ranges and one row that had the very large number as the end value.

Let’s jot down a few ideas about what the data (and stats) might look like.

Assuming every id appears “a few” time and every id has to have a “still valid” row this means that a very large fraction (say 10% to 25%, if “a few” means 4 to 9) of the rows hold the value 999999999999999999.

If you gather stats without a histogram then you should get the low and high, finding that the high is 999999999999999999 and that the range is enormous, and so the predicates “column = {high value}” and “column > {high value} – 1” should give very similar cardinalities.

If you collect stats with a histogram you should find the very popular high value even in a very small sample set (which is what happens with histogram collection in 11g, and even in 12c for hybrid histograms). In this case the histogram should spot the significance of the high value and again the two predicates should have very similar cardinalities.

At first sight there doesn’t seem to be a feasible way that the two cardinalities could be sufficiently different to cause a problem – so maybe there’s something about character conversion or maybe bind variable usage that hasn’t been mentioned. So to test a couple of the less likely ideas I built a data set using 11.2.0.4 – and found a bug:


create table t1 (
        id              number(18),
        nend            number(18),
        n1              number(18),
        small_vc        varchar2(10)
)
nologging
;

insert /*+ append */ into t1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum -1               id,
        case mod(rownum - 1,4)
                when 0  then 999999999999999999
--              when 0  then 999999999999999
                        else mod(rownum - 1, 250000)
        end                     nend,
        rownum - 1              n1,
        rpad('x',10,'x')        small_vc
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 ; commit; select * from t1 where nend = 0; begin dbms_stats.gather_table_stats( ownname => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size auto'
        );
end;
/

I have an nend column that is set to 999999999999999999 every 4th row in the table and otherwise has 4 rows per value for 187,500 other (relatively low) values. It’s probably a reasonable initial model of the original data. I’ve run a query with a predicate referencing nend before gathering stats so that the (default) auto option will build a histogram for nend. Then I’ve checked the execution plans for two critical queries:


explain plan for
select  *
from    t1
where   nend = 999999999999999999
;

select * from table(dbms_xplan.display);

explain plan for
select  *
from    t1
where   nend > 999999999999999998
;

select * from table(dbms_xplan.display);

And here’s the surprise – the two plans, in order:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     4 |   108 |   625   (9)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| T1   |     4 |   108 |   625   (9)| 00:00:04 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NEND"=999999999999999999)


--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   250K|  6591K|   627   (9)| 00:00:04 |
|*  1 |  TABLE ACCESS FULL| T1   |   250K|  6591K|   627   (9)| 00:00:04 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NEND">999999999999999998)


Clearly this should not happen – the equality test is way off, the inequality test is correct. The obvious first guess is that something funny has happened with the statistics so let’s see what they look like – the column stats (user_tab_cols) and the histogram stats (user_tab_histograms) seem like a good starting point:


select
        sample_size, num_distinct, histogram, num_buckets,
        substr(low_value,1,26) low_value, substr(high_value,1,26) high_value
from
        user_tab_cols
where
        table_name = 'T1'
and     column_name = 'NEND'
;

column endpoint_value format 999,999,999,999,999,999,999

select
        endpoint_number, endpoint_value -- , endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
and     column_name = 'NEND'
order by
        endpoint_number
;

And here are the results (with a couple of hundred uninteresting rows eliminated from the histogram):


      Sample     Distinct HISTOGRAM          Buckets LOW_VALUE                  HIGH_VALUE
------------ ------------ --------------- ---------- -------------------------- --------------------------
       5,541      190,176 HEIGHT BALANCED        254 C102                       C9646464646464646464


ENDPOINT_NUMBER               ENDPOINT_VALUE
--------------- ----------------------------
              0                           19
              1                        1,225
              2                        2,503
              3                        3,911
              4                        4,806
...
            188                      247,479
            189                      248,754
            190                      249,862
            254    1,000,000,000,000,000,000

Oracle will have started with a 100% sample to collect stats on all the columns, but taken a small sample to test the need for a histogram on the nend column – and that’s why the sample size of 5,541 has appeared, but that’s not relevant to the problem in hand. The big question comes from endpoint_number 254 – why is the highest value in the histogram 1e19 when we know (and the column stats show) that the highest value is actually 999999999999999999 ?!

It’s a question to which I don’t have an answer – but I do know that

  • if your high value is 15 digits long (all 9s) then the histogram shows the right high value
  • if your high value is more that 15 9s then the histogram shows the high value plus 1
  • the value collected by the query that Oracle runs is the actual value (i.e. 18 9s)
  • if you use set_column stats to set 18 9s as the high value you still get 1e19 in the histogram

Once you see the stats you can understand why the OP sees the odd performance problem. If the histogram identifies 1e19 as a (very) popular value, leaving 999999999999999 as an “average” value with only 4 rows; on the other hand the query for greater than 999999999999999998 can see that there really are 250K rows with higher values.

Footnote:

Interestingly 12c does the same with the stats – introducing the 1e19 in the histogram – but still manages, somehow, to calculate the correct cardinality in the equality case. (There is one slight difference in 12c, the histogram is a hybrid histogram, not a height-balanced histogram).

Update 24th Oct 2015:

I realised late last night that I had written about this behaviour before – though possible just as a response on OTN or in some unpublished notes. According to the notes in a script called histogram_numeric_bug.sql that I found on my laptop this is “Bug 18514507 : WRONG CARDINALITY ESTIMATES WHEN NUMERIC VALUE IS LONGER THAN 15 BYTES”. (except that the base bugs that that one is linked to are probably not the same bug).

I suspect the problem is related to the way that character histograms are built based on a numeric representation of the string that takes the first few characters of the string, treats that resulting N bytes as a hex number, converts to decimal and then applies round(N,-14) to restrict the precision stored. The effect with strings is that (broadly speaking) you get fifteen digits precision – which is exactly what I seem to be seeing with numbers.

The reason that 12c can get the right answers despite storing the wrong endpoint_value is that it’s also storing to_char() of the right value as the endpoint_actual_value – possibly doing this any time it has had to round the endpoint_value (for character strings the endpoint_actual_value was only populated if two entries in the endpoint_value were the same).

 

 

 

 

October 12, 2015

Read Consistency

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 7:25 pm BST Oct 12,2015

I posted a note a few days ago about read consistency, the Cross Session PL/SQL Function Result Cache, deterministic functions, and scalar subqueries. The intent of the article was to make clear the point that while you might think that declaring a PL/SQL function to be deterministic or in the PL/SQL Result Cache might make a query that calls the function perform faster, if that function contained its own SQL statement then your code might not be producing self-consistent results and (even worse) if you had used the Result Cache option your code might actually cause other session to get wrong results if you tried to “set transaction read only” or “alter session set isolation_level = serializable”

It occurred to me a couple of days after writing that article that perhaps there’s a much more basic point I ought to make as well. I’ll make it about PL/SQL but it’s something that applies across all programming languages – it merits being italicised and emphasised, and possibly even the addition of 5 exclamation marks:

Any time you execute more than one SQL statement in a PL/SQL procedure the results of executing that procedure may not be self-consistent unless you have explicitly locked your session SCN to a fixed value!!!!!

Similarly: if you run a report from SQL*Plus which goes: “select this, select that, select the total of the other” – your report may not be self-consistent; if you have a web-application which goes: “select the customer balance, select the customer orders outstanding, select the customer credit notes” – your screenful may not be self-consistent.

By default Oracle operates at statement-level read-consistency for SQL. In a multi-user system if you run two SQL statements and someone else changes the data and commits between your two statement executions then the result from the second statement may not be consistent with the result from the first. Here’s a sillly little bit of code you can use to demonstrate the concept:

set serveroutput on

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 
) 
select 
        rownum id,    
        rownum n1,
        rownum n2,
        rpad('x',10,'x')  small_vc,
        rpad('x',100,'x') padding 
from 
        generator v1 
; 


create table t2 ( 
        id, 
        v1, 
        constraint t2_pk primary key(id) 
) 
organization index 
as 
select 
        cast(1 as number (6,0)), 
        cast('one' as varchar2(10)) 
from dual 
;
 
begin 
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

/*
   ----------------------------------------------------
   Table t3 only relevant to "serializable" experiments
   ----------------------------------------------------

create table t3 (
        id      number,
        v1      varchar2(10),
        padding varchar2(100) default rpad('x',100,'x')
)
initrans 4
;

alter table t3 add constraint t3_pk primary key(id) using index(
        create unique index t3_pk on t3(id) initrans 4
)
;

insert into t3 values (0,'0',rpad('x',100,'x'));
commit;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T3',
                method_opt       => 'for all columns size 1'
        );
end;
/

   ----------------------------------------------------
   End of optional t3 definition
   ----------------------------------------------------
*/

prompt  ===========================================================
prompt  Four second pause - only needed to avoid flashback problems
prompt  ===========================================================

execute dbms_lock.sleep(4)

-- ------------------------------------------------------------------------------
-- Options for "fixing" the SCN - plenty of scope for ORA-08177 when serializable
-- ------------------------------------------------------------------------------

-- set transaction read only;
-- execute dbms_flashback.enable_at_time(systimestamp)
-- execute dbms_flashback.enable_at_system_change_number(dbms_flashback.get_system_change_number)
-- alter session set isolation_level = serializable;

prompt  =====================
prompt  Starting PL/SQL block
prompt  =====================

declare
        m_v1    varchar2(10);
begin
        for r in (select * from t1 where rownum <= 10) loop

                select  v1
                into    m_v1
                from    t2
                where   t2.id = r.n2 - r.n1 + 1
                ;

--              insert into t3 values(r.id, m_v1, rpad('x',100));

                dbms_output.put_line(r.id || ' - ' || m_v1);
                dbms_lock.sleep(1);

        end loop;

end;
/

execute dbms_flashback.disable;
commit;

--
--      Code to be run from a second session once the 
--      first session displays the "loop running" banner
--

begin
        for r in 1..20 loop
                update t2 set v1 = lpad(r,10,0);
                commit;
                dbms_lock.sleep(0.5);
        end loop;
end;
.

The code covers several different tests – the basic test and the three “read-only” tests don’t need table t3 at all, which exists only so that we have something to insert into (and show a possible ORA-08177) for the serializable test. If you do the serializable test you might want to note what happens if you eliminate the initial insert of id = 0; you may also want to increase the number of rows selected, the number of cycles through the update loop, and the sleep times.

In principle the code emulates a classic “nested loop join done in SQL”, with one row in table t2, and a CURSOR FOR LOOP select from t1 being used to drive a select from t2 “for each row”. The peculiar predicate “t2.id = r.n2 – r.n1 + 1” inside the loop is an attempt to minimise the risk of any future release of the PL/SQL optimizer getting so clever that it puts the “constant” query for “t2.id = 1” outside the loop.

As the main loop cycles once per second, selecting the same row from t2 on each cycle, another session updates the row and commits twice every second. You might have thought that the main loop would keep returning the same value on every cycle – but each select starts at a new SCN. If you don’t take some sort of defensive action (set transaction read only, dbms_flashback) then you’ll get output similar to the following:


=================
Four second pause
=================

PL/SQL procedure successfully completed.

=====================
Starting PL/SQL block
=====================
1 - one
2 - 0000000002
3 - 0000000004
4 - 0000000006
5 - 0000000008
6 - 0000000010
7 - 0000000012
8 - 0000000014
9 - 0000000016
10 - 0000000018

PL/SQL procedure successfully completed.

I have seen many sites where this type of code is used. Typically it’s justified because the referenced table is a “very static” code table or a table that’s not supposed to change while the main loop is running – but eventually Murphy’s law will raise its ugly head.

(My favourite invocation of Murphy’s law came from my A-level physics teacher with respect to experimentation: “Constants aren’t and variables don’t”.)

October 9, 2015

PL/SQL Functions

Filed under: 12c,Oracle,Performance — Jonathan Lewis @ 6:17 pm BST Oct 9,2015

Assuming everything else has been tuned to perfection, what’s the best you can do while calling PL/SQL functions from SQL ? Here’s a little code to create a table with some data, and a function that we can use to start an investigation:


create table t1
nologging
pctfree 0
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1,1001))        n1
from
        generator       v1,
        generator       v2
where
        rownum <= 3e5
;

create or replace function func_normal(i_in number)
return number
as
begin
        return round(exp(ln(i_in)));
end;
/

That’s 300,000 rows in the table and a silly little function to use up some CPU to get nowhere. There are 1,000 distinct integer values scattered uniformly through the table, and the function returns the value it’s called with – but it does it a hard way.

Here’s some test code – with “set timing on” – followed by the results:


select count(distinct(n1)) from t1;

select 
        count(distinct(func_normal(n1)))
from    t1
;



COUNT(DISTINCT(N1))
-------------------
               1000

1 row selected.

Elapsed: 00:00:00.03

COUNT(DISTINCT(FUNC_NORMAL(N1)))
--------------------------------
                            1000

1 row selected.

Elapsed: 00:00:11.39

This is running on 12.1.0.2 on a machine with a CPU speed of aboaut 3 MHz – you may want to adjust the number of rows in the table for your own testing.

The question is, what options can we use to improve the efficiency of the calls to the PL/SQL. One option, of course, is to use the new 12c “with PL/SQL” clause – we could embed the function in the SQL like this:


with
        function func_with (
                i_in    number
        ) return number
        is
        begin
                return round(exp(ln(i_in)));
        end;
select
        count(distinct(func_with(n1)))
from    t1
/

COUNT(DISTINCT(FUNC_WITH(N1)))
------------------------------
                          1000

1 row selected.

Elapsed: 00:00:09.77

So, for the cost of copying the function into the SQL we get a 10% improvement in performance – which we could reasonably attribute to an improved efficiency in the call mechanism. There are arguments for and against copying code like this, of course, and my improvement was only 1 second for 300,000 calls, but you may decide that the benefit of the “With-PL/SQL” method is sufficient to justify the effort.

If you don’t want to copy a PL/SQL function into the SQL, though, there is another alternative – the pragma UDF (for user-defined function) – which has the effect of reducing the cost of the so-called “context switch” as you call from SQL to PL/SQL or vice versa.

Technically I think what’s happening is that the stack formats for SQL and PL/SQL are different and the context switch is the work needed to reformat the stack as you pass from one environment to the other – by declaring the function as UDF you probably set it up to deal with the incoming SQL stack itself.

Here’s how to add the UDF pragma to the function, and the results showing the effects:


create or replace function func_udf(i_in number)
return number
as
pragma UDF;
begin
        return round(exp(ln(i_in)));
end;
/

COUNT(DISTINCT(FUNC_UDF(N1)))
-----------------------------
                         1000

1 row selected.

Elapsed: 00:00:09.55

With pragma UDF the standalone function call is slightly faster than the in-line “WITH” function. I did wonder whether the pragma UDF would make the function call slower if I simply called it from a loop in a PL/SQL block, but there didn’t seem to be any significant difference between the normal function and the UDF function.

Initially, then, it looks like UDF is faster than WITH, which is faster than basic; but there are other considerations. My sample data has only 1,000 possible inputs to the function – and Oracle has three different methods for caching that I might be able to benefit from:

  • Declaring the function as deterministic
  • Putting the function into the PL/SQL result cache
  • Modifying the SQL to take advantage of scalar subquery caching

Here’s what the function declaration looks like if I want to use the PL/SQL function cache:


create or replace function func_cached(i_in number)
return number
result_cache
as
begin
        return round(exp(ln(i_in)));
end;
/

Changing my query to use func_cached() the query completed in 1.65 seconds – a clear winner, but can anything else get close.

To make the functions deterministic, I just have to add the word “deterministic” after the declaration of the return type:


create or replace function func_normal(i_in number) 
return number
deterministic
as
begin
        return round(exp(ln(i_in)));
end;
/

We can also add the deterministic keyword to the function defined in the WITH clause. Before reporting the results of testing the functions with determinism, there’s one more strategy to consider. Remove the deterministic key word from the functions, and introduce a scalar subquery to the test query, e.g.:


select
        count(distinct(select func_normal(n1) from dual))
from    t1

;

Here’s a table of results:

Method Initial Deterministic Scalar Subquery
Basic 11.39 4.30 4.32
With 9.77 9.72 3.60
UDF 9.55 9.57 3.57
Cached 1.65 0.71 0.72

Summary

Before stating any conclusions it’s worth remembering that the sample data is based on a relatively small number of distinct input values. It is the repetition that allows us to benefit from things like caching and determinism. On top of that we need to consider the scale of the time-saving in light of the number of calls made (or not made).

Having said that, the PL/SQL function cache is clearly the thing that gives us the most benefit in return for a simple implementation. We should remember that the cache is “public” – i.e. stored in the SGA – and each input value takes up another bit of the public result cache: on the plus side this means that everyone else calling this function gets the benefit of our prior calculation; on the minus side this means if we use the result cache for the wrong function then we could take up a lot of space in the cache for very little benefit. Since the relevant result cache latch has not child latches it’s also possible to end up with latch contention if the too many sessions are taking advantage of the result cache too frequently.

Although we might expect a deterministic function to give us a big benefit (in the “very repetitive inputs” case), we find that the deterministic keyword has no effect (as at 12.1.0.2) in functions declared in a WITH clause or declared as standalone with pragma UDF. Fortunately scalar subquery caching (which seems to use the same hashing algorithm as the deterministic caching algorithm) still works with WITH functions or UDF functions and (slightly surprisingly) standalone functions declared with pragma UDF seem to have a very small edge over WITH functions.

Both of the latter two approaches use a local cache with a strictly limited size. The benefit is that the locality means they won’t interfere with other sessions or end up hogging an extreme amount of a public memory area; the drawback is that the size limit (which can be adjusted with a hidden parameter) means that you can get unlucky with hash collisions and end up with extremely variable performance from day to day because of a small change in the data being processed, or even a change in the order in which an unchanged data set is processed.

Footnote

I’ll leave you with one thought. Without declaring the original function as deterministic, I ran the following query to maximise the benefit from the scalar subquery caching algorithm:


select
        count(distinct(select func_normal(n1) from dual))
from    (
        select /*+ no_eliminate_oby */  n1 from t1 order by n1
        )
;

The query completed in 0.13 seconds: the subquery was called 1,000 times (once for each distinct value – see this ancient URL), and the benefit of eliminating the function calls outweighed the cost of having to sort the data set.

Ref: 12c_function_options.sql

September 30, 2015

Estimate_percent

Filed under: CBO,Histograms,Oracle,Statistics — Jonathan Lewis @ 5:10 pm BST Sep 30,2015

Here’s a live one from OTN – here are a couple of extracts from the problem statement:

We’re experiencing an issue where it seems that the query plan changes from day to day for a particular procedure that runs once a night.
It’s resulting in a performance variance of 10 second completion time vs 20 minutes (nothing in between).
It started occurring about 2 months ago and now it’s becoming more prevalent where the bad query plan is coming up more often.
I noticed that the query plans vary for a simple query.
We do run gather statistics every night. (DBMS_STATS.GATHER_SCHEMA_STATS (ownname=>sys_context( ‘userenv’, ‘current_schema’ ), estimate_percent => 1);)

The query and two execution plans look like this:

select count(*) from cs_bucket_member_v2 where bucket_type='P' and sec_id > 0 and order_id=0;

----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                     |     1 |    12 |   155   (0)| 00:00:02 |
|   1 |  SORT AGGREGATE              |                     |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID| CS_BUCKET_MEMBER_V2 |  1148 | 13776 |   155   (0)| 00:00:02 |
|*  3 |    INDEX RANGE SCAN          | CS_BUCKET_MEMBER_N1 |  1272 |       |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("BUCKET_TYPE"='P' AND "SEC_ID">0)
   3 - access("ORDER_ID"=0)


------------------------------------------------------------------------------------------
| Id  | Operation          | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                     |     1 |    12 | 11215   (2)| 00:01:41 |
|   1 |  SORT AGGREGATE    |                     |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS FULL| CS_BUCKET_MEMBER_V2 |  1522K|    17M| 11215   (2)| 00:01:41 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ORDER_ID"=0 AND "SEC_ID">0 AND "BUCKET_TYPE"='P')

There are a couple of bits of information that would be useful – such as the database version, the number of rows in the table, the number of distinct values in each column, and whether any of the columns have histograms – but there are a couple of reasonable guesses that we might make about the problem. Notice particularly that the number of rows estimated from the index ranges scan is 1272 and only a small volume is then eliminated by the table filter predicates on sec_id and bucket_type. This suggests that the optimizer has information that tells it that most of the rows in the table have sec_id > 0 and bucket_type = ‘P’, and you might note that that suggests that there’s a histogram on bucket_type.

Rather than stating the most obvious guesses about the problem, though, I’ll start by creating a data set and emulating the problem, starting from an empty schema on 11.2.0.4:

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual
        connect by 
                level <= 1e4
)
select
        rownum                  sec_id,
        case
                when mod(rownum,1000) = 0
                        then 'X'
                        else 'P'
        end                     bucket_type,
        case
                when rownum < 1e6 - 50000 
                        then mod(rownum-1,1e5)
                        else 1000
        end                     order_id,
        lpad(rownum,10,'0')     id_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6

create index t1_i1 on t1(order_id) nologging; 

select count(*) from t1 where order_id = 1000 and bucket_type = 'P' and sec_id > 1000;

The column names in the table match those needed by the query, and the bucket_p column has a very skewed distribution that will eliminate very little data; the sec_id column is also not going to eliminate data, but it’s very evenly distributed with no large gaps so not a good candidate for a histogram in any case. The order_id has 50,000 rows out of 1,000,000 (5%) set of a single value, and most of those special rows are at the end of the table – it’s a pretty good candidate for a histogram (if Oracle spots it, and if we actually write queries to access that data).

I’ve run a query that references all three columns so that the default method_opt of “for all columns size auto” will apply to them when I gather stats. So here’s the code that gathers stats and checks the resulting execution plans, first for “auto_sample_size” then for the 1% used by the OP:


set autotrace traceonly explain

begin
        dbms_stats.gather_schema_stats(
/*              estimate_percent => 1, */
                ownname          => user
        );
end;
/

select count(*) from t1 where order_id = 1000 and bucket_type = 'P' and sec_id > 1000;

begin
        dbms_stats.gather_schema_stats(
                estimate_percent => 1,
                ownname          => user
        );
end;
/

select count(*) from t1 where order_id = 1000 and bucket_type = 'P' and sec_id > 1000;

set autotrace off

And here are the two plans – in the same order:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    12 |  2333   (4)| 00:00:12 |
|   1 |  SORT AGGREGATE    |      |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   | 51063 |   598K|  2333   (4)| 00:00:12 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ORDER_ID"=1000 AND "SEC_ID">1000 AND "BUCKET_TYPE"='P')


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    12 |    23   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |    12 |            |          |
|*  2 |   TABLE ACCESS BY INDEX ROWID| T1    |    20 |   240 |    23   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_I1 |    20 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):  
---------------------------------------------------
   2 - filter("SEC_ID">1000 AND "BUCKET_TYPE"='P')
   3 - access("ORDER_ID"=1000)


[Update: Following on from a question in the comments, I’ve expanded this section, and wandered a little off-topic]

I don’t know why, but with a 1% sample (which really did sample 10,000 rows) the optimizer didn’t spot the need for a histogram on order_id, but with the auto_sample_size (which sampled 5,500 – yes, half as many rows) the optimizer spotted the need for the histogram. Checking the trace files the only difference visible in the sample SQL was the presence in the 1% sample of the id_vc and padding columns which were not present in the auto_sample_size trace.

According to the manuals when the method_opt is “for all columns size auto”, then

“Oracle determines the columns on which to collect histograms based on data distribution and the workload of the columns.”

There is nothing in the manuals to suggest that there is a deliberate link between the auto_sample_size and estimate_percent, and there is room for ambiguity in how we interpret this bit of text in the manual so the difference in the SQL used and the effects thereof requires (a) some hand-waving, and/or (b) lots more experimentation.  At the moment I’m prepared to go for hand-waving:

Hypothesis 1: auto_sample_size did not sample the id_vc and padding columns because the (100%) sample taken had given Oracle enough information to decide that the data distribution of those columns was not skewed enough to merit further consideration; but it sampled the three columns that had been used in a fashion that might be helped by a histogram. This sampling spotted the benefit of a histogram on order_id and bucket_type but decided that sec_id didn’t need a histogram

Hypothesis 2: the 1% sample got pretty close to the same results in its estimates of number of distinct values for id_vc and padding as the (100%) auto_sample_size, but still decided to do a sampled test for the data distribution (the manual seems to suggest that the histograms will only be considered if there has been some use of the columns in predicates, but doesn’t explicitly preclude the possibility of creating the histogram on the basis of just the data distribution). After doing the 1% sample to analyze the data for suitability of a histogram the result suggested that only the histogram on bucket_type would be beneficial.  (In fact, after the first sample Oracle took a second 1% histogram sample on just the order_id before deciding that it a histogram on order_id wasn’t needed.)

Bottom line on this: I don’t know if the auto_sample_size “accidentally” eliminated a couple of columns from histogram sampling and if a larger fixed sample size (say 50%, or even 100%) might result in Oracle eliminating a few columns from the histogram; or maybe the code path for histogram samples with auto_sample_size in place is actually a different code path. The only thing I can say is that the two sets of events that appeared from my demonstration don’t seem to be entirely self-consistent, but it would probably take most of a day doing experiments to narrow down the variation in behaviour to a few “best guess” ideas of what’s going on behind the scenes – though unwrapping the code might lead to a more accurate answer more quickly.

Moral

Histograms are tricky things – and you can only make things worse in 11g by NOT using the auto_sample_size.

Footnote

Based on previous experience – my “obvious” guess about the OP’s data was that there was a special-case value for order_id, that the rows for that value were fairly well clustered, probably towards the end of the table, and constituted a small percentage of the table, and that the rest of the data reported “a few” rows per value. That’s why I built the model you see above.

« Previous PageNext Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 6,449 other followers