Oracle Scratchpad

January 26, 2016

Trace file size

Filed under: 12c,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 8:30 am GMT Jan 26,2016

Here’s a convenient enhancement for tracing that came up on Twitter a few days ago – first in a tweet that I retweeted, then in a question from Christian Antognini based on this bit of the 12c Oracle documentation (opens in separate tab). The question was – does it work for you ?

The new description for max_dump_file_size says that for large enough values Oracle will split the file into multiple chunks of a few megabytes, using a suffix to identify the sequence of the chunks, keeping only the first chunk and the most recent chunks. Unfortunately this doesn’t seem to be true. However, prompted by Chris’ question I ran a quick query against the full parameter list looking for parameters with the word “trace” in their name:


select
        /*+
                leading(nam val val2)
                full(name)
                full(val)  use_hash(val)  no_swap_join_inputs(val)
                full(val2) use_hash(val2) no_swap_join_inputs(val2)
        */
        nam.ksppinm                             name,
        val.ksppstvl                            ses_val,
        val2.ksppstvl                           sys_val,
        nam.ksppdesc                            description,
        nam.indx+1                              numb,
        nam.ksppity                             type,
        val.ksppstdf                            is_def,
        decode(bitand(nam.ksppiflg/256,1),
                1,'True',
                  'False'
        )                                       ses_mod,
        decode(bitand(nam.ksppiflg/65536,3),
                1,'Immediate',
                2,'Deferred' ,
                3,'Immediate',
                  'False'
        )                                       sys_mod,
        decode(bitand(val.ksppstvf,7),
                1,'Modified',
                4,'System Modified',
                  'False'
        )                                       is_mod,
        decode(bitand(val.ksppstvf,2),
                2,'True',
                  'False'
        )                                       is_adj,
        val.ksppstcmnt                          notes
from
        x$ksppi         nam,
        x$ksppcv        val,
        x$ksppsv        val2
where
        nam.indx = val.indx
and     val2.indx = val.indx
and     ksppinm like '%&m_search.%'
order by
        nam.ksppinm
;

Glancing through the result I spotted a couple of interesting parameters with the letters “uts” in their names, so re-ran my query looking for all the “uts” parameters, getting the following (edited) list:


NAME                           SYS_VAL         DESCRIPTION    
------------------------------ --------------- ---------------------------------------------
_diag_uts_control              0               UTS control parameter
_uts_first_segment_retain      TRUE            Should we retain the first trace segment
_uts_first_segment_size        0               Maximum size (in bytes) of first segments 
_uts_trace_disk_threshold      0               Trace disk threshold parameter
_uts_trace_segment_size        0               Maximum size (in bytes) of a trace segment
_uts_trace_segments            5               Maximum number of trace segments 

Note particularly the “first segment size” and “trace segment size” – defaulting to zero (which often means a hidden internal setting, though that doesn’t seem to be the case here, but maybe that’s what the “diag control” is for). I haven’t investigated all the effects, but after a little experimentation I found that all I needed to do to get the behaviour attributed to max_dump_file_size was to set the following two parameters – which I could do at the session level.


alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

The minimum value for these parameters is the one I’ve shown above (5120 KB) and with the default value for _uts_trace_segments you will get a maximum of 5 trace files with sequential names like the following:

ls -ltr *4901*.trc

-rw-r----- 1 oracle oinstall 5243099 Jan 26 08:15 orcl_ora_4901_1.trc
-rw-r----- 1 oracle oinstall 5243064 Jan 26 08:15 orcl_ora_4901_12.trc
-rw-r----- 1 oracle oinstall 5243058 Jan 26 08:15 orcl_ora_4901_13.trc
-rw-r----- 1 oracle oinstall 5242993 Jan 26 08:15 orcl_ora_4901_14.trc
-rw-r----- 1 oracle oinstall 1363680 Jan 26 08:15 orcl_ora_4901.trc

As you can see I’m currently generating my 15th trace, and Oracle has kept the first one and the previous three. It’s always working on a file with no suffix to its name but as soon as that file hits its limiting size (plus or minus a few bytes) it gets its appropriate suffix, the oldest file is deleted, and a new trace file without a suffix is started.

Apart from the usual header information the trace files start and end with lines like:

*** TRACE CONTINUED FROM FILE /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_11.trc ***
  
*** TRACE SEGMENT RENAMED TO /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_12.trc ***

There is one little trap to watch out for: if you set either of these parameters to be larger than max_dump_file_size tracing stops as soon as one of the segments hits the max_dump_file_size and that trace file ends with the usual “overflow” message – e.g, when I changed the max_dump_file_size to 4M in mid-session:

*** DUMP FILE SIZE IS LIMITED TO 4194304 BYTES ***

In my case I had started with max_dump_file_size set to 20M, so I got lucky with my choice of 5M as the segment size.

Further investigation is left as an exercise to the interested reader.

 

December 21, 2015

Parallel Plans

Filed under: 12c,Oracle,Parallel Execution — Jonathan Lewis @ 8:40 am GMT Dec 21,2015

This article was prompted by a pair of articles by Yasin Baskan of Oracle Corporation: PX Server Sets etc. and Multiple Parallelizers, plus a little extra prompting from a mistake that I made when reading the second of those two articles. The fact that I made a mistake is significant because, without it, I wouldn’t have created a model to check Yasin’s description of the parallel activity.

I want to examine the following query to find out the order of activity:


select
        distinct t1.n1
from
        t1
where 
        t1.n1 >= 500
and     t1.n2 > (select avg(t2.n2) from t2)
and     t1.n3 > (select avg(t3.n3) from t3)
;

I’m going to be using 12.1.0.2 for this demonstration (lots of changes appear in parallel execution as you move from 11g to 12c, so the version is most significant), with all default settings for the parallel-related parameters.

To get things started, here’s the code to generate the data that I used, and the serial execution plan:


create table t1 nologging as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t2 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t3 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4 ;

-- don't need to gather stats as they are generated on CTAS in 12c.
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   150 (100)|          |
|   1 |  HASH UNIQUE         |      |    24 |   336 |   150   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T1   |    25 |   350 |    51   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 20000 |    97K|    49   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 20000 |    97K|    49   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">=500 AND "T1"."N2"> AND "T1"."N3">))

I’ve pulled the execution plan from memory using dbms_xplan.display_cursor(), which is why the predicate section shows the odd comparison of “t2.n2 > {nothing there}” and “t3.n3 > {nothing there}”. The predicate section produced by a simple call to explain plan would help us to fill in the details:


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."N1">=500 AND "T1"."N2"> (SELECT AVG("T2"."N2") FROM
              "T2" "T2") AND "T1"."N3"> (SELECT AVG("T3"."N3") FROM "T3" "T3"))

We can infer from the predicate section that Oracle executes the t2 subquery first and the t3 subquery second; but there is a slight oddity in the shape of the execution plan if we apply the “first child first / recursive descent” approach to reading the plan. It looks as if operations 3 and 5 are children of line 2 and should be run before child 2. This is the consequence of “subquery pushing”. In fact the subqueries are both filter subqueries and, in principle (and ignoring scalar subquery caching for the moment), they will be run once for each row produced from the tablescan of operation 2.

The activity is as follows:

  • start scanning table t1
  • when the first row appears where n1 >= 500 run the subquery against t2 to derive the average of t2.n2 (10000.5) and check to see if the n2 value of the current t1 row exceeds that value (it doesn’t);
  • continue scanning t1 checking for rows where n1 >= 500 and n2 >= 10000.5 (no need to rerun the subquery thanks to scalar subquery caching),
  • on finding the first row matching both predicates run the subquery against t3 to derive the average of t3.n3 (10000.5) and check to see if the n3 value of the current t1 row exceeds that value.
  • continue the scan of t1 checking for rows where n1 >= 500 and n2 >= 10000.5 and n3 >= 10000.5 (again scalar subquery caching means no further executions of either subquery).

We can add weight to the argument that this is the order of activity by flushing the buffer cache and enabled the extended trace before running the query.  Here’s a section of a trace file that demonstrates the point – with no edits beyond the insertion of a few blank lines:


PARSE #140345652175312:c=1000,e=1813,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=115251305,tim=134216063372
EXEC #140345652175312:c=0,e=62,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=115251305,tim=134216063544
WAIT #140345652175312: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=134216063569
WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=131 blocks=1 obj#=133974 tim=134216063627
WAIT #140345652175312: nam='db file scattered read' ela= 315 file#=5 block#=132 blocks=112 obj#=133974 tim=134216064187

WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=515 blocks=1 obj#=133975 tim=134216064438
WAIT #140345652175312: nam='db file scattered read' ela= 115 file#=5 block#=516 blocks=112 obj#=133975 tim=134216064732
WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=628 blocks=12 obj#=133975 tim=134216066046
WAIT #140345652175312: nam='db file scattered read' ela= 123 file#=5 block#=642 blocks=112 obj#=133975 tim=134216066548
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=754 blocks=14 obj#=133975 tim=134216067760
WAIT #140345652175312: nam='db file scattered read' ela= 124 file#=5 block#=770 blocks=112 obj#=133975 tim=134216068153
WAIT #140345652175312: nam='db file scattered read' ela= 8 file#=5 block#=882 blocks=2 obj#=133975 tim=134216069677

WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=244 blocks=12 obj#=133974 tim=134216070049
WAIT #140345652175312: nam='db file scattered read' ela= 127 file#=5 block#=258 blocks=112 obj#=133974 tim=134216070382

WAIT #140345652175312: nam='db file sequential read' ela= 6 file#=5 block#=899 blocks=1 obj#=133976 tim=134216070812
WAIT #140345652175312: nam='db file scattered read' ela= 41 file#=5 block#=900 blocks=112 obj#=133976 tim=134216071028
WAIT #140345652175312: nam='db file scattered read' ela= 42 file#=5 block#=1012 blocks=12 obj#=133976 tim=134216072956
WAIT #140345652175312: nam='db file scattered read' ela= 215 file#=5 block#=1026 blocks=112 obj#=133976 tim=134216073642
WAIT #140345652175312: nam='db file scattered read' ela= 29 file#=5 block#=1138 blocks=14 obj#=133976 tim=134216076334
WAIT #140345652175312: nam='db file scattered read' ela= 154 file#=5 block#=1154 blocks=112 obj#=133976 tim=134216076825
WAIT #140345652175312: nam='db file scattered read' ela= 14 file#=5 block#=1266 blocks=2 obj#=133976 tim=134216081673

WAIT #140345652175312: nam='db file scattered read' ela= 24 file#=5 block#=370 blocks=14 obj#=133974 tim=134216082815
WAIT #140345652175312: nam='db file scattered read' ela= 144 file#=5 block#=386 blocks=112 obj#=133974 tim=134216083232
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=498 blocks=12 obj#=133974 tim=134216084494

FETCH #140345652175312:c=16998,e=21096,p=1105,cr=1111,cu=0,mis=0,r=1,dep=0,og=1,plh=115251305,tim=134216084683

Object 133974 is table t1, object 133975 is table t2, and object 133976 is table t3.

As you can see, we start by scanning t1, then we scan t2 once, then we scan more of t1, then we scan t3 once, then we finish off scanning t1.

We could chase the detail a little further, of course – we could check the session statistics to make sure that we haven’t been using the buffer cache to scan t2 and t3 repeatedly, or we could enable rowsource execution statistics to report the number of times each operation started, or we could check the STAT# lines dumped a little later in the trace files and infer that the t2 and t3 tablescans happened just once each.

At this point I’ll mention my mistake: when I originally read Yasin’s version of the parallel equivalent of this plan my first thought was that the subquery sections of the plan operate from the bottom upwards (i.e. another one of those examples where “first child first” doesn’t apply) whereas Yasin described them as running top down. I was wrong – I was thinking of the case where the optimizer analyses the query from the bottom up, but it still produces a plan that then runs from the top down – but I didn’t realize my mistake until after I’d been poking around with the parallel version of the query and had made a surprising discovery.

Parallel Execution

So what happens when we manipulate this query into parallelism ? To make it a little easier to see some of the details I’ve added hints to each query block to use a different degree of parallelism; and then, because it all happens so fast, I’ve introduced a function call to slow down the progress of the query. Here’s the function I’ve created (you’ll find it elsewhere on blog) and the final query I started testing with:


create or replace function wait_row (
        i_secs  number,
        i_return        number
) return number
parallel_enable
is
begin
        dbms_lock.sleep(i_secs);
        return i_return;
end;
/


select
        /*+ parallel(t1,5) tracking */ 
        distinct t1.n1
from
        t1
where   
        wait_row(0.01,t1.n2) > (select /*+ parallel(t2,3) */ avg(wait_row(0.01,t2.n2)) from t2)
and     wait_row(0.01,t1.n3) > (select /*+ parallel(t3,2) */ avg(wait_row(0.01,t3.n3)) from t3)
and     t1.n1 >= 1000
--      and     wait_row(0.01, t1.n1) >= 500
;

The wait_row() function simply waits for the number of seconds given in the first parameter then returns the (numeric) value of the second parameter. It calls the dbms_lock.sleep() function which has a granularity of 1/100th of a second, and since I’m planning to use it in a parallel query I’ve included the parallel_enable declaration.

Since the function has not been declared deterministic, or assigned to the pl/sql result cache, and since there are 20,000 rows in the t2 table the time to run the t2 subquery will be roughly 20,000 * 0.01 seconds, which (ignoring parallelism) would be about 200 seconds. I’ve actually hinted DOP = 3 for that subquery so I hope it will complete in about 70 seconds, and the subquery against t3 with hinted DOP = 2 should complete in about 100 seconds.

You’ll notice I have two possible predicates against t1 – the one shown uncommented here doesn’t use call the wait_row() function, the other does. Here’s an interesting point, though, about the example that doesn’t call wait_row() – the simple predicate is going to eliminate ALL the data from t1, there are no rows where n1 >= 1000 and that’s an important point to remember.

Here’s the execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |    57 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ30001 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   4 |     PX RECEIVE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   5 |      PX SEND HASH              | :TQ30000 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | P->P | HASH       |
|   6 |       HASH UNIQUE              |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR       |          |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL      | T1       |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWP |            |
|   9 |          SORT AGGREGATE        |          |     1 |     5 |            |          |  Q3,00 | PCWP |            |
|  10 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  11 |            PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|  12 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|  13 |              PX BLOCK ITERATOR |          | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 14 |               TABLE ACCESS FULL| T2       | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  15 |          SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|  16 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  17 |            PX SEND QC (RANDOM) | :TQ20000 |     1 |     5 |            |          |  Q2,00 | P->S | QC (RAND)  |
|  18 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q2,00 | PCWP |            |
|  19 |              PX BLOCK ITERATOR |          | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 20 |               TABLE ACCESS FULL| T3       | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N1">=1000 AND "WAIT_ROW"(.01,"T1"."N2")> AND "WAIT_ROW"(.01,"T1"."N3")>))
  14 - access(:Z>=:Z AND :Z<=:Z) 20 - access(:Z>=:Z AND :Z<=:Z)

The odd “:Z>=:Z AND :Z <=:Z” predicate is an artefact of rowid-range scans in parallel tablescans, and we’ve seen the “disappearing subquery” problem with predicates pulled from memory so that’s ignorable. The interesting point is that the “T1.N1 >= 1000” predicate will eliminate all the data from t1 very quickly – so why did it take roughly 3 minutes and 30 seconds to run the query ?

Quick Answer – Oracle ran the two subqueries before it started the t1 tablescan !

Slow Answer – I need to start by filling in a lot of background information.

Notice how we have multiple DFO Trees in this execution plan. In the TQ column these are captured as (Q1,xx) (Q2,xx) and (Q3,xx).  In previous blogs I’ve explained how you can follow the order of activity by following the table queues (effectively the “xx” – also known as virtual tables, also known as Data Flow Operations) in order, but I haven’t explained how you deal with multiple DFO trees. It would be nice to think that DFO Trees implicitly told you the order of operation – unfortunately that’s not true (at least in current versions of Oracle), and this query demonstrates the point.

Table Queues, DFOs (data flow operations), and DFO Trees

The first of Yasin Baskan’s two articles describes the terminology to use when describing the movement of data through a parallel execution plan, but for the sake of familiarity I’ll walk through the plan above using the appropriate terms.  (Note: once upon a time the term “DFO tree” didn’t exist and the term DFO was used to describe what is now called a DFO tree, and a number of articles I’ve written in the past echo this out-dated usage).

Lines 1 – 8 show a DFO tree consisting of two DFOs – the tree is labelled as Q3. The first DFO in the tree (DFO 0) is described in lines 5 – 8 where one set of parallel execution slaves scans real table t1 and sends a data set to the other set parallel execution slaves by writing into a virtual table (:TQ30000); the second DFO in the tree (DFO 1) is described in lines 1 – 4 where a set of parallel execution slaves reads from virtual table :TQ30000, does some work and sends a data set to the query co-ordinator by writing to virtual table :TQ30001.

Lines 9 – 14 show a second DFO tree consisting of a single DFO – this tree is labelled Q1;

Lines 15 – 20 show us a third DFO tree, labelled Q2, also consisting of a single DFO.

Essentially (as described by Yasin Baskan) the focal point of a single DFO (data flow operation) is a PX SEND, and the top of a DFO tree is a PX Coordinator.

The key to performance problems and interpreting tricky parallel execution plans is the fact that in the extreme case every DFO tree may operate at a different degree of parallelism, and all the DFO trees may be operating concurrently – an unlucky execution plan may acquire far more parallel execution slaves than you expect.

So what’s the order of execution in this query, how do we find out, how many parallel query slaves will we start (and stop) as the query runs, and what’s the maximum number of parallel query slaves that we will be holding concurrently ? There are lots of places you could look:

  • v$pq_tqstat might give you some idea of processes used and data moved – after you’ve run the query, but it’s a bit broken for multiple DFO trees.
  • v$px_sesstat will give you some information about the work done by currently live slave processes if you can query it while your query is still running (“set pause on” may help).
  • v$px_session joined to v$session_event will give you some information about the time lost to waits for currently live slave processes if you can query them while your query is still running (it would be nice if Oracle were to introduce a “v$px_sesevent”)
  • The SQL Monitor screen in OEM (or the dbms_sql_tune.report_sql_monitor() procedure is fantastic (but a little broken) if you’ve paid the licences for the diagnostic and performance packs.

Here’s part of the text output from a call to dbms_sql_tune.report_sql_monitor() for a slightly different query – one that doesn’t include any predicate at all on the t1 table:


SQL Plan Monitoring Details (Plan Hash Value=3828285674)
==================================================================================================================================================
| Id |            Operation             |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                  |          | (Estim) |      | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  1 |   PX COORDINATOR                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  2 |    PX SEND QC (RANDOM)           | :TQ30001 |      49 |   60 |         1 |   +256 |     5 |     1000 |       |          |                 |
|  3 |     HASH UNIQUE                  |          |      49 |   60 |         1 |   +256 |     5 |     1000 |    7M |          |                 |
|  4 |      PX RECEIVE                  |          |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  5 |       PX SEND HASH               | :TQ30000 |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  6 |        HASH UNIQUE               |          |      49 |   60 |        68 |   +189 |     5 |     4506 |    7M |          |                 |
|  7 |         PX BLOCK ITERATOR        |          |      50 |   14 |        68 |   +189 |     5 |    10000 |       |          |                 |
|  8 |          TABLE ACCESS FULL       | T1       |      50 |   14 |        68 |   +189 |    63 |    10000 |       |          |                 |
|  9 |           SORT AGGREGATE         |          |       1 |      |         1 |   +187 |     1 |        1 |       |          |                 |
| 10 |            PX COORDINATOR        |          |         |      |         1 |   +187 |     4 |        3 |       |          |                 |
| 11 |             PX SEND QC (RANDOM)  | :TQ10000 |       1 |      |         3 |   +185 |     3 |        3 |       |          |                 |
| 12 |              SORT AGGREGATE      |          |       1 |      |        74 |   +114 |     3 |        3 |       |          |                 |
| 13 |               PX BLOCK ITERATOR  |          |   20000 |   18 |        74 |   +114 |     3 |    20000 |       |          |                 |
| 14 |                TABLE ACCESS FULL | T2       |   20000 |   18 |        74 |   +114 |    37 |    20000 |       |          |                 |
| 15 |           SORT AGGREGATE         |          |       1 |      |         1 |   +118 |     1 |        1 |       |          |                 |
| 16 |            PX COORDINATOR        |          |         |      |         1 |   +118 |     3 |        2 |       |          |                 |
| 17 |             PX SEND QC (RANDOM)  | :TQ20000 |       1 |      |         2 |   +111 |     2 |        2 |       |          |                 |
| 18 |              SORT AGGREGATE      |          |       1 |      |       111 |     +2 |     2 |        2 |       |          |                 |
| 19 |               PX BLOCK ITERATOR  |          |   20000 |   27 |       111 |     +2 |     2 |    20000 |       |          |                 |
| 20 |                TABLE ACCESS FULL | T3       |   20000 |   27 |       111 |     +2 |    26 |    20000 |       |          |                 |
==================================================================================================================================================

There’s a little inaccuracy in timings – in the order of the odd second or two – probably because of the granularity used, but the columns “Start Active” and “Time Active (s)” are, nevertheless, very revealing. The first one tells you when a line in the plan first did some work, the second one tells you how long that line has been active. As I walk through the plan remember that the timing granularity is fairly crude, so the numbers don’t quite add up as nicely as we might like.

If you check lines 17 – 20 you can see that DFO tree 2 (:TQ2xxxx) was the first thing to go active with its PX slaves starting at time t = 2 seconds and running for 111 seconds to do that scan and initial aggregation; the PX send line started at time t = 111 seconds and ran for 2 seconds to pass the data from the slaves to the coordinator.

There’s a little glitch (or hand waving moment) around lines15 and 16 where the PX Coordinator receives and aggregates data from the PX slaves – the co-ordinator seems to start doing this several seconds later than it should.

In lines 11 – 14 you can see that DFO tree 1 (:TQ1xxxx) was the second thing to go active with its PX slaves starting at time t = 114 (i.e. just after DFO tree 2 completes);  the full scan and initial aggregate ran for 74 seconds (taking us to time t = 188). The PX send started (a little early, this time) at t = 185 and ran for 3 seconds, with the co-ordinator starting at t = 187 and taking one second to receive and do the final aggregation.

In lines 2 – 8 we can see DFO tree 3 (:TQ3xxxx) starting up, but this DFO tree includes two DFOs. One set of slaves scans table t1 applying the predicates to filter the data then does the initial hash distinct before hash distributing the intermediate result to the next set of slaves that finish off the distinct aggregation.

When DFO :TQ30000 starts at t = 189, the previous DFO trees have completed and the PX coordinator has the results of the two subqueries which it passes to the slave set which can now scan and check for: “n1 >= 500 and  n2 > {derived constant} and n3 > {derived constant}”. The scan and initial aggregation takes 68 seconds completing at t = 255, and at t = 256 the results are distributed to the second set of PX slaves. The second set of slaves has virtually nothing to do and reports only 1 second of execution time before passing the data (DFO :TQ30001) to the query coordinator which does the final aggregation and report.

A key point to note is that this timing information tells us that (a) the subqueries both completed before the tablescan of t1 started, and – perhaps more importantly – that we cannot rely on the TQXnnnn numbers to tell us the order of operation of the DFO trees. In this example DFO tree 2 ran first, DFO tree 1 ran second, and DFO tree 3 ran third – some queries might happen to show the names and the order in synch, that would be a coincidence, not a deliberate design feature.

A little reminder, though – even though we cannot rely on the DFO tree names to tell us the ordering in which the DFO trees operate, when we look at the data flow operations inside the trees (i.e. the last few digits of a TQ name) the order of operation withing a given tree matches the sequence indicated by the TQ name.

Counting Slaves.

There’s more in the Monitor report that can help us understand the workload – and in this case I’ve got a lucky coincidence to help, and a little surprise to go with it.  For parallel queries the report includes a summary of parallel execution activity, and this is what I got from my example:

Parallel Execution Details (DOP=5 , Servers Allocated=15)
===========================================================================================================================
|      Name      | Type  | Group# | Server# | Elapsed |   Cpu   | Concurrency | PL/SQL  |  Other   | Buffer | Wait Events |
|                |       |        |         | Time(s) | Time(s) |  Waits(s)   | Time(s) | Waits(s) |  Gets  | (sample #)  |
===========================================================================================================================
| PX Coordinator | QC    |        |         |    0.03 |    0.02 |             |         |     0.02 |     15 |             |
| p00a           | Set 1 |      1 |       1 |    0.27 |    0.27 |             |    0.10 |          |    163 |             |
| p00b           | Set 1 |      1 |       2 |    0.25 |    0.25 |             |    0.09 |          |    156 |             |
| p00c           | Set 1 |      1 |       3 |    0.26 |    0.26 |             |    0.09 |          |    156 |             |

| p00a           | Set 1 |      2 |       1 |    0.43 |    0.43 |             |    0.15 |          |    221 |             |
| p00b           | Set 1 |      2 |       2 |    0.44 |    0.44 |             |    0.15 |          |    221 |             |

| p000           | Set 1 |      3 |       1 |    0.00 |    0.00 |             |         |          |        |             |
| p001           | Set 1 |      3 |       2 |    0.00 |    0.00 |             |         |          |        |             |
| p002           | Set 1 |      3 |       3 |    0.01 |    0.01 |             |         |          |        |             |
| p003           | Set 1 |      3 |       4 |    0.01 |    0.01 |             |         |          |        |             |
| p004           | Set 1 |      3 |       5 |    0.01 |    0.01 |             |         |          |        |             |
| p005           | Set 2 |      3 |       1 |    0.26 |    0.26 |             |    0.09 |          |    113 |             |
| p006           | Set 2 |      3 |       2 |    0.24 |    0.24 |        0.00 |    0.08 |          |    108 |             |
| p007           | Set 2 |      3 |       3 |    0.25 |    0.25 |             |    0.09 |          |    117 |             |
| p008           | Set 2 |      3 |       4 |    0.25 |    0.25 |             |    0.09 |          |    108 |             |
| p009           | Set 2 |      3 |       5 |    0.25 |    0.25 |        0.00 |    0.09 |          |    117 |             |
===========================================================================================================================

I’ve inserted two blank lines breaking the summary down into three separate groups, which you can identify by the heading “Group#”. In this table we see Group 1 has one slave set of three slaves – which corresponds to the tablescan of t2; Group 2 consists of one slave set of two slaves – which corresponds to the tablescan of t3; and Group 1 has two slave sets of 5 slaves each – which correspond to the tablescan and aggregation of t1. The Group numbers appear to align correctly with the DFO tree numbers.

Another detail that stands out from this list is that slaves p00a and p00b are used in the tablescan of t3, and in the tablescan of t2 (where they are joined by slave p00c). In this example Oracle has reused the slaves from one subquery to run the next. However we can also see that the slaves p000p009 that are used for the tablescan of t1 don’t have names that overlap with the slaves used to scan t2 and t3 – which tells use that we have some (in this case two) DFO trees running concurrently.

Another key point in this article is that not only does a DOP of N mean we could run 2*N slaves concurrently in a parallel query, if we have a query that breaks into a plan that uses multiple DFO trees we might 2 * N * {number of DFO trees) allocated and running concurrently. (And, as we see here, it’s even possible that different DFO trees run with different DOPs, leaving Oracle very confused about what to report as the DOP of the query – dbms_xplan actually reported this one as DOP = 4 !)

A final implementation detail that is not necessarily obvious from this table, but which I can infer because I know the history of what was going on: although, as described above, the tablescan of t3 was the first activity that generated a row source, Oracle started up the 10 slaves for the parallel tablescan and aggregate of t1 before it started the two slaves it needed to scan t3. I can infer this because there were no slave processes active when I started running the query, so it is reasonable to assume that the slave names tell me something about the order in which they were allocated. In general you would not be able to notice this in a busy system where slaves were pre-allocated (parallel_min_servers != 0) and constantly being acquired and released. (I was also able to corroborate some of the details above by looking at v$px_stat and v$session_event for the relevant slave processes shortly after the query started.)

Summary

There are a number of key points this example shows us about complex parallel queries:

  • A single query can produce multiple “DFO trees”
  • Each tree can run at a different degree of parallelism
  • The “tree number” (the X in TQXnnnn) may not reveal the order in which the trees are operated
  • The order of operation of the DFOs within a tree is revealed by the nnnn in TQXnnnn
  • It is not easy to see in a basic execution plan which DFO tree might be executing concurrently and which consecutively
  • PX server processes may be acquired by the query co-ordinator a long time before they actually become active

Footnote:

If you happen to have looked closely at the time information in the Parallel Execution Details you’ll note that the times are total rubbish when compared to the monitoring times. It looks as if the time spent in dbms_lock.sleep() has been ignored (perhaps as an “idle” wait).

 

December 10, 2015

Upgrades

Filed under: 12c,Oracle,Upgrades — Jonathan Lewis @ 8:42 am GMT Dec 10,2015

One of the questions that came up at the Optimizer Round Table this year was about minimizing the performance-related** hassle of upgrading from 11g to 12c. Dealing with changes in the optimizer is always an an interesting problem but in 12c this is made more challenging because of the automatic dynamic sampling that can introduce a significant amount of extra work at (hard) parse time, then generate SQL Directives, and finally generate extended (column group) statistics the next time you (or the automatic job) collect stats.

Of course one of the traditional strategies for upgrades (especially if you don’t really have a technical need to, but want your Oracle version to be current) is to set the optimizer_features_enable parameter to the older version and hope that this stops most of the new features from appearing, letting other customers identify the problems and raise the SRs that result in the next patch set.

There’s also the option for reading the manuals and white-papers carefully and identifying the new and enhanced optimizer features that might cause you problems if only you can spot them in time in your code. (The enhancement to subquery unnesting I described in my previous post is such an example – do you know of any code in your application that was carefully engineered to maximise the benefit of scalar subquery caching, if so you should check very carefully that 12c isn’t going to transform away your subquery and wreck your performance.)

As a general strategy, though, it’s worth reviewing your code for four generic features – histograms, multi-column joins, predicates that apply a function to a column, and “out-of-range” issues. The first three are features that are likely to make the 12c optimizer decide to do something “dynamic” that may ultimately give you a better execution plan, but may cost you extra resources or result in increased contention at just the wrong moment; the fourth is one that can disrupt the solution to the multi-column problem.

For the purposes of this note I am going to assume that you don’t use partitioned table or that you have already devised a programmatic method for minimising the work while maximising the effectiveness of their stats (and that your programs unlock and relock the stats so that they aren’t re-gathered by accident).

Histograms

I’m assuming by now that you’re using the approximate_ndv method with auto_sample_size to gather object statistics. In 11g this still means that any histograms gathered by Oracle will use a (surprisingly small) sample and even if you are gathering the histograms at the right time they could introduce a lot of instability to execution plans.

In 12c, Oracle can use an approximate method to create a frequency or Top-N frequency histogram while doing the scan that generates all the other stats – so if you have columns where you know the number of distinct values is relatively small (default up to 254, though technically up to 2048) or that almost all the data (except, broadly speaking, one bucket’s worth) comes from a relatively small number of distinct values then, on the upgrade, you should be safe creating a frequency or Top-N histogram once on those columns and setting the method_opt to ‘for all columns size repeat’.

Oracle 12c still uses a small sample size, though, when it detects the need for a “hybrid” histogram – which is the thing that replaces the height-balanced histogram. This means you may still need to write code to generate the histogram rather than allowing Oracle to collect it. It’s the classic compromise problem – a good histogram will be expensive to gather, a cheaply gathered histogram could easily be inaccurate and unstable. In this case you may want to create a Top-N frequency histogram that is a good model of the data, rather than trying to create a correctly structured hybrid. For tables with such columns you need code that can detect changes to the stats and recreates the histogram. In fact, even in cases where Oracle could gather a (frequency) histogram efficiently and accurately, you may still want to detect changes to stats for some columns and write code to create a histogram that describes the data the way it will look when the users are accessing it rather than the way the data looked when Oracle created the histogram (at 2:00 a.m. or whenever the automatic stats collection job hit it).

Broad strategy: set the global preference to method_opt => ‘for all columns size 1’, set a few table preferences to ‘for all columns size repeat’, and have code that checked the last_analyzed date on a few tables tables and recreates the histogram you want immediately after the stats have been gathered.

Note: as a general guideline, if you know the database needs histograms to produce the most appropriate execution plans this means the front-end code has to co-operate with the database and not conceal useful information in SQL that uses bind variables that hide the significance of special values or ranges – but that’s a topic for another blog note (if I haven’t written anything about it before).

Virtual Columns and Column Groups

In 12c column expressions (where the optimizer guesses) and multi-column predicates (where the optimizer assumes independence) are key triggers to dynamic sampling and SQL Directives. To engage in a pre-emptive strike the ideal is to know the code and know the data. Search for “important” SQL that applies functions (like trunc(date_col)) to columns in predicates or does multi-column (equality) joins or uses filter predicates that reference multiple columns from the same table.

If the execution plans you find for these statements produce estimates which are clearly far from reality then you may need to take pre-emptive action, even (or, perhaps, especially) if the resulting plans look good. The optimizer may (for example) know that it is guessing when it says to itself: “trunc(delivery_date) = trunc(sysdate) will return 1% of the data” and do several things such as create an adaptive execution plan and switch plans in mid-execution, do a load of dynamic sampling to discover the guess was wrong, dump an SQL Plan Directive into the data dictionary that triggers subsequent dynamic sampling and then create a virtual column through the extended stats mechanism.

In cases like this you may want to create and document virtual columns explicitly, and create column groups explicitly before Oracle does its dynamic thing. In the case of the column groups, you only have to worry about equality predicates, a column group cannot be used with range-based predicates. Remember you are only allowed a maximum of 20 column groups per table (although there is, effectively, no limit on the number of virtual columns) so you need to make your choices of columns groups before Oracle goes onto auto-pilot.

Remember, there’s a trap waiting here if you don’t make this effort.  You might run a test suite a couple of times and fix everything that seems to be necesary without realising that in the course of your testing Oracle has created a couple of hundred column groups and virtual columns. Then, when everything is working nicely, you upgrade the production system and all hell breaks loose until the production system has generated the same (we hope) set of virtual columns and column groups. Since there is a limit of 20 column groups per table you could get unlucky and find that Oracle wants to create 30 columns groups and the 20 that happen to be created first on production may not be the ones it created on the test system.

Out of Range

When a predicate specifies a value that is above the known high value or below the known low value for a column, the optimizer uses a “linear decay” method for cardinality calculations that is based on how far out of the range the value is (as a percentage of the range). In many cases this makes introduces a slowly increasing error (though for some predicates it can create an immediate catastrophe). There is a special consideration, though,  that means you need to be very careful about time-based or sequence-based columns that can go out of range: if you have a column group that includes the column then the optimizer stops using the column group the moment you go out of range. This is probably a rare condition to check for because you probably use range-based predicates on such columns (e.g. “order_date < trunc(sysdate) – 7”) and column groups are not applicable to range-based predicates anyway, but you do need to be aware that in special cases an execution plan can go bad the moment you overshoot the high value.

For columns like these you need a strategy that allows you to set a high (or low) value that reflects a future high value for the column – even to the extent (for example) of running a scheduler task that calls dbms_stats.set_column_stats() every hour to push the high value forward a little bit.

Update [11th Dec 2015]

Since this note is about preemptive strategies for reducing the number of problems you run into with 12c, it’s worth reminding you about the new “table preference” which you can use to get more realistic values for the clustering_factor on indexes. For example:


begin
        dbms_stats.set_table_prefs(user, 't1', 'table_cached_blocks', 16);
        dbms_stats.gather_index_stats(user, 't1_i1');
end;
/

Setting the table_cached_blocks preference for a table means Oracle will remember a history of recently “visited” table blocks as it walks an index to generate the clustering_factor. Although there is no official word, I think that 16 is a good default value for this setting in single instance Oracle, and 16 * {number of instances} might be appropriate for RAC.

The fragment above is just for demo purposes – you wouldn’t set the value every time you gather stats on an index, it’s a one-off exercise for each table – though you could change it for all existing tables in a schema with a single call to dbms_stats.set_schema_prefs().

The reason why this preference becomes more important in 12c is that the more expensive an execution path is the longer the optimizer will work to find a better path – and bad settings for the clustering_factor make otherwise good paths looks expensive and encourage the optimizer to try all sorts of before deciding on the final path.

Wrap-up

This is just a brief note to pick up a few key points that you can identify as potential threats before you start upgrading. Obviously it will also be of some help after you’ve upgraded (and are still in test mode) to help you explain some of the changes to execution plans and increases in parse times that you are likely to see. The benefit of pre-emptive action, though, is that you may be able to minimise the number of hidden actions (creation of directives and extended stats) that Oracle takes during your testing. It used to be hard enough doing an upgrade in the past, going up to 12c it can be even harder because Oracle may have changed your test database behind your back when you weren’t looking leaving you running a production system that doesn’t match the system you’ve tested.

** Footnote: if you went to Tim Hall’s presentation you’ll know that part of your upgrade process should deal with the move to the multi-tenant architecture, even if you adopt just the single-PDB licence.

 

December 9, 2015

12c Scalar Subquery

Filed under: 12c,Execution plans,Oracle — Jonathan Lewis @ 2:25 pm GMT Dec 9,2015

Every version of the optimizer enhances existing mechanisms and introduces new features and 12c has introduced some of the most sophisticated transformation to date; in this note I want to demonstrate an enhancement to subquery unnesting that could give a significant performance boost to a certain query pattern but which might, unfortunately, result in worse performance.

Historically subquery unnesting turned subqueries (correlated or not) in the where clause into joins. In 12c subquery unnesting can also turn scalar subqueries in the select list into joins – we’ll discuss why this could be a good thing but might occasionally be a bad thing later on in the article, but let’s start with a test case.

Sample data.

In my demonstration I’m going to use three tables which, for convenience, are three clones of the same data.

create table t1
as
with generator as (
	select
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum				id,	
	mod(rownum-1,100)		mod100,
	trunc((rownum - 1)/100)		trunc100,
	rownum				n1,
	rownum				n2,
	lpad(rownum,6,'0')		vc1,
	rpad('x',100)			padding
from
	generator
where
	rownum <= 10000
;

create table t2 as select * from t1;
create table t3 as select * from t1;

create index t1_i1 on t1(id);
create index t2_i1 on t2(id,mod100);
create index t3_i1 on t3(id,trunc100);

begin
	dbms_stats.gather_table_stats(user,'t1');
	dbms_stats.gather_table_stats(user,'t2');
	dbms_stats.gather_table_stats(user,'t3');
end;
/

I’ll be examining a query against t1 that includes two correlated scalar subqueries in the select list that reference one each of t2 and t3:


explain plan for
select
	/*+
		qb_name(main)
	*/
	n1, n2,
	(
		select	/*+ qb_name(sq1) */
			max(mod100)
		from	t2
		where	t2.id = t1.id
	) new_n1,
	(
		select	/*+ qb_name(sq2) */
			max(trunc100)
		from	t3
		where	t3.id = t1.id
	) new_n2
from
	t1
where
	t1.id between 101 and 200
;

select * from table(dbms_xplan.display);

11g Plan

This is the execution plan you might expect to see from 11g – in my case, with my system stats etc. and running 11.2.0.4:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |   101 |  1212 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)| T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)| T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID | T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN           | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

As you can see, the operations for the subqueries appear first in the plan (lines 1-3, and 4-6), with the operations for the main query appearing as the last section of the plan (lines 7-8). You might note that the total cost of the plan doesn’t cater for the cost of the subqueries – technically we might expect to see the optimizer producing a cost of something like 408 on the basis that it’s going to run each subquery an estimated 101 times and each subquery has a cost of 2, and the 101 rows are generated from a query with a cost of 4 giving: 4 + 101 * (2 + 2) = 408.

12c Plan

On the upgrade to 12c, the same code produces the following plan:


--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  1 |  HASH JOIN OUTER                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  2 |   HASH JOIN OUTER                     |          |   101 |  3838 |     6   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |   101 |  1212 |     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_I1    |   101 |       |     2   (0)| 00:00:01 |
|   5 |    VIEW                               | VW_SSQ_2 |   101 |  2626 |     2   (0)| 00:00:01 |
|   6 |     HASH GROUP BY                     |          |   101 |   707 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                 | T2_I1    |   101 |   707 |     2   (0)| 00:00:01 |
|   8 |   VIEW                                | VW_SSQ_1 |   101 |  2626 |     2   (0)| 00:00:01 |
|   9 |    HASH GROUP BY                      |          |   101 |   707 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN                  | T3_I1    |   101 |   707 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("ITEM_1"(+)="T1"."ID")
   2 - access("ITEM_2"(+)="T1"."ID")
   4 - access("T1"."ID">=101 AND "T1"."ID"<=200)
   7 - access("T2"."ID">=101 AND "T2"."ID"<=200)
  10 - access("T3"."ID">=101 AND "T3"."ID"<=200)

As you can see, the separate plans for the subqueries have disappeared and the plan is showing a three-table join (with two outer joins, and two of the “tables” being the non-mergeable view vw_ssq_2 and vw_ssq_1).

There are several details to pick up in this plan (apart from the unnesting). The cost is only 8 – but in this case it isn’t the effect of the optimizer “ignoring” the cost of the subqueries, it’s the optimizer correctly working out the cost of the unnested subqueries with joins. The cost happens to be low in this case because the optimizer has used transitive closure to pass the predicate from the driving query into the subqueries – so we need only do a couple of short index range scans to get all the data we need in the unnested subqueries.

The outer joins are needed because it is valid for the original scalar subquery mechanism to return no data for a subquery and still report a row (with nulls) for t1. If the rewrite didn’t introduce the outer join then t1 rows for which there were no matching t2 or t3 rows would disappear from the result set.

Threats and workarounds

In this (lightweight) example it looks as if this transformation is a good idea, but it’s always possible that the optimizer might choose to do this when it’s a bad idea. In fact, a quick check of the optimizer trace (10053) suggests that this is an uncosted transformation that will take place “because it can”. Here are six highly suggestive consecutive lines from the trace file:


SU: Unnesting query blocks in query block MAIN (#1) that are valid to unnest.
Subquery Unnesting on query block MAIN (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block MAIN (#1).
SU:   Unnesting  scalar subquery query block SQ2 (#2)
Registered qb: SEL$2E540226 0x50b1a950 (SUBQ INTO VIEW FOR COMPLEX UNNEST SQ2)

Even if this transformation is cost-based rather than heuristic it’s always possible for the optimizer to make a very poor estimate of cost and do the wrong thing. Fortunately it’s possible to block the unnesting with the “traditional” /*+ no_unnest */ hint:


select
	/*+
		qb_name(main)
		no_unnest(@sq1)
		no_unnest(@sq2)
	*/
	n1, n2, 
...

With these hints in place the execution plan changes back to the 11g form – though there is a notable change in the estimated final cost of the query:


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   101 |  1212 |   206   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)       | T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)       | T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN                  | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

It’s a little surprising that the final cost is (4 + 202) rather than the  (4 + 404) that we calculated earlier, but a few variations on the test suggest that the optimizer is using half the cost of each of the scalar subqueries in the final cost estimate – perhaps as a nod in the direction of scalar subquery caching.

As always it is important to remember that you cannot look at a query like this and know which plan is going to perform better unless you know the data content, the pattern of data distribution, and the probably effect of scalar subquery caching. In some cases it may be that an extremely “lucky” data pattern will mean that a scalar subquery will run a very small number of times while, with the same data arranged in a different order the benefit of scalar subquery caching will disappear and the unnesting approach may be much better.

It is a convenience to be able to recognize that there is a new optimizer feature in 12c that might give you a performance boost but might hit you with an unlucky performance penalty (especially in cases, perhaps, where you have previously engineered your code very carefully to take advantage of scalar subquery caching) and know that there is a workaround available. In your search for pre-empting production performance problems, this code structure is another pattern to scan for in your source code.

November 24, 2015

Table Expansion

Filed under: 12c,Bugs,Oracle,Partitioning — Jonathan Lewis @ 12:13 pm GMT Nov 24,2015

I’ve often found that while I’m investigating one Oracle feature I get waylaid by noticing anomalies in other parts of the code. I was caught by one of these events a little while ago while experimenting with the new (12.1.0.2) Inmemory Columnar Store.  After reading a posting by Martin Bach I asked the question:

“If you have a partitioned table with a local index and one of the table partitions has been declared INMEMORY, would a query that could use that index be able to apply table expansion to produce a plan that did a tablescan on the in-memory partition and an indexed access path on the partitions that weren’t in-memory?”

The answer was yes, with one important caveat – the first test I built to investigate the question did something very silly and produced the wrong results. In fact the error turned out to have nothing to do with the inmemory columnar store, it also appeared when I disabled the inmemory feature off and nudged the optimizer into table expansion by making one of the index partitions unusable. Here’s the demo, followed by a few notes, if you want to test it for yourselves:


create table t1 (
        id,
        n1,
        padding
)
partition by range (id) interval (250000) (
        partition p_start values less than (250000)
        segment creation immediate
        inmemory priority critical
)
nologging
as
with generator as (
        select  --+ materialize
                rownum id 
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum - 1              id,
        mod(rownum - 1,1250)    n1,             -- 200 rows per segment
        rpad('x',400,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

begin dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(n1) local nologging;

I’ve created an interval-partitioned table holding a million rows with an interval of 250,000, and I’ve defined the first partition with an upper bound of 250,000 and the inmemory attribute, leaving Oracle to add the other three partitions which won’t have the inmemory attribute set.

I’ve created a local index on the n1 column, with 200 rows (1,250 distinct values) per partition. Because of the way I’ve defined n1 I can easily recreate the table to adjust the relative code of the index accessed path and the tablescan path by adjusting the value I use in the mod() function. The use of the mod() function also means that every partition holds the same volume of data (with exactly the same degree of scattering) for any specific value of n1.

To test the option for table expansion I’ve then checked the execution plan for (and run) four queries:


select
        /*+ full(t1) */
        id
from
        t1
where
        n1 = 1
and     id < 250000
;

select
        /*+ index(t1) */
        id
from
        t1
where
        n1 = 1
and     id < 250000 
;
 
select
        /*+ full(t1) */
        id 
from    t1
where   n1 = 1
and     id >= 250000
and     id <  500000
;

select
        id
from
        t1
where
        n1 = 1
;

The first two queries give me the cost of accessing the inmemory partition by tablescan compared to the cost of accessing it by index. The third query is there to demonstrate that the non-inmemory tablescan is a LOT more expensive than the inmemory one. The final query accesses all four partitions to see if the inmemory partition is treated differently from the other three. Here are the four plans:

Default tablescan when limited to the in-memory partition
----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |   200 |  1800 |   184   (9)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE     |      |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|*  2 |   TABLE ACCESS INMEMORY FULL| T1   |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - inmemory("N1"=1 AND "ID"<250000)
       filter("N1"=1 AND "ID"<250000)



Index access is more expensive than the inmemory tablescan
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |   200 |  1800 |   201   (0)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE SINGLE                    |       |   200 |  1800 |   201   (0)| 00:00:01 |     1 |     1 |
|*  2 |   TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1    |   200 |  1800 |   201   (0)| 00:00:01 |     1 |     1 |
|*  3 |    INDEX RANGE SCAN                        | T1_I1 |   200 |       |     1   (0)| 00:00:01 |     1 |     1 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("ID"<250000)
   3 - access("N1"=1)



Tablescan on a non-inmemory partition is much higher than inmemory (and indexed access)
-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |   200 |  1800 |   1891  (2)| 00:00:01 |       |       |
|   1 | PARTITION RANGE SINGLE|       |   200 |  1800 |   1891  (2)| 00:00:01 |     2 |     2 |
|*  2 | TABLE ACCESS FULL     | T1    |   200 |  1800 |   1891  (2)| 00:00:01 |     2 |     2 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=1 AND "ID">=250000 AND "ID"<500000)



Access all four partitions - no table expansion. Cost = 4 x cost of single partition indexed access path.
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |   800 |  7200 |   807   (1)| 00:00:01 |       |       |
|   1 |  PARTITION RANGE ALL                       |       |   800 |  7200 |   807   (1)| 00:00:01 |     1 |1048575|
|   2 |   TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1    |   800 |  7200 |   807   (1)| 00:00:01 |     1 |1048575|
|*  3 |    INDEX RANGE SCAN                        | T1_I1 |   800 |       |     6   (0)| 00:00:01 |     1 |1048575|
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=1)

So we can see that the inmemory tablescan of a single segment is cheaper than the indexed access path, and much cheaper than the tablescan of a single non-inmemory segment. When we execute the query that accesses all four segments, though, the optimizer chooses to use the same access path for all four partitions rather then splitting the query into one tablescan and three indexed accesses.

This is just a starting point for testing though – there are transformations where the optimizer will only use a particular transformation if the transformed query has a cost that is “X%” lower than it would be without the transformation (and some of these transformations have associated parameters – usually hidden – that can be set to adjust the percentage). Perhaps if I adjusted the data so that the relative benefit of a single inmemory scan was larger; or if I created a table with more partitions and had two inmemory and three not inmemory, or 3 inmemory and 4 not inmemory; or if I had more inmemory partitions than not then maybe the optimizer would spontaneously do table expansion.

Rather than fiddle about with the data my next step was to hint the final query with /*+ expand_table(t1) */. Here’s the resulting execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |       |       |  1797 (100)|          |       |       |
|   1 |  VIEW                                        | VW_TE_1 |  1600 | 20800 |  1797   (1)| 00:00:01 |       |       |
|   2 |   UNION-ALL                                  |         |       |       |            |          |       |       |
|   3 |    PARTITION RANGE SINGLE                    |         |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|*  4 |     TABLE ACCESS INMEMORY FULL               | T1      |   200 |  1800 |   184   (9)| 00:00:01 |     1 |     1 |
|   5 |    PARTITION RANGE ITERATOR                  |         |   600 |  5400 |   806   (1)| 00:00:01 |     2 |     4 |
|*  6 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |   600 |  5400 |   806   (1)| 00:00:01 |     2 |     4 |
|*  7 |      INDEX RANGE SCAN                        | T1_I1   |   800 |       |     5   (0)| 00:00:01 |     2 |     4 |
|   8 |    PARTITION RANGE INLIST                    |         |   800 | 16800 |   807   (1)| 00:00:01 |       |       |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |   800 | 16800 |   807   (1)| 00:00:01 |       |       |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |   800 |       |     6   (0)| 00:00:01 |       |       |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - inmemory(("N1"=1 AND "T1"."ID"<250000))
       filter(("N1"=1 AND "T1"."ID"<250000)) 6 - filter(("T1"."ID">=250000 AND "T1"."ID"<1000000))
   7 - access("N1"=1)
  10 - access("N1"=1)

Points to notice:
Table expansion has occurred – we did an inmemory full tablescan on the inmemory partition at a cost of 184, and index range scans on the other three partitions at a cost of 805 (which looks more like the cost of visiting 4 partitions), expecting 800 rowids from the local index (a failure of partition elimination) and 600 rows from the table. The partition start/stop columns do show that the optimizer expects to visit just the three correct partitions, although some of the cost and cardinality numbers seem to be appropriate to 4 partitions.

Worse still we see a third branch to the UNION ALL / table expansion – operations 8 to 10 – which don’t report a partition start and stop. What’s that doing ? Sometimes, of course, we see sections of plan that don’t actually run – usually preceded by a FILTER operation that can evaluate to FALSE – so maybe that’s what’s happening here. Re-running the query with rowsource execution stats enabled it turned out that the PARTITION RANGE INLIST started once, and operations 9 and 10 didn’t operate at all – so that’s okay.

But what happens if I execute a query that should only access the first two partitions ? Here’s the run-time plan to answer that question:


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID  5dgp982ffsph8, child number 0
-------------------------------------
select  /*+ expand_table(t1) gather_plan_statistics */         id from
       t1 where  n1 = 1 and id < 500000

Plan hash value: 2876620067

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |      1 |        |    800 |00:00:00.05 |     649 |
|   1 |  VIEW                                        | VW_TE_1 |      1 |   1200 |    800 |00:00:00.05 |     649 |
|   2 |   UNION-ALL                                  |         |      1 |        |    800 |00:00:00.01 |     649 |
|   3 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |       3 |
|*  4 |     TABLE ACCESS INMEMORY FULL               | T1      |      1 |    200 |    200 |00:00:00.01 |       3 |
|   5 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |     215 |
|*  6 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      1 |    200 |    200 |00:00:00.01 |     215 |
|*  7 |      INDEX RANGE SCAN                        | T1_I1   |      1 |    200 |    200 |00:00:00.01 |      15 |
|   8 |    PARTITION RANGE INLIST                    |         |      1 |    800 |    400 |00:00:00.01 |     431 |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      2 |    800 |    400 |00:00:00.01 |     431 |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |      2 |    800 |    400 |00:00:00.01 |      31 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - inmemory(("N1"=1 AND "T1"."ID"<250000))
       filter(("N1"=1 AND "T1"."ID"<250000))
   6 - filter(("T1"."ID"<500000 AND "T1"."ID">=250000))
   7 - access("N1"=1)
  10 - access("N1"=1)

My query should have returned 400 rows, accessing only the first two partitions. It returned 800 rows, accessing all 4 partitions. (Notice how there are no predicates on ID for operations 9 and 10).

Since I had started my investigation with the inmemory option I assumed at this point that the problem was somehow related to the that feature; which shows how easy it is to get too focused and jump to conclusions. After raising the problem with Oracle I got a reply that the problem wasn’t about the inmemory columnar store – and here’s the next little change to test to demonstrate that point:


alter index t1_i1 indexing partial;

alter table t1 modify partition p_start indexing off;
alter table t1 modify partition p_start no inmemory;

select  partition_name, indexing, inmemory 
from    user_tab_partitions 
where   table_name = 'T1';

select  partition_name, segment_created    
from    user_ind_partitions 
where   index_name = 'T1_I1';

Enable partial indexing for the index, switch off the index on the first partition then disable the inmemory option for the partition. This is the plan I got from re-running the two-partition query:


---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |         |      1 |        |    800 |00:00:00.08 |   15370 |  14706 |
|   1 |  VIEW                                        | VW_TE_1 |      1 |   1200 |    800 |00:00:00.08 |   15370 |  14706 |
|   2 |   UNION-ALL                                  |         |      1 |        |    800 |00:00:00.07 |   15370 |  14706 |
|   3 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.01 |     216 |      0 |
|*  4 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      1 |    200 |    200 |00:00:00.01 |     216 |      0 |
|*  5 |      INDEX RANGE SCAN                        | T1_I1   |      1 |    200 |    200 |00:00:00.01 |      16 |      0 |
|   6 |    PARTITION RANGE SINGLE                    |         |      1 |    200 |    200 |00:00:00.04 |   14723 |  14706 |
|*  7 |     TABLE ACCESS FULL                        | T1      |      1 |    200 |    200 |00:00:00.04 |   14723 |  14706 |
|   8 |    PARTITION RANGE INLIST                    |         |      1 |    800 |    400 |00:00:00.01 |     431 |      0 |
|   9 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T1      |      2 |    800 |    400 |00:00:00.01 |     431 |      0 |
|* 10 |      INDEX RANGE SCAN                        | T1_I1   |      2 |    800 |    400 |00:00:00.01 |      31 |      0 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("T1"."ID"<500000 AND "T1"."ID">=250000))
   5 - access("N1"=1)
   7 - filter(("N1"=1 AND "T1"."ID"<250000))
  10 - access("N1"=1)


Again I accessed all 4 partitions and returned 800 rows. As an odd little detail the plan reversed the order of access of partitions 1 and 2. The problem isn’t about the inmemory option, it’s a problem with generic table expansion.

Continuing Tests
I won’t go into all the details of what I did next – once you get started it’s hard to stop, and easy to lose track of what you’ve done. So here’s a quick summary.

Given the problem appeared without tthe inmemory columnar store enabled, I switched back to 11.2.0.4 (where table expansion is also possible) and emulated the problem by setting the first index partition unusable (since partial indexing is a 12c feature). Table expansion did not occur even when hinted: the plan was a full tablescan on both partitions.

So I switched the table from being interval partitioned to simple range partitioned, creating all 4 partitions as I created the data. In 11g I got table expansion and the correct answer without the extra branch to the union all; so I went back to 12c and did the same – pure range partitioning, partial indexing, and got table expansion with the correct result and no spare branches to the plan.

Tentative Conclusion

Clearly my testing is by no means exhaustive – but 12c seems to allow table expansion for interval partitioning in cases where 11g does not; unfortunately it is possible for the table expansion code to go wrong for interval partitioning in cases where simple range partitioning does not. In my case this led to wrong results.

 

Reference Script: in_memory_table_expansion_*.sql

 

 

November 9, 2015

Wrong Results

Filed under: 12c,Bugs,Oracle — Jonathan Lewis @ 6:23 am GMT Nov 9,2015

Here’s a little gem in 12c that arrived in my email a few days ago: a query where the result depends on the SQL*Plus arraysize!

The email had a short description, and a script to create a small data set that would demonstrate the problem. I’m not going to show you the query, or the result set, but here’s a sample of the output from an SQL*Plus session after creating the data. This is, by the way, on a “single-user” system – there is no way that some other session is changing the data – especially after the opening “set transaction”:

SQL> set transaction read only;

Transaction set.

SQL> set arraysize 1
SQL> select ...

...

541 rows selected.

SQL> set arraysize 4
SQL> select ...

...

599 rows selected.

SQL> set arraysize 10
SQL> select ...

...

620 rows selected.

SQL> set arraysize 32
SQL> select ...

...

616 rows selected.

The correct result set should have had the 616 rows reported when the arraysize was set to 32 (of, course, it’s possible with an arraysize of 32 the 616 rows returned weren’t the correct 616 rows – rows seemed to get multiplied or deleted fairly arbitrarily as the arraysize changed).

The execution plan was a little unusual in that it forced a nested loop join with a tablescan on the inner table; and when I ran the query with rowsource execution statistics enabled the number of starts of the inner tablescan was 1,597 but the number of rows actually returned varied. My first thought was that some new mechanical optimisation of the tablescan code was losing track of where it had got to in the repeated tablescans – but it turned out I was wrong.

Here’s the execution plan (with camouflage) – the key detail is in a section I didn’t look at intially, the column projection:


select * from table(dbms_xplan.display_cursor('0dh0kh9qa88mz',1,'-note +projection'));

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |       |       | 14118 (100)|          |
|   1 |  NESTED LOOPS       |      |    29 |  2958 | 14118   (2)| 00:00:01 |
|*  2 |   HASH JOIN         |      |   892 | 57088 |    35   (3)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| ABC  |   549 | 21411 |    17   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| DEF  |   892 | 22300 |    17   (0)| 00:00:01 |
|*  5 |   TABLE ACCESS FULL | XYZ  |     1 |    38 |    16   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("DEF"."ABC_FK"="ABC"."ABC_ID")
   5 - filter(("DEF"."COL0"="XYZ"."COL0" AND "XYZ"."COL1"="ABC"."COL1"
              AND "XYZ"."COL2"="ABC"."COL2"))

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "ABC"."ABC_ID"[NUMBER,22], "DEF"."ABC_FK"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "DEF"."COL0"[VARCHAR2,20], "XYZ"."COL1"[NUMBER,22],
       "XYZ"."COL2"[NUMBER,22], "XYZ"."COL0"[VARCHAR2,20]
   2 - (#keys=1) "ABC"."ABC_ID"[NUMBER,22], "DEF"."ABC_FK"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "DEF"."COL0"[VARCHAR2,20]
   3 - (rowset=200) "ABC"."ABC_ID"[NUMBER,22], "ABC"."COL1"[NUMBER,22],
       "ABC"."COL2"[NUMBER,22]
   4 - (rowset=200) "DEF"."ABC_FK"[NUMBER,22], "DEF"."COL0"[VARCHAR2,20]
   5 - "XYZ"."COL1"[NUMBER,22], "XYZ"."COL2"[NUMBER,22],
       "XYZ"."COL0"[VARCHAR2,20]

The predicate section is irrelevant in this case, and I’ve camouflaged the names of the tables and columns – the only interesting bit is the appearance of the (rowset=200) in the projection information. This is reporting a feature new in 12c (and not to be confused with Oracle Java Rowsets) that should improve the performance of some queries.

I didn’t actually look at the projection information until after I’d asked the Oak Table members if they had ever seen this type of anomaly before – and Stefan Koehler emailed back to suggest that the problem might be related to rowsets (there are a couple of similar bugs on MoS, e.g: 17016479 and 20960570) – so I checked the projection, then repeated my tests after disabling the feature with a call to: ‘alter session set “_rowsets_enabled”=false;’

Problem solved – although I’ve told the person who emailed me to report this discovery and workaround to Oracle support and see what they supply as the approved solution.

It is possible to affect the feature through event 10055 – different levels disable it at different locations in the code; the list of options is given in the $ORACLE_HOME/rdbms/mesg/oraus.msg file (if you’re not running Windows):


//            Level:
//            0x00000001 - turn off for table scan
//            0x00000002 - turn off for hash join consume
//            0x00000004 - turn off for hash join produce
//            0x00000008 - turn off for group by
//            0x00000010 - turn off for sort
//            0x00000020 - turn off for table-queue out
//            0x00000040 - turn off for table-queue in
//            0x00000080 - turn off for identity
//            0x00000100 - turn off for granule iterator
//            0x00000200 - turn off for EVA functions
//            0x00000400 - turn off for PL/SQL
//            0x00000800 - turn off for upgrade
//            0x00001000 - turn off for database startup
//            0x00002000 - turn off for blobs and clobs
//            0x00004000 - turn off for tracing row source
//            0x00008000 - turn off rowset information in explain plan
//            0x00010000 - disable hash join rowsets fast path
//            0x00020000 - turn off for bloom create
//            0x00040000 - turn off for bloom use
//            0x00080000 - disable prefetch for hash join
//            0x00100000 - disable prefetch for bloom
//            0x00200000 - disable semi blocking hash join
//            0x00400000 - turn off rowset for fixed table

I tried the first few levels and found that both levels 1 and 2 eliminated the problem (and eliminated the appearance of the (rowset=200) entry in the projection information). Given the shape of the plan I had thought that just one of 1,2 or 4 might have been relevant so I was a little surprised to find that both 1 and 2 were effective – but that’s probably just a question of interpretation of the brief descriptions.

Update

I’ve asked the owner of the problem if it’s okay to post the script to create the tables and data – and the answer was yes: the content was already heavily camouflaged anyway. So here’s a file you can download if you want to test other environments: insert_script_12c_bug

It’s declared as a “.doc” file to get past the upload process, but it’s really a flat text file.

Update 16th Nov

The official “minimum impact” workaround is to set event 10055 at level 2097152 (disable semi blocking hash join). Alternatively there is also a patch available. See Mike Dietriech’s blog for details: https://blogs.oracle.com/UPGRADE/entry/update_for_switch_off_rowsets

Update 7th Dec

Mike Dietriech has updated his second posting to add details of a patch for this bug.

 

October 9, 2015

PL/SQL Functions

Filed under: 12c,Oracle,Performance — Jonathan Lewis @ 6:17 pm GMT Oct 9,2015

Assuming everything else has been tuned to perfection, what’s the best you can do while calling PL/SQL functions from SQL ? Here’s a little code to create a table with some data, and a function that we can use to start an investigation:


create table t1
nologging
pctfree 0
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(dbms_random.value(1,1001))        n1
from
        generator       v1,
        generator       v2
where
        rownum <= 3e5
;

create or replace function func_normal(i_in number)
return number
as
begin
        return round(exp(ln(i_in)));
end;
/

That’s 300,000 rows in the table and a silly little function to use up some CPU to get nowhere. There are 1,000 distinct integer values scattered uniformly through the table, and the function returns the value it’s called with – but it does it a hard way.

Here’s some test code – with “set timing on” – followed by the results:


select count(distinct(n1)) from t1;

select 
        count(distinct(func_normal(n1)))
from    t1
;



COUNT(DISTINCT(N1))
-------------------
               1000

1 row selected.

Elapsed: 00:00:00.03

COUNT(DISTINCT(FUNC_NORMAL(N1)))
--------------------------------
                            1000

1 row selected.

Elapsed: 00:00:11.39

This is running on 12.1.0.2 on a machine with a CPU speed of aboaut 3 MHz – you may want to adjust the number of rows in the table for your own testing.

The question is, what options can we use to improve the efficiency of the calls to the PL/SQL. One option, of course, is to use the new 12c “with PL/SQL” clause – we could embed the function in the SQL like this:


with
        function func_with (
                i_in    number
        ) return number
        is
        begin
                return round(exp(ln(i_in)));
        end;
select
        count(distinct(func_with(n1)))
from    t1
/

COUNT(DISTINCT(FUNC_WITH(N1)))
------------------------------
                          1000

1 row selected.

Elapsed: 00:00:09.77

So, for the cost of copying the function into the SQL we get a 10% improvement in performance – which we could reasonably attribute to an improved efficiency in the call mechanism. There are arguments for and against copying code like this, of course, and my improvement was only 1 second for 300,000 calls, but you may decide that the benefit of the “With-PL/SQL” method is sufficient to justify the effort.

If you don’t want to copy a PL/SQL function into the SQL, though, there is another alternative – the pragma UDF (for user-defined function) – which has the effect of reducing the cost of the so-called “context switch” as you call from SQL to PL/SQL or vice versa.

Technically I think what’s happening is that the stack formats for SQL and PL/SQL are different and the context switch is the work needed to reformat the stack as you pass from one environment to the other – by declaring the function as UDF you probably set it up to deal with the incoming SQL stack itself.

Here’s how to add the UDF pragma to the function, and the results showing the effects:


create or replace function func_udf(i_in number)
return number
as
pragma UDF;
begin
        return round(exp(ln(i_in)));
end;
/

COUNT(DISTINCT(FUNC_UDF(N1)))
-----------------------------
                         1000

1 row selected.

Elapsed: 00:00:09.55

With pragma UDF the standalone function call is slightly faster than the in-line “WITH” function. I did wonder whether the pragma UDF would make the function call slower if I simply called it from a loop in a PL/SQL block, but there didn’t seem to be any significant difference between the normal function and the UDF function.

Initially, then, it looks like UDF is faster than WITH, which is faster than basic; but there are other considerations. My sample data has only 1,000 possible inputs to the function – and Oracle has three different methods for caching that I might be able to benefit from:

  • Declaring the function as deterministic
  • Putting the function into the PL/SQL result cache
  • Modifying the SQL to take advantage of scalar subquery caching

Here’s what the function declaration looks like if I want to use the PL/SQL function cache:


create or replace function func_cached(i_in number)
return number
result_cache
as
begin
        return round(exp(ln(i_in)));
end;
/

Changing my query to use func_cached() the query completed in 1.65 seconds – a clear winner, but can anything else get close.

To make the functions deterministic, I just have to add the word “deterministic” after the declaration of the return type:


create or replace function func_normal(i_in number) 
return number
deterministic
as
begin
        return round(exp(ln(i_in)));
end;
/

We can also add the deterministic keyword to the function defined in the WITH clause. Before reporting the results of testing the functions with determinism, there’s one more strategy to consider. Remove the deterministic key word from the functions, and introduce a scalar subquery to the test query, e.g.:


select
        count(distinct(select func_normal(n1) from dual))
from    t1

;

Here’s a table of results:

Method Initial Deterministic Scalar Subquery
Basic 11.39 4.30 4.32
With 9.77 9.72 3.60
UDF 9.55 9.57 3.57
Cached 1.65 0.71 0.72

Summary

Before stating any conclusions it’s worth remembering that the sample data is based on a relatively small number of distinct input values. It is the repetition that allows us to benefit from things like caching and determinism. On top of that we need to consider the scale of the time-saving in light of the number of calls made (or not made).

Having said that, the PL/SQL function cache is clearly the thing that gives us the most benefit in return for a simple implementation. We should remember that the cache is “public” – i.e. stored in the SGA – and each input value takes up another bit of the public result cache: on the plus side this means that everyone else calling this function gets the benefit of our prior calculation; on the minus side this means if we use the result cache for the wrong function then we could take up a lot of space in the cache for very little benefit. Since the relevant result cache latch has not child latches it’s also possible to end up with latch contention if the too many sessions are taking advantage of the result cache too frequently.

Although we might expect a deterministic function to give us a big benefit (in the “very repetitive inputs” case), we find that the deterministic keyword has no effect (as at 12.1.0.2) in functions declared in a WITH clause or declared as standalone with pragma UDF. Fortunately scalar subquery caching (which seems to use the same hashing algorithm as the deterministic caching algorithm) still works with WITH functions or UDF functions and (slightly surprisingly) standalone functions declared with pragma UDF seem to have a very small edge over WITH functions.

Both of the latter two approaches use a local cache with a strictly limited size. The benefit is that the locality means they won’t interfere with other sessions or end up hogging an extreme amount of a public memory area; the drawback is that the size limit (which can be adjusted with a hidden parameter) means that you can get unlucky with hash collisions and end up with extremely variable performance from day to day because of a small change in the data being processed, or even a change in the order in which an unchanged data set is processed.

Footnote

I’ll leave you with one thought. Without declaring the original function as deterministic, I ran the following query to maximise the benefit from the scalar subquery caching algorithm:


select
        count(distinct(select func_normal(n1) from dual))
from    (
        select /*+ no_eliminate_oby */  n1 from t1 order by n1
        )
;

The query completed in 0.13 seconds: the subquery was called 1,000 times (once for each distinct value – see this ancient URL), and the benefit of eliminating the function calls outweighed the cost of having to sort the data set.

Ref: 12c_function_options.sql

September 28, 2015

Result Cache 2

Filed under: 12c,Flashback,Oracle,Performance — Jonathan Lewis @ 8:50 am GMT Sep 28,2015

Following on from my earlier posting of problems with temporary table and the PL/SQL result cache (a combination which the manuals warn you against) here’s another problem – again, to a large degree, self-inflicted.

Imagine you have a complex report involving a large number of financial transactions with a need to include calculations about current exchange rates. Unfortunately the rules about calculating the appropriate exchange rate for any transaction are complex and you find you have a choice between adding 6 tables with outer joins and a couple of aggregate (max) subqueries to the base query or calling a PL/SQL function to calculate the exchange rate for each row. I’m going to create an extremely simplified model of this requirement:

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        case mod(rownum,2) when 0 then 'USD' else 'GBP' end v1,
        case mod(rownum,2) when 0 then 'GBP' else 'USD' end v2
from
        generator       v1
where
        rownum <= 1e3
;

create table t2 (
        v1      varchar2(3),
        v2      varchar2(3),
        cvt     number(10,6),
        constraint t2_pk primary key(v1,v2)
)
organization index
;

insert into t2 values('USD','GBP',0);
insert into t2 values('GBP','USD',1);

commit;

create or replace function plsql_ordinary(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number is
        m_ret number;
begin
        select /*+ ordinary trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_ordinary;
/

execute dbms_stats.gather_table_stats(user,'t2')

My t1 table represents the set of transactions but only has to worry about two exchange rates, the dollar/sterling and its inverse. My t2 table is an exchange rate table and I’ve loaded it with the two exchange rates I’m interested in. My function plsql_ordinary() takes two currency codes as inputs and returns the exchange rate.

Here’s the modelling query, with a little infrastructure to examine the workload. Note that this will only run on 12c because of the inline PL/SQL function I’ve used for slowing the query down.

set pagesize 20
set arraysize 6

set serveroutput off
alter system flush shared_pool;
alter session set statistics_level = all;

with
        function wait_row_n (
                i_secs          number,
                i_return        number default -1
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end wait_row_n;
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt from t2 where t2.v1 = t1.v1 and t2.v2 = t1.v2) scalar_sub
from
        t1
where
        rownum <= 50
;

set pagesize 40

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

select  sql_id, executions, fetches, rows_processed, sql_text
from    v$sql
where   lower(sql_text) like 'select%trace_this%'
and     lower(sql_text) not like '%v$sql%'
;

The query includes a scalar subquery in the select list to get the same data as the PL/SQL function, and you’ll see the point of that in a while. Because of the arraysize and input parameters to wait_row_n() the query will produce output in batches of 6 rows roughly every two seconds for a total of about 18 seconds – which will give me plenty of time to play around in another session. Before I try to do any damage, though, let’s check the execution plan of the report and the statistics of the queries with “trace_this” in their text:


--------------------------------------------------------------------------------------
| Id  | Operation          | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |      1 |        |     50 |00:00:00.01 |      11 |
|*  1 |  INDEX UNIQUE SCAN | T2_PK |      2 |      1 |      2 |00:00:00.01 |       2 |
|*  2 |  COUNT STOPKEY     |       |      1 |        |     50 |00:00:00.01 |      11 |
|   3 |   TABLE ACCESS FULL| T1    |      1 |     50 |     50 |00:00:00.01 |      11 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."V1"=:B1 AND "T2"."V2"=:B2)
   2 - filter(ROWNUM<=50)


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------
f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

Looking at operation 1 in the plan you can see that the scalar subquery has started just twice (once for each distinct combination of currency codes).
Looking at the results from v$sql you can see that the query in the PL/SQL function was executed 50 times – once for each row. Functions like this can be a massive drain of resources (typically CPU, but also through latch contention on buffer cache and shared pool latches).

Conveniently we realise that in our system the derived exchange rates change very slowly – so how about telling Oracle that the exchange rate function is a deterministic function (it’s nearly true), or better still, perhaps, experiment with the PL/SQL Function Result Cache.

(Note very carefully, however, that the Database Administrators’ Manual for 11.2 makes the following comment about using the deterministic keyword with PL/SQL functions)

DETERMINISTIC

Tells the optimizer that the function returns the same value whenever it is invoked with the same parameter values (if this is not true, then specifying DETERMINISTIC causes unpredictable results).

Look carefully at that “unpredictable” – it’s true … but you might not realise it until too late. Our PL/SQL function is NOT deterministic – after all a function that queries the database to produce a result may produce a different result every time it executes if someone keeps changing the underlying data – but we might wave our hands a bit and say that the table isn’t going to change while we’re running our report so it’s okay to pretend it’s deterministic, we might even make it a read-only table for the duration. Similar thoughts should cross our minds about declaring a PL/SQL function to the result cache – even though the manual doesn’t say anything quite so explicitly threatening about the result cache. But what the heck, let’s just do it and see what happens:

create or replace function plsql_result_cache(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number
result_cache
is
        m_ret number;
begin
        select /*+ result cache trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_result_cache;
/

create or replace function plsql_deterministic(
        from_cur        in varchar2,
        to_cur          in varchar2
)
return number
deterministic
is
        m_ret number;
begin
        select /*+ det trace_this */
                cvt
        into    m_ret
        from    t2
        where   v1 = from_cur
        and     v2 = to_cur
        ;

        return m_ret;

end plsql_deterministic;
/

...
select
        /*+ driver trace_this */
        wait_row_n(0.3,id),
        plsql_ordinary(v1,v2),
        plsql_deterministic(v1,v2),
        plsql_result_cache(v1,v2),
        (select /*+ scalar trace_this */ t2.cvt from t2 where t2.v1 = t1.v1 and t2.v2 = t1.v2) scalar_sub
from
...

All three functions returned the same set of results as the scalar subquery – and here’s the output from v$sql showing the impact of declaring a deteministic function and a result cache function (note that “result cache” is not a hint in the first statement, it’s just a convenient label):


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          2          2              2
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1


The simple function ran 50 times, the deteministic function ran 18 times, and the result cache function ran twice. It required just two executions to get the two distinct results needed, after which the session used the result cache rather than calling the function again.

The deterministic function only remembers its results for the duration of a single database call – in this case the fetch – so on each fetch the session has to re-populate the session’s “deterministic cache”, which takes 2 calls for each fetch, a total of 9 fetches * 2 calls = 18 calls.

Clearly, if I can get away with it safely, the PL/SQL function result cache looks like a real winner, with the deterministic option coming a close second (given that real life ought to be using a significantly larger fetch arraysize). So what could possibly go wrong ? Two things – first, the results … and if the results can go wrong there’s not much point in talking about the second thing at the moment.

My query runs for 18 seconds, I’m going to start another session while it runs and update one of the rows in the t2 table a few seconds after my long-running query starts. Here’s the SQL I’ve run, an extract from the query output, and the results from v$sql:


update  t2 set
        cvt = 2
where   v1 = 'USD' 
;

commit;


WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
                 9                     1                          1                         1          1
                10                     2                          0                         2          0
                11                     1                          1                         1          1
                12                     2                          0                         2          0
                13                     1                          1                         1          1
                14                     2                          2                         2          0
                15                     1                          1                         1          1
                16                     2                          2                         2          0


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg          4          4              4
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

The most important point is that we’ve got results that are not self-consistent – except for the scalar subquery results.

The SQL statements that are executed inside the PL/SQL functions do not run at the SCN of the driving query, each individual statement executes at its own starting SCN. This is an important point that is often overlooked when people write PL/SQL functions that are then called from SQL. The inline scalar subquery, on the other hand, always runs as at the start SCN of the driving query no matter how many times or how late in the lifetime of the driving query it runs.

If we examine the results we can see that the ordinary PL/SQL function and the result cache PL/SQL function changed their output the moment the commit took place (you’ll have to take my word on that, of course), but the deterministic function seemed to delay slightly. We can also see that the number of executions for the ordinary and deterministic functions didn’t change, but the result cache function doubled its executions.

Because of the way I’ve created my data and defined the function, the ordinary function executes its SQL once every row while the deterministic function executes its SQL twice every fetch (once for each pair of input values (though the first fetch from SQL*Plus is a special case) and then remembers the outputs for the rest of the fetch – this is why there is a delay before the deterministic function changes its output and doesn’t introduce any extra calls to the SQL – it was going to re-execute for both values on the fetch starting at id 13 whatever went on around it; the result cache function gets an invalidation warning the moment the other session commits, so re-executes its SQL as many times as needed to repopulate the bits of the cache that have been discarded – and that’s why we see the number of SQL calls doubling, the relevant part of the cache was identified by some sort of hash value for the statement with SQL_ID = ’49buxp3gba3cg’ so both results were discarded and reloaded even though only one of them actually had to change.

Critically every execution of the recursive statements runs at the then-current SCN – so when the underlying data changes our report sees those changes, the report is not self-consistent.

Fortunately there’s something we can do about this – if we want the whole report to operate at the same SCN all we need to do is freeze our session at a specific point in time with the command “set transaction read only;”. This has to be executed as the first statement of a transaction but if we can slip it in just before we start running our report we can ensure that all our SQL statements (including the recursive ones) run as at the same SCN until we issue a commit. When I repeated the data-changing experiment after setting the transaction read only the report ran to completion showing the correct results.

But there was a significant change in the output from v$sql:


SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED
------------- ---------- ---------- --------------
SQL_TEXT
-----------------------------------------------------------------------------------
49buxp3gba3cg         44         44             44
SELECT /*+ result cache trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

2sh7bm59dkwhw         18         18             18
SELECT /*+ det trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

f1bz07bk5rbth         50         50             50
SELECT /*+ ordinary trace_this */ CVT FROM T2 WHERE V1 = :B2 AND V2 = :B1

I did my update just after the first batch of rows had appeared on screen – notice how the result cache SQL has executed 44 times instead of 2 (or 4) times. When I set my transaction to read only it looks as if my session stopped using the result cache the moment the other session commited – and that’s a GOOD THING. If my session were able to continue using the result cache that would mean one of two things, either I would be seeing a result created by another user – which would be too new for me, or every other session would be seeing the results from my session – which would (typically) be out of date for them. The session seems to have protected itself from the risk of a result cached PL/SQL function producing inconsistent results – but the workload changed the moment another session committed a change to the data we were interested in.

At that point I stopped investigating “set transaction read only” until a couple of days later when I realised that there was one more thing I had to test – when I changed the data from another session I didn’t check to see what that session would see when it executed the cached function, so I modified the code run by the other session to do the following:


update  t2 set 
        cvt = 2 
where   v1 = 'USD' 
;

commit;

execute dbms_lock.sleep(6)
execute dbms_output.put_line(plsql_result_cache('USD','GBP'))

The other session goes to sleep for a while (to let the reporting session get through a little more work) and then calls the function. I was very pleased to see that the session returned the correct result – the value 2 that it had just written to the table. But what I got from the reporting session wasn’t so good:

WAIT_ROW_N(0.3,ID) PLSQL_ORDINARY(V1,V2) PLSQL_DETERMINISTIC(V1,V2) PLSQL_RESULT_CACHE(V1,V2) SCALAR_SUB
------------------ --------------------- -------------------------- ------------------------- ----------
                 1                     1                          1                         1          1
                 2                     0                          0                         0          0
                 3                     1                          1                         1          1
                 4                     0                          0                         0          0
                 5                     1                          1                         1          1
                 6                     0                          0                         0          0
                 7                     1                          1                         1          1
                 8                     0                          0                         0          0
...
                24                     0                          0                         0          0
                25                     1                          1                         1          1
                26                     0                          0                         0          0
                27                     1                          1                         1          1
                28                     0                          0                         0          0
                29                     1                          1                         1          1
                30                     0                          0                         2          0
                31                     1                          1                         1          1
                32                     0                          0                         2          0

SQL_ID        EXECUTIONS    FETCHES ROWS_PROCESSED SQL_TEXT
------------- ---------- ---------- -------------- --------------------------------
49buxp3gba3cg         32         32             32 SELECT /*+ result cache trace_th
                                                   is */ CVT FROM T2 WHERE V1 = :B2
                                                    AND V2 = :B1

49buxp3gba3cg          1          1              1 SELECT /*+ result cache trace_th
                                                   is */ CVT FROM T2 WHERE V1 = :B2
                                                    AND V2 = :B1

2sh7bm59dkwhw         18         18             18 SELECT /*+ det trace_this */ CVT
                                                    FROM T2 WHERE V1 = :B2 AND V2 = 
                                                    :B1    

f1bz07bk5rbth         50         50             50 SELECT /*+ ordinary trace_this * 
                                                   / CVT FROM T2 WHERE V1 = :B2 AND
                                                    V2 = :B1

I changed t2 just after the first batch of rows had appeared (just after id 6), then called the function a few seconds later – and as I called the function from the other session it queried the data (the second child to 49buxp3gba3cg, executed just once above) and reloaded the result cache. At that moment (visible at id 30) the first session found data in the result cache and stopped re-executing its queries. When my session went read only it protected other sessions from the past by NOT re-populating the result cache as it ran its queries – but if it found something in the result cache it used it (notice how it has recorded 32 executions of the query rather than 44 – it found something in the result cache on the last 12 calls of the function). The protection only goes one way.

Observations

Using PL/SQL functions in the select list to encapsulate complex calculations that query the database is not something you can do casually. You have no guarantee that you will end up with a self-consistent result unless you take action to deal with the threats introduced by concurrent activity – ideally all tables involved should be set to read-only (which is only possible in 12c [Ed: see comment below] became possible from 11.1 onwards, though you can set a whole tablespace readonly in earlier versions: neither strategy is commonly viable). If you decide that you can work around those threats you still have the performance threat implied by the need to do some complex work for each driving row of your result set. For a report the simple solution to consistency is to “freeze” the session SCN by setting your session (transaction) into read only mode.

Once you’ve dealt with the consistency problem you can start to address the performance problen by claiming that you were using deterministic functions. You might be tempted to use the PL/SQL Result Cache to give you an even larger performance boost, but if you do you really have to make the table (or tablespace) read-only to be protected from read-consistency problems. The deterministic strategy may not be as dramatic in its effects as the result cache strategy but, given a reasonably sized fetch array, the performance benefit you get may be all you really need.

Whatever else you do, there’s an important underlying threat to remember. The moment you create a PL/SQL function that uses the result cache or deterministic option you have to ensure that nobody uses that function without ensuring that their code has handled the consistency threat properly. It’s too easy to forget, with the passing of time, that certain procedures have to be adopted when particular coding strategies are used.

Left as Exercise

I was planning to write a further article going into some detail about using dbms_flashback.enable_at_time(systimestamp) instead of “set transaction read only” – a mechanism that might be used to achieve the same read-consistency requirement though, perhaps, used less frequently than the older, less flexible option. I was also planning to include notes in the same araticle about the effect of “alter session set isolation_level = serializable” which some systems probably use to get read-consistency across multiple statements while writing results back to the database.

Both strategies run into the same issue as “set transaction read only”, with the added effect that your session (the one that has tried to “fix” its SCN) will repopulate the cache, so not only could you see newer results from other sessions in the cache; other sessions could see older results because you put them into the cache.

I’m not going to write up these demonstrations (which require fairly simple modifications to the code supplied above) as all the phenomena have been recorded as bugs on MoS (along with the GTT problem from my previous post, and a documentation bug for the PL/SQL manual to clarify the various threats):

  • Bug 21905592 : PL/SQL RESULT CACHE MUST BE BYPASSSED WHEN SESSION SCN NOT CURRENT
  • Bug 21905695 : BAD X-SESSION RESULTS W/ RESULT-CACHED PL/SQL FN AND “SET TXN READ ONLY”
  • Bug 21907155 : PL/SQL RC FN SHOWS ONE SESSION THE CONTENTS OF ANOTHER SESSION’S GTT
  • Bug 21885173 : THE LNPLS SECTION “RESTRICTIONS ON RESULT-CACHED FUNCTIONS” NEEDS RE-WRITE

Footnote

I haven’t mentioned it here, but another strategy for reducing the number of PL/SQL calls is simply to wrap the function call inside a scalar subquery, along the lines of:

select
       (select plsql_ordinary(v1, v2) from dual),
       ...

Provided you don’t have many combinations of (v1,v2) to handle, and with a little luck with Oracle’s internal hashing function, you could find that scalar subquery caching reduces your execution count from once per row to once per combination. Note that the function is the “ordinary” function, not the “fake deterministic” one; internally Oracle uses the same hashing strategy for remembering the results, but the duration of the scalar subquery cache is the statement rather than the fetch.

 

September 22, 2015

Result Cache

Filed under: 12c,Bugs,Infrastructure,Oracle,Performance — Jonathan Lewis @ 10:11 am GMT Sep 22,2015

Yesterday I thought I’d spend half an hour before breakfast creating a little demonstration of a feature; some time about midnight I felt it was time to stop because I’d spent enough time chasing around a couple of bugs that produced wrong results in a variety of ways. Today’s short post is just little warning: be VERY careful what you do with the PL/SQL result cache – if you use the results of database queries in the cache you may end up with inconsistent results in your application. Here’s one very simple example of what can go wrong, starting with a little script:


alter system flush shared_pool;

truncate table gtt1;
drop table gtt1;
create global temporary table gtt1 (n1 number)
on commit preserve rows
;

insert into gtt1 values(1);
commit;

execute dbms_stats.gather_table_stats(user,'gtt1');


create or replace function f_cache
return number 
result_cache 
-- relies_on (gtt1)
is
        m_ret number;
begin
        select  max(n1)
        into    f_cache.m_ret
        from    gtt1
        ;

        return f_cache.m_ret;
end;
/

execute dbms_output.put_line(f_cache)

Here’s the output from a session that’s just connected and run this script (the table already existed from an earlier run):

SQL> @temp

System altered.


Table truncated.


Table dropped.


Table created.


1 row created.


Commit complete.


PL/SQL procedure successfully completed.


Function created.

1

PL/SQL procedure successfully completed.

SQL> 

Let’s call this session A, and I’m going to connect with another session which I’ll call session B. The following extracts are cut-and-pastes as I alternate between sessions:

Session B:

SQL> execute dbms_output.put_line(f_cache);
1

PL/SQL procedure successfully completed.

SQL> insert into gtt1 values(0);

1 row created.

SQL> execute dbms_output.put_line(f_cache);
0

PL/SQL procedure successfully completed.


Session B has just seen the data inserted into a global temporary table by session A; but after inserting a row into the table it now sees what it perceives to be the correct answer.

Session A:

SQL> truncate table gtt1;

Table truncated.

SQL> execute dbms_output.put_line(f_cache);
1

PL/SQL procedure successfully completed.

Despite truncating the table, session A still sees the data which has been eliminated unrecoverably.

Session B (where I hadn’t yet committed):


SQL> commit;

Commit complete.

SQL> 

Session A (where I’ve done nothing new):

SQL> execute dbms_output.put_line(f_cache);


PL/SQL procedure successfully completed.

SQL> 

The row has finally “disappeared” because session B committed.

Session B (where I haven’t done anything since committing):


SQL> execute dbms_output.put_line(f_cache);


PL/SQL procedure successfully completed.

SQL> 

Session B no longer sees the data because it’s now seeing what session A has just seen.

Warning.

This is just one of several ways I managed to get surprising inconsistencies when using the (cross-session) PL/SQL Result Cache. Oracle (in 12c) is supposed to know what the cache relies on and react accordingly, but it doesn’t manage to do it properly (even if you include the relies_on clause) – if you query the database in your PL/SQL you may find strange things happen.

The most important point to note in this example is that a session that wasn’t necessarily doing anything wrong got the wrong results because of the actions of another session that Oracle should have protected it from.

The good thing about this is example that it’s documented (sort of) – as it says in the manual: “to be result cached … does not reference … temporary tables ..”. So that should warn people off copying my example; on the other hand the problem I’m seeing arises because Oracle seems to be trying to use the result cache when the manuals suggests it shouldn’t be.

Conclusion

Do not mix the pl/sql result cache with database queries. The cache is public but (unlike the buffer cache) it is not guaranteed to give you read-consistency.

If you think this was an unrealistic example and you don’t need to worry about it – I’ll post a couple more examples in the next couple of weeks. They’ll be slightly special cases again, but I find the Oracle world is full of special cases.

Update

This behaviour is now visible on MoS as “Bug 21907155 : PL/SQL RC FN SHOWS ONE SESSION THE CONTENTS OF ANOTHER SESSION’S GTT”

September 2, 2015

IN/EXISTS bugs

Filed under: 12c,Bugs,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 8:11 am GMT Sep 2,2015

Here’s a simple data set – I’m only interested in three of the columns in the work that follows, but it’s a data set that I use for a number of different models:


execute dbms_random.seed(0)

create table t1
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		level <= 1e4
)
select
	trunc(dbms_random.value(0,1000))	n_1000,
	trunc(dbms_random.value(0,750))		n_750,
	trunc(dbms_random.value(0,600))		n_600,
	trunc(dbms_random.value(0,400))		n_400,
	trunc(dbms_random.value(0,90))		n_90,
	trunc(dbms_random.value(0,72))		n_72,
	trunc(dbms_random.value(0,40))		n_40,
	trunc(dbms_random.value(0,3))		n_3
from
	generator	v1,
	generator	v2
where
	rownum <= 1e6
;
create table t2 nologging 
as
select * from t1
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt 	 => 'for all columns size 1'
	);

	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T2',
		method_opt 	 => 'for all columns size 1'
	);
end;
/

The columns I want to consider are n_3, n_400, and n_1000. As their names suggest the columns have 3, 400, and 1000 distinct values respectively and since I’ve used the dbms_random.value() function to generate the data the distinct values are fairly evenly spread across the million rows of the table.

Consider, then, the following two queries:


select
        *
from
        t1
where
        exists (
                select  null
                from    t2
                where   n_1000 = 0
                and     t2.n_400 = t1.n_400
                and     t2.n_3 = t1.n_3
        )
;


select
        *
from
        t1
where
        (t1.n_400, t1.n_3) in (
                select  t2.n_400, t2.n_3
                from    t2
                where   t2.n_1000 = 0
        )
;

The first point to check is that these two queries are logically equivalent.

Once you’re happy with that idea we can work out, informally, how many rows we should expect the queries ought to return: there are 1,200 combinations for (n_400, n_3) so each combination should return roughly 833 rows; if we pick 1,000 rows from the 1 million available we can expect to see 679 of those combinations (that’s Alberto Dell’Era’s “selection without replacement” formula that Oracle uses for adjusting num_distinct to allow for filter predicates). So we might reasonably suggest that the final number of rows as 833 * 679 = 565,607. It turns out that that’s a pretty good estimate – when I ran the query the result was actually 567,018 rows.

So what does Oracle produce for the two execution plans – here are the result from 12c (EXISTS first, then IN):


===================
Multi-column EXISTS
===================
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   920K|    34M|  1259  (11)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT SEMI|      |   920K|    34M|  1259  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."N_400"="T1"."N_400" AND "T2"."N_3"="T1"."N_3")
   2 - filter("N_1000"=0)

===================
Equivalent IN query
===================
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   833K|    30M|  1259  (11)| 00:00:01 |
|*  1 |  HASH JOIN RIGHT SEMI|      |   833K|    30M|  1259  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   3 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T1"."N_400"="T2"."N_400" AND "T1"."N_3"="T2"."N_3")
   2 - filter("T2"."N_1000"=0)

The first thing to note is that the shape of the plans is identical, and the predicate sections are identical – but the final cardinalities are different. Clearly at least one of the cardinalities has to be wrong by a significant amount (7.5% or 10.4%, depending which way round you want to look at it). If you run the test on 11.2.0.4 you find that both plans give the same estimated row count – and it’s the 920,000 rows; so arguably 12c has “fixed” the IN subquery calculation, bringing it closer to a reasonable prediction, but it hasn’t fixed the EXISTS subquery calculation. That 833K prediction, by the way, is what you would expect to see with this data with a basic join – and a semi-join shouldn’t be able to produce more data than  a join.

But both predictions are way off the (informal) expectation, so how have they appeared ?

Working backwards it’s easy to spot that: 833K = 833 * 1,000: Oracle is behaving as if every single row identified in the subquery will produce a separate combination of (n_400, n_3). If we reverse engineer 920K we get: 920K / 833 = 1104 – it would appear that the optimizer thinks the 1,000 rows produced by the subquery will produce 1,104 distinct combinations of (n_400, n_3) so we how did the impossible 1,104 appear in the arithmetic.

If you apply the “selection without replacement” formula to picking 1,000 rows with 400 distinct values from 1,000,000 rows the expected number of distinct values (with rounding) will be 368; if you apply the formula for picking 1,000 rows with 3 distinct values from 1,000,000 rows the expected number will be 3. And 3 * 368 = 1,104. (Remember that in my original estimate I applied the formula after multiplying out the combination of distinct values). The optimizer is using its standard methods, but using internediate results in an unsuitable fashion.

It’s impossible to say what the impact of this particular code path – and the change on the upgrade – might be. The optimizer has over-estimated by 47% in one case and 62% in the other but (a) there may be something about my data that exaggerated an effect that few people will see in the wild and (b) in many cases getting in the right ballpark is enough to get a reasonable plan, and a factor of 2 is the right ballpark.

Of course, a few people will be unlucky with a few queries on the upgrade where the estimate changes – after all a single row difference in the estimate can cause the optimizer to flip between a hash join and a nested loop – but at least you’ve got a little extra information that might help when you see a bad estimate on an important semi-join.

So is there a workaround ? Given that I’ve got 12c, the obvious thing to try is to create a column group at both ends of the semi-join and see what happens. It shouldn’t really make any difference because column groups are targeted at the problems of correlated column – but we might as well try it:


execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns (n_400,n_3) size 1')
execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for columns (n_400,n_3) size 1')

Unfortunately when I did this the final cardinality estimate for both queries dropped to just 833 (the absence of a K on the end isn’t a typo!).

Manually unnesting got me closer:


select
        *
from
        (
        select  distinct n_3, n_400
        from    t2
        where   n_1000 = 0
        )       sq,
        t1
where   
        sq.n_400 = t1.n_400
and     sq.n_3 = t1.n_3
;

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |   649K|    33M|  1260  (11)| 00:00:01 |
|*  1 |  HASH JOIN           |      |   649K|    33M|  1260  (11)| 00:00:01 |
|   2 |   VIEW               |      |   779 | 20254 |   612   (8)| 00:00:01 |
|   3 |    HASH UNIQUE       |      |   779 |  8569 |   612   (8)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL| T2   |  1000 | 11000 |   610   (8)| 00:00:01 |
|   5 |   TABLE ACCESS FULL  | T1   |  1000K|    26M|   628  (11)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("SQ"."N_400"="T1"."N_400" AND "SQ"."N_3"="T1"."N_3")
   4 - filter("N_1000"=0)

The cardinality of 649K is (allowing for rounding) 833 * 779; so we need to know where the 779 came from. It’s the optimizer standard arithmetic for “distinct” – multiply the N individual selectivities together then divide by the sqrt(2) “N-1” times. So we apply the “selection without replacement formula twice”:

  • adjusted selectivity of n_400 = 367.21
  • adjusted selectivity of n_3 = 3
  • 367.21 * 3 / sqrt(2) = 779

If you create column group statistics for (n_400, n_3) this doesn’t change the optimizer’s estimate for the number of distinct combinations after selection – maybe that’s another enhancement in the pipeline – but, at least in this case, the manual unnesting has got us a little closer to the right estimates without any statistical intervention.

Footnote:

Just for the sake of completeness, here are the plans (with yet more cardinality predictions) that you get if you block the unnesting:


select 
	*
from 
	t1 
where 
	exists (
		select	
			/*+ no_unnest */
			null  
		from	t2 
		where	n_1000 = 0 
		and	t2.n_400 = t1.n_400 
		and	t2.n_3 = t1.n_3
	)
;



---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  1179 | 33012 |   766K (12)| 00:00:30 |
|*  1 |  FILTER            |      |       |       |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|    26M|   632  (11)| 00:00:01 |
|*  3 |   TABLE ACCESS FULL| T2   |     1 |    11 |   638  (12)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T2" "T2" WHERE
              "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2))
   3 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)



=====================================
Unnesting blocked and subquery pushed
=====================================
select 
	*
from 
	t1 
where 
	exists (
		select	
			/*+ no_unnest push_subq */
			null  
		from	t2 
		where	n_1000 = 0 
		and	t2.n_400 = t1.n_400 
		and	t2.n_3 = t1.n_3
	)
;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 50000 |  1367K|  1271  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL | T1   | 50000 |  1367K|   632  (11)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL| T2   |     1 |    11 |   638  (12)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ PUSH_SUBQ NO_UNNEST */ 0 FROM "T2"
              "T2" WHERE "N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2))
   2 - filter("N_1000"=0 AND "T2"."N_400"=:B1 AND "T2"."N_3"=:B2)

The 1179 comes from the magic of sqrt(2):  1179 = 1,000,000 / (400 * 3 / sqrt(2)).

The 50,000 is just the basic “I dunno, let’s call it 5%”.

 

Reference script: aggregate_selectivity_c.sql

 

August 29, 2015

Index Usage – 2

Filed under: 12c,Function based indexes,Indexing,Oracle — Jonathan Lewis @ 11:33 am GMT Aug 29,2015

I’ve been a little slow in the follow-up to my previous posting on possibly redundant indexes. Before going into the slightly more complex stuff, there’s another peripheral point (but a very important one) that’s worth raising about how clever the optimizer can be. Here’s some code for 11.2.0.4 to demonstrate the point:

create table t1
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		level <= 1e4
)
select
	rownum					id,
	trunc(sysdate,'MM') + (rownum-1)/1440	date_time,
	rpad('x',100)				padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;


alter table t1 
add (
        date_only
	generated always as (trunc(date_time)) virtual 
)
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create index t1_i1 on t1(date_only) nologging;

So, in a two-step process, I’ve got an indexed virtual column that holds the value of the date_time column truncated to just the date. Would you expect the optimizer to use the index to execute the following query efficiently:


select
        max(id)
from
        t1
where
        date_time between sysdate-1 and sysdate
;

Note that the query references the real date_time column not the virtual column date_only, and it’s not using the expression that defines the index – yet the plan reads as follows:


-----------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |     1 |    21 |    86   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE                       |       |     1 |    21 |            |          |
|*  2 |   FILTER                              |       |       |       |            |          |
|*  3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1    |  1442 | 30282 |    86   (2)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_I1 |  4306 |       |    13   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(SYSDATE@!>=SYSDATE@!-1)
   3 - filter("DATE_TIME"<=SYSDATE@! AND "DATE_TIME">=SYSDATE@!-1)
   4 - access("T1"."DATE_ONLY">=TRUNC(SYSDATE@!-1) AND
              "T1"."DATE_ONLY"<=TRUNC(SYSDATE@!))

It’s a little odd that even though the optimizer in the newer versions of Oracle treats many simple expressions on sysdate as constants it still checks (operation 2) that “sysdate >= sysdate – 1” but perhaps that’s just a case of a piece of generic code that isn’t worth the risk or effort of changing.

The key point, of course, is that Oracle has managed to generate some extra predicates that allow it to use the “wrong” index to get a first approximation of the result set fairly efficiently, and then used the original predicate to reduce the approximation down to the correct result set.

If you want a quick sanity check on the access predicates used for operation 4:

  • If date_time >= sysdate-1, then trunc(date_time) >= trunc(sysdate-1)
  • If date_time <= sysdate, then trunc(date_time) <= trunc(sysdate)

This style of predicate manipulation also works numeric data types, but I think its greatest benefit (or convenience) is likely to come from date data types where the data has been created with a time component but there are frequent “date-only” queries. The days of creating two indexes as a workaround for handling generated code that wants to deal with both date_time and trunc(date_time) predicates should be numbered.

Footnote:

This enhancement probably appeared in 11.2.0.2, and I first saw it described in October 2013 in this blog note by Mohamed Houri; but 12c offers a delightful little enhancement – here’s what my table looks like in the 12c version of the code:


SQL> desc t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ID                                     NUMBER
 DATE_TIME                              DATE
 PADDING                                VARCHAR2(100)

SQL> 

Where’s the virtual column ? The 12c version of my code had a slightly different definition for it:


alter table t1  
add (
        date_only
        invisible
        generated always as (trunc(date_time)) virtual
)
;

The transformation still works even when the virtual column is invisible. So (subject to searching for anomalies, boundary conditions and bugs) it looks as if you can change the table definition, and get the benefits of two indexes for the price of one without the application realising that anything has changed.

August 7, 2015

CBO catchup

Filed under: 12c,CBO,Oracle,Partitioning — Jonathan Lewis @ 1:10 pm GMT Aug 7,2015

It’s interesting to watch the CBO evolving and see how an enhancement in one piece of code doesn’t necessarily echo through to all the other places it seems to fit. Here’s an example of an enhancement that spoiled (or, rather, made slightly more complicated) a little demonstration I had been running for about the last 15  years  – but (in a fashion akin to another partitioning limitation) doesn’t always work in exactly the way you might expect.

I wrote a note some time ago about the way that the optimizer could pre-compute the result of what I called a “fixed subquery” (such as “select 9100 from dual”) and take advantage of the value it derived to do a better job of estimating the cardinality for a query. That’s a neat feature (although it may cause some 3rd party applications a lot of pain as plans change on the upgrade to 11.2.0.4 or 12c) but it doesn’t work everywhere you might hope.

I’m going to create two (small) tables with the same data, but one of them is going to be a simple heap table and the other is going to be partitioned by range; then I’m going to run the same queries against the pair of them and show you the differences in execution plans. First the tables:


create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        lpad(rownum,10,'0')     v1,
        rpad('x',100)           padding
from
        generator       v1
where
        rownum <= 1e4
;

create table pt1(
        id, v1, padding
)
partition by range (id) (
        partition p02000 values less than ( 2001),
        partition p04000 values less than ( 4001),
        partition p06000 values less than ( 6001),
        partition p08000 values less than ( 8001),
        partition p10000 values less than (10001)
)
as
select * from t1
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'PT1',
                granularity      =>'ALL',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

alter table  t1 add constraint  t1_pk primary key(id);
alter table pt1 add constraint pt1_pk primary key(id) using index local;

create or replace function f(i_in  number)
return number
is
begin
        return i_in;
end;
/

Note that I’ve used ‘ALL’ as my granularity option – for such small tables this should mean that the statistics at the partition and global level are as accurate as they can be. And since the data is defined to be uniform I don’t expect the partitioning to introduce any peculiarities in the optimizer’s calculations of selectivity and cardinality. I’ve created the indexes after gathering stats on the tables – this is 12c (and 11.2.0.4) so the index stats will be collected with a 100% sample as the indexes are created. Finally I’ve created a function that simply returns its numeric input.

Now let’s run a couple of queries against the simple table and check the cardinality (Rows) predicted by the optimizer – the two plans follow the code that generated them:

set serveroutput off

select  max(v1)
from    t1
where   id between (select 500 from dual)
           and     (select 599 from dual)
;

select * from table(dbms_xplan.display_cursor);

select  max(v1)
from    t1
where   id between (select f(500) from dual)
           and     (select f(599) from dual)
;

select * from table(dbms_xplan.display_cursor);

======================
Actual Execution Plans
======================

select max(v1) from t1 where id between (select 500 from dual)
  and     (select 599 from dual)

----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |       |       |     4 (100)|          |
|   1 |  SORT AGGREGATE                      |       |     1 |    15 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   101 |  1515 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T1_PK |   101 |       |     2   (0)| 00:00:01 |
|   4 |     FAST DUAL                        |       |     1 |       |     2   (0)| 00:00:01 |
|   5 |     FAST DUAL                        |       |     1 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("ID">= AND "ID"<=)

select max(v1) from t1 where id between (select f(500) from dual)
     and     (select f(599) from dual)

----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |       |       |     3 (100)|          |
|   1 |  SORT AGGREGATE                      |       |     1 |    15 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |    25 |   375 |     3   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T1_PK |    45 |       |     2   (0)| 00:00:01 |
|   4 |     FAST DUAL                        |       |     1 |       |     2   (0)| 00:00:01 |
|   5 |     FAST DUAL                        |       |     1 |       |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("ID">= AND "ID"<=)

In the first plan the optimizer has recognised the values 500 and 599, so its range-based calculation has produced a (matching, and nearly correct) prediction of 101 rows. In the second plan the function call has hidden the values so the optimizer has had to use the arithmetic for “ranges with unknown values” – which means it uses guesses for the selectivity of 0.45% for the index and 0.25% for the table. Maybe in a future release that f(500) will be evaluated in the same way that we can trigger in-list calculation with the precompute_subquery hint.

Now we repeat the query, but using the partitioned table – showing only the trimmed output from dbms_xplan.display_cursor():

select max(v1) from pt1 where id between (select 500 from dual)
   and     (select 599 from dual)

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |        |       |       |     4 (100)|          |       |       |
|   1 |  SORT AGGREGATE                             |        |     1 |    15 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR                  |        |   101 |  1515 |     4   (0)| 00:00:01 |   KEY |   KEY |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| PT1    |   101 |  1515 |     4   (0)| 00:00:01 |   KEY |   KEY |
|*  4 |     INDEX RANGE SCAN                        | PT1_PK |   101 |       |     2   (0)| 00:00:01 |   KEY |   KEY |
|   5 |      FAST DUAL                              |        |     1 |       |     2   (0)| 00:00:01 |       |       |
|   6 |      FAST DUAL                              |        |     1 |       |     2   (0)| 00:00:01 |       |       |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ID">= AND "ID"<=)

select max(v1) from pt1 where id between (select f(500) from dual)
      and     (select f(599) from dual)

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name   | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |        |       |       |     3 (100)|          |       |       |
|   1 |  SORT AGGREGATE                             |        |     1 |    15 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR                  |        |    25 |   375 |     3   (0)| 00:00:01 |   KEY |   KEY |
|   3 |    TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| PT1    |    25 |   375 |     3   (0)| 00:00:01 |   KEY |   KEY |
|*  4 |     INDEX RANGE SCAN                        | PT1_PK |    45 |       |     2   (0)| 00:00:01 |   KEY |   KEY |
|   5 |      FAST DUAL                              |        |     1 |       |     2   (0)| 00:00:01 |       |       |
|   6 |      FAST DUAL                              |        |     1 |       |     2   (0)| 00:00:01 |       |       |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ID">= AND "ID"<=)

It’s great to see that the predicted cardinalities match the simple heap version exactly – but can you see anything odd about either of these plans ?

 

 

Pause for thought …

 

 

There’s nothing odd about the second plan but there’s a little puzzle in the first.

In theory, it seems, the optimizer is aware that the first query covers the range 500 – 599; so why are the pstart/pstop columns for operations 2-4 showing KEY-KEY, which usually means the optimizer knows that it will have some partition key values at run-time and will be able to do run-time partition elimination, but it doesn’t know what those key values are at parse time.

In this very simple case it’s (probably) not going to make any difference to the performance – but it may be worth some careful experimentation in more complex cases where you might have been hoping to get strict identification of partitions and partition-wise joins taking place. Yet another topic to put on the todo list of “pre-emptive investigations” with a reminder to re-run the tests from time to time.

 

 

July 20, 2015

12c Downgrade

Filed under: 12c,Bugs,CBO,Oracle — Jonathan Lewis @ 1:12 pm GMT Jul 20,2015

No, not really – but sometimes the optimizer gets better and gives you worse performance as a side effect when you upgrade. Here’s an example where 11.2.0.4 recognised (with a few hints) the case for a nested loop semi-join and 12c went a bit further and recognised the opportunity for doing a cunning “semi_to_inner” transformation … which just happened to do more work than the 11g plan.

Here’s a data set to get things going, I’ve got “parent” and “child” tables, but in this particular demonstration I won’t be invoking referential integrity:


create table chi
as
with generator as (
        select  --+ materialize
                rownum  id
        from dual
        connect by
                level <= 1e4
)
select
        rownum - 1                              id,
        trunc((rownum-1)/10)                    n1,
        trunc(dbms_random.value(0,1000))        n2,
        rpad('x',1000)                          padding
from
        generator
;

create table par
as
with generator as (
        select  --+ materialize
                rownum  id
        from dual
        connect by
                level <= 1e4
)
select
        rownum - 1      id,
        rpad('x',1000)  padding
from
        generator
where
        rownum <= 1e3
;

alter table par modify id not null;
alter table par add constraint par_pk primary key(id)
-- deferrable
;

-- Now gather stats on the tables.

The code uses my standard framework that could generate a few million rows even though it’s only generating 1,000 in par and 10,000 in chi. The presence of the commented “deferrable” for the primary key constraint is for a secondary demonstration.

You’ll notice that the 1,000 values that appear in chi.n1 and chi.n2 are matched by the 1,000 rows that appear in the primary key of par – in some other experiment I’ve got two foreign keys from chi to par. Take note that the values in n1 are very well clustered because of the call to trunc() while the values in n2 are evenly scattered because of the call to dbms_random() – the data patterns are very different although the data content is very similar (the randomised data will still produce, on average, 10 rows per value).

So here’s the test code:


set serveroutput off
set linesize 156
set trimspool on
set pagesize 60

alter session set statistics_level = all;

prompt  =============================
prompt  Strictly ordered driving data
prompt  =============================

select
        /*+
                leading(@sel$5da710d3 chi@sel$1 par@sel$2)
                full   (@sel$5da710d3 chi@sel$1)
                use_nl (@sel$5da710d3 par@sel$2)
                index  (@sel$5da710d3 par@sel$2 (par.id))
        */
        count(*)
from
        chi
where   exists (
                select null
                from par
                where par.id = chi.n1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline alias cost'));

prompt  =============================
prompt  Randomly ordered driving data
prompt  =============================

select
        /*+
                leading(@sel$5da710d3 chi@sel$1 par@sel$2)
                full   (@sel$5da710d3 chi@sel$1)
                use_nl (@sel$5da710d3 par@sel$2)
                index  (@sel$5da710d3 par@sel$2 (par.id))
        */
        count(*)
from
        chi
where   exists (
                select null
                from par
                where par.id = chi.n2
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline alias cost'));

set serveroutput on
alter session set statistics_level = typical;

In both cases I’ve hinted the query quite heavily, using internally generated query block names, into running with a nested loop semi-join from chi to par. Since there are 10,000 rows in chi with no filter predicates, you might expect to see the probe into the par table starting 10,000 times returning (thanks to our perfect data match) one row for each start. Here are the run-time plans with rowsource execution stats from 11.2.0.4

=============================
Strictly ordered driving data
=============================

--------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   190 (100)|      1 |00:00:00.14 |    1450 |   1041 |
|   1 |  SORT AGGREGATE     |        |      1 |      1 |            |      1 |00:00:00.14 |    1450 |   1041 |
|   2 |   NESTED LOOPS SEMI |        |      1 |  10065 |   190   (4)|  10000 |00:00:00.12 |    1450 |   1041 |
|   3 |    TABLE ACCESS FULL| CHI    |      1 |  10065 |   186   (2)|  10000 |00:00:00.07 |    1434 |   1037 |
|*  4 |    INDEX UNIQUE SCAN| PAR_PK |   1000 |   1048 |     0   (0)|   1000 |00:00:00.01 |      16 |      4 |
--------------------------------------------------------------------------------------------------------------

=============================
Randomly ordered driving data
=============================

-----------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   190 (100)|      1 |00:00:00.12 |    5544 |
|   1 |  SORT AGGREGATE     |        |      1 |      1 |            |      1 |00:00:00.12 |    5544 |
|   2 |   NESTED LOOPS SEMI |        |      1 |  10065 |   190   (4)|  10000 |00:00:00.10 |    5544 |
|   3 |    TABLE ACCESS FULL| CHI    |      1 |  10065 |   186   (2)|  10000 |00:00:00.02 |    1434 |
|*  4 |    INDEX UNIQUE SCAN| PAR_PK |   4033 |   1048 |     0   (0)|   4033 |00:00:00.02 |    4110 |
-----------------------------------------------------------------------------------------------------

Notice how we do 1,000 starts of operation 4 when the data is well ordered, and 4,033 starts when the data is randomly ordered. For a semi-join nested loop the run-time engine uses the same caching mechanism as it does for scalar subqueries – a fact you can corroborate by removing the current hints and putting the /*+ no_unnest */ hint into the subquery so that you get a filter subquery plan, in which you will note exactly the same number of starts of the filter subquery.

As an extra benefit you’ll notice that the index probes for the well-ordered data have managed to take advantage of buffer pinning (statistic “buffer is pinned count”) – keeping the root block and most recent leaf block of the par_pk index pinned almost continually through the query; while the randomised data access unfortunately required Oracle to unpin and repin the index leaf blocks (even though there were only 2 in the index) as the scan of chi progessed.

Time to upgrade to 12.1.0.2 and see what happens:

=============================
Strictly ordered driving data
=============================

--------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   189 (100)|      1 |00:00:00.22 |    1448 |   1456 |
|   1 |  SORT AGGREGATE     |        |      1 |      1 |            |      1 |00:00:00.22 |    1448 |   1456 |
|   2 |   NESTED LOOPS      |        |      1 |  10000 |   189   (4)|  10000 |00:00:00.20 |    1448 |   1456 |
|   3 |    TABLE ACCESS FULL| CHI    |      1 |  10000 |   185   (2)|  10000 |00:00:00.03 |    1432 |   1429 |
|*  4 |    INDEX UNIQUE SCAN| PAR_PK |  10000 |      1 |     0   (0)|  10000 |00:00:00.06 |      16 |     27 |
--------------------------------------------------------------------------------------------------------------

=============================
Randomly ordered driving data
=============================

--------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |        |      1 |        |   189 (100)|      1 |00:00:00.22 |   11588 |   1429 |
|   1 |  SORT AGGREGATE     |        |      1 |      1 |            |      1 |00:00:00.22 |   11588 |   1429 |
|   2 |   NESTED LOOPS      |        |      1 |  10000 |   189   (4)|  10000 |00:00:00.19 |   11588 |   1429 |
|   3 |    TABLE ACCESS FULL| CHI    |      1 |  10000 |   185   (2)|  10000 |00:00:00.03 |    1432 |   1429 |
|*  4 |    INDEX UNIQUE SCAN| PAR_PK |  10000 |      1 |     0   (0)|  10000 |00:00:00.07 |   10156 |      0 |
--------------------------------------------------------------------------------------------------------------

Take a close look at operation 2 – it’s no longer a NESTED LOOP SEMI, the optimizer has got so smart (recognising the nature of the primary key on par) that it’s done a “semi_to_inner” transformation. But a side effect of the transformation is that the scalar subquery caching mechanism no longer applies so we probe the par table 10,000 times. When the driving data is well-ordered this hasn’t made much difference to the buffer gets (and related latch activity), but when the data is randomised the extra probes ramp the buffer gets up even further.

The timings (A-time) on these experiments are not particularly trustworthy – the differences between cached reads and direct path reads introduced more variation than the difference in Starts and Buffers, and the total CPU load is pretty small anyway – and I suspect that this difference won’t make much difference to most people most of the time. No doubt, though, there will be a few cases where a small change like this could have a noticeable effect on some important queries.

Footnote

There is a hint /*+ no_semi_to_inner(@queryblock object_alias) */ that I thought might persuade the optimizer to stick with the semi-join, but it didn’t have any effect. Since the “semi to inner” transformation (and the associated hints) are available in 11.2.0.4 I was a little puzzled that (a) I didn’t see the same transformation in the 11g test, and (b) that I couldn’t hint the transformation.  This makes me wonder if there’s a defect in 11g that might be fixed in a future patch.

It’s also nice to think that the scalar subquery caching optimisation used in semi-joins might eventually become available  to standard joins (in cases such as “join to parent”, perhaps).

March 26, 2015

12c MView refresh

Filed under: 12c,Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 1:19 pm GMT Mar 26,2015

Some time ago I wrote a blog note describing a hack for refreshing a large materialized view with minimum overhead by taking advantage of a single-partition partitioned table. This note describes how Oracle 12c now gives you an official way of doing something similar – the “out of place” refresh.

I’ll start by creating a matieralized view and creating a couple of indexes on the resulting underlying table; then show you three different calls to refresh the view. The materialized view is based on all_objects so it can’t be made available for query rewrite (ORA-30354: Query rewrite not allowed on SYS relations) , and I haven’t created any materialized view logs so there’s no question of fast refreshes – but all I intend to do here is show you the relative impact of a complete refresh.


create materialized view mv_objects nologging
build immediate
refresh on demand
as
select
        *
from
        all_objects
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'mv_objects',
		method_opt 	 => 'for all columns size 1'
	);
end;
/

create index mv_obj_i1 on mv_objects(object_name) nologging compress;
create index mv_obj_i2 on mv_objects(object_type, owner, data_object_id) nologging compress 2;

This was a default install of 12c, so there were about 85,000 rows in the view. You’ll notice that I’ve created all the objects as “nologging” – this will have an effect on the work done during some of the refreshes.

Here are the three variants I used – all declared explicitly as complete refreshes:


begin
	dbms_mview.refresh(
		list			=> 'MV_OBJECTS',
		method			=> 'C',
		atomic_refresh		=> true
	);
end;
/

begin
	dbms_mview.refresh(
		list			=> 'MV_OBJECTS',
		method			=> 'C',
		atomic_refresh		=> false
	);
end;
/

begin
	dbms_mview.refresh(
		list			=> 'MV_OBJECTS',
		method			=> 'C',
		atomic_refresh		=> false,
		out_of_place		=> true
	);
end;
/

The first one (atomic_refresh=>true) is the one you have to use if you want to refresh several materialized views simultaneously and keep them self consistent, or if you want to ensure that the data doesn’t temporarily disappear if all you’re worried about is a single view. The refresh works by deleting all the rows from the materialized view then executing the definition to generate and insert the replacement rows before committing. This generates a lot of undo and redo – especially if you have indexes on the materialized view as these have to be maintained “row by row” and may leave users accessing and applying a lot of undo for read-consistency purposes. An example at a recent client site refreshed a table of 6.5M rows with two indexes, taking about 10 minutes to refresh, generating 7GB of redo as it ran, and performing 350,000 “physical reads for flashback new”. This strategy does not take advantage of the nologging nature of the objects – and as a side effect of the delete/insert cycle you’re likely to see the indexes grow to roughly twice their optimal size and you may see the statistic “recursive aborts on index block reclamation” climbing as the indexes are maintained.

The second option (atomic_refresh => false) is quick and efficient – but may result in wrong results showing up in any code that references the materialized view (whether explicitly or by rewrite). The session truncates the underlying table, sets any indexes on it unusable, then reloads the table with an insert /*+ append */. The append means you get virtually no undo generated, and if the table is declared nologging you get virtually no redo. In my case, the session then dispatched two jobs to rebuild the two indexes – and since the indexes were declared nologging the rebuilds generated virtually no redo. (I could have declared them with pctfree 0, which would also have made them as small as possible).

The final option is the 12c variant – the setting atomic_refresh => false is mandatory if we want  out_of_place => true. With these settings the session will create a new table with a name of the form RV$xxxxxx where xxxxxx is the hexadecimal version of the new object id, insert the new data into that table (though not using the /*+ append */ hint), create the indexes on that table (again with names like RV$xxxxxx – where xxxxxx is the index’s object_id). Once the new data has been indexed Oracle will do some name-switching in the data dictionary (shades of exchange partition) to make the new version of the materialized view visible. A quirky detail of the process is that the initial create of the new table and the final drop of the old table don’t show up in the trace file  [Ed: wrong, see comment #1] although the commands to drop and create indexes do appear. (The original table, though it’s dropped after the name switching, is not purged from the recyclebin.) The impact on undo and redo generation is significant – because the table is empty and has no indexes when the insert takes place the insert creates a lot less undo and redo than it would if the table had been emptied by a bulk delete – even though the insert is a normal insert and not an append; then the index creation honours my nologging definition, so produces very little redo. At the client site above, the redo generated dropped from 7GB to 200MB, and the time dropped to 200 seconds which was 99% CPU time.

Limitations, traps, and opportunities

The manuals say that the out of place refresh can only be used for materialized views that are joins or aggregates and, surprisingly, you actually can’t use the method on a view that simply extracts a subset of rows and columns from a single table.  There’s a simple workaround, though – join the table to DUAL (or some other single row table if you want to enable query rewrite).

Because the out of place refresh does an ordinary insert into a new table the resulting table will have no statistics – you’ll have to add a call to gather them. (If you’ve previously been using a non-atomic refreshes this won’t be a new problem, of course). The indexes will have up to date statistics, of course, because they will have been created after the table insert.

The big opportunity, of course, is to change a very expensive atomic refresh into a much cheaper out of place refresh – in some special cases. My client had to use the atomic_refresh=>true option in 11g because they couldn’t afford to leave the table truncated (empty) for the few minutes it took to rebuild; but they might be okay using the out_of_place => true with atomic_refresh=>false in 12c because:

  • the period when something might break is brief
  • if something does go wrong the users won’t get wrong (silently missing) results, they’ll an Oracle error (probably ORA-08103: object no longer exists)
  • the application uses this particular materialized view directly (i.e. not through query rewrite), and the query plans are all quick, light-weight indexed access paths
  • most queries will probably run correctly even if they run through the moment of exchange

I don’t think we could guarantee that last statement – and Oracle Corp. may not officially confirm it – and it doesn’t matter how many times I show queries succeeding but it’s true. Thanks to “cross-DDL read-consistency” as it was called in 8i when partition-exchange appeared and because the old objects still exist in the data files, provided your query doesn’t hit a block that has been overwritten by a new object, or request a space management block that was zero-ed out on the “drop” a running query can keep on using the old location for an object after it has been replaced by a newer version. If you want to make the mechanism as safe as possible you can help – put each relevant materialized view (along with its indexes) into its own tablespace so that the only thing that is going to overwrite an earlier version of the view is the stuff you create on the next refresh.

 

January 26, 2015

In-memory DB

Filed under: 12c,in-memory,Infrastructure,Oracle — Jonathan Lewis @ 8:18 am GMT Jan 26,2015

A recent thread on the OTN database forum supplied some code that seemed to show that In-memory DB made no difference to performance when compared with the traditional row-store mechanism and asked why not.  (It looked as if the answer was that almost all the time for the tests was spent returning the 3M row result set to the SQL*Plus client 15 rows at a time.)

The responses on the thread led to the question:  Why would the in-memory (column-store) database be faster than simply having the (row-store) data fully cached in the buffer cache ?

Maria Colgan has addressed this question in part 3 of her series on In-Memory Database (see catalogue below), but I thought I’d repeat the basic ideas with a few imaginative numbers thrown in to give a better flavour of what’s going on. So imagine you have a table sized at 100GB, with 100 columns of data where every column holds data of a similar size and pattern; and want to execute a query of the form: select {list of columns} from big_table where colX >=  {some constant}.

Traditional Tablescan (approximation) with table fully cached

For each block of the 100GB, Oracle has to acquire the “cache buffers chains” latch, pin the block, drop the latch, and scan the block testing each row, then acquire the latch, unpin the block, and drop the latch.  Scanning the block requires a walk through the row directory and, for each row pointer, jumping to the correct location in the block for the row, stepping along the row one column at a time to get to the correct row, and then checking the column  value. If the column matches the predicate extract, format and return the required columns from that row.

It’s a lot of memory to scan, in a large number of small steps, involving a lot of latching and pinning – which translates into a lot of CPU. On the plus side, although it’s very expensive to identify the required rows, it’s very cheap to construct and return a row once you’ve identified it.

In-memory scan (approximation)

  1. Given the way I’ve described the table (100GB, 100 similar columns), Oracle can recreate it in memory as 100 lists of 1GB each; so we can identify the rows we want by scanning one of those lists and applying the predicate – so only 1GB of (fairly contigious) memory to scan, rather than 100GB, and virtually no latching and pinning to find that memory, and no jumping around following pointers and counting along rows.
  2. But it’s probably NOT 1GB of memory to scan, because Oracle has some compression/deduplication methods that it can apply to the data that often reduces the memory requirement by a factor of four of five – so perhaps it’s only 250 MB of memory to scan.
  3. But Oracle breaks large lists into chunks, so rather than 250MB of contiguous memory, it’s likely to be 250 chunks of 1MB; and as part of the description of each chunk Oracle records the lowest and highest value in the chunk; [ed:  Christian Antognini says that the metadata list every distinct value for the chunk] so it can check the predicate against the boundary values on the chunk and be in a position to discard entire chunks without looking at their content. So, depending on the actual content and pattern of the data, we may examine only a handful of chunks in detail, dropping the scan from 250MB to, perhaps, 10MB.
  4. And we still haven’t finished there, because the code that handles the column-store is able to take advantage of the SIMD (Single Instruction,  Multiple Data) operations in the CPU to check the predicate against 4, or possibly even 8, values simultaneously at a speed matching a single column comparison that has to be used for the traditional cached row-store. (Given the way that Oracle  (probably) handles the compression, I suspect that this final advantage is only significant for range-based predicates – but that’s purely conjectural).

So the huge benefit you get from In-Memory column store, compared to “fully cached row-store”  is that you are likely to scan far less memory to identify the rows that match your predicate, and do it with far less “infrastructure” activity like latching and pinning. The potential saving in CPU usage is huge.

There is, of course, a penalty to pay. As you identify the rows of interest you can (in effect) construct a bitmap representing the position of those rows in the table (and if you have predicates on more than 1 column you can use bitmap operations on the individual column bitmaps to identify the rows you want in the final result) but then you have to construct the row that goes into the result set. If your query is interested in just 5 columns that means using the bitmap to locate the correct entry from each of 5 separate column lists; if your query is interested in 99 column that means extracting the correct entry from each of 99 separate column lists. Identifying the rows you want can be very  quick, building the final result may be relatively slow.

Soundbite summary

  • Using the In-memory Database, you can identify the rows you want very quickly but it’s relatively slow to reconstruct them.
  • Using a fully cached traditional row-store, it’s relatively slow to identify the rows you want, but once you’ve found them you spend no time reconstructing them.

Bear in mind that this is an extremely simplified analysis and ignores all sorts of details about read-consistency, the probability of physical reads, the probability of scanning blocks instead of scanning chunks, and so on; my focus is only on the type of activity that differentiates row-store handling from column-store handling when all the data is in memory so that you can have some appreciation of why the benefits available from In-memory DB can vary with the pattern of the data and the way you use it.

Catalogue of blog posts by Maria Colgan and Andy Rivenes:

Other articles on In-memory DB:

 

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 5,934 other followers