Oracle Scratchpad

January 26, 2016

Trace file size

Filed under: 12c,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 8:30 am BST Jan 26,2016

Here’s a convenient enhancement for tracing that came up on Twitter a few days ago – first in a tweet that I retweeted, then in a question from Christian Antognini based on this bit of the 12c Oracle documentation (opens in separate tab). The question was – does it work for you ?

The new description for max_dump_file_size says that for large enough values Oracle will split the file into multiple chunks of a few megabytes, using a suffix to identify the sequence of the chunks, keeping only the first chunk and the most recent chunks. Unfortunately this doesn’t seem to be true. However, prompted by Chris’ question I ran a quick query against the full parameter list looking for parameters with the word “trace” in their name:


select
        /*+
                leading(nam val val2)
                full(name)
                full(val)  use_hash(val)  no_swap_join_inputs(val)
                full(val2) use_hash(val2) no_swap_join_inputs(val2)
        */
        nam.ksppinm                             name,
        val.ksppstvl                            ses_val,
        val2.ksppstvl                           sys_val,
        nam.ksppdesc                            description,
        nam.indx+1                              numb,
        nam.ksppity                             type,
        val.ksppstdf                            is_def,
        decode(bitand(nam.ksppiflg/256,1),
                1,'True',
                  'False'
        )                                       ses_mod,
        decode(bitand(nam.ksppiflg/65536,3),
                1,'Immediate',
                2,'Deferred' ,
                3,'Immediate',
                  'False'
        )                                       sys_mod,
        decode(bitand(val.ksppstvf,7),
                1,'Modified',
                4,'System Modified',
                  'False'
        )                                       is_mod,
        decode(bitand(val.ksppstvf,2),
                2,'True',
                  'False'
        )                                       is_adj,
        val.ksppstcmnt                          notes
from
        x$ksppi         nam,
        x$ksppcv        val,
        x$ksppsv        val2
where
        nam.indx = val.indx
and     val2.indx = val.indx
and     ksppinm like '%&m_search.%'
order by
        nam.ksppinm
;

Glancing through the result I spotted a couple of interesting parameters with the letters “uts” in their names, so re-ran my query looking for all the “uts” parameters, getting the following (edited) list:


NAME                           SYS_VAL         DESCRIPTION    
------------------------------ --------------- ---------------------------------------------
_diag_uts_control              0               UTS control parameter
_uts_first_segment_retain      TRUE            Should we retain the first trace segment
_uts_first_segment_size        0               Maximum size (in bytes) of first segments 
_uts_trace_disk_threshold      0               Trace disk threshold parameter
_uts_trace_segment_size        0               Maximum size (in bytes) of a trace segment
_uts_trace_segments            5               Maximum number of trace segments 

Note particularly the “first segment size” and “trace segment size” – defaulting to zero (which often means a hidden internal setting, though that doesn’t seem to be the case here, but maybe that’s what the “diag control” is for). I haven’t investigated all the effects, but after a little experimentation I found that all I needed to do to get the behaviour attributed to max_dump_file_size was to set the following two parameters – which I could do at the session level.


alter session set "_uts_first_segment_size" = 5242880;
alter session set "_uts_trace_segment_size" = 5242880;

The minimum value for these parameters is the one I’ve shown above (5120 KB) and with the default value for _uts_trace_segments you will get a maximum of 5 trace files with sequential names like the following:

ls -ltr *4901*.trc

-rw-r----- 1 oracle oinstall 5243099 Jan 26 08:15 orcl_ora_4901_1.trc
-rw-r----- 1 oracle oinstall 5243064 Jan 26 08:15 orcl_ora_4901_12.trc
-rw-r----- 1 oracle oinstall 5243058 Jan 26 08:15 orcl_ora_4901_13.trc
-rw-r----- 1 oracle oinstall 5242993 Jan 26 08:15 orcl_ora_4901_14.trc
-rw-r----- 1 oracle oinstall 1363680 Jan 26 08:15 orcl_ora_4901.trc

As you can see I’m currently generating my 15th trace, and Oracle has kept the first one and the previous three. It’s always working on a file with no suffix to its name but as soon as that file hits its limiting size (plus or minus a few bytes) it gets its appropriate suffix, the oldest file is deleted, and a new trace file without a suffix is started.

Apart from the usual header information the trace files start and end with lines like:

*** TRACE CONTINUED FROM FILE /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_11.trc ***
  
*** TRACE SEGMENT RENAMED TO /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_4901_12.trc ***

There is one little trap to watch out for: if you set either of these parameters to be larger than max_dump_file_size tracing stops as soon as one of the segments hits the max_dump_file_size and that trace file ends with the usual “overflow” message – e.g, when I changed the max_dump_file_size to 4M in mid-session:

*** DUMP FILE SIZE IS LIMITED TO 4194304 BYTES ***

In my case I had started with max_dump_file_size set to 20M, so I got lucky with my choice of 5M as the segment size.

Further investigation is left as an exercise to the interested reader.

 

January 24, 2016

Semijoin_driver

Filed under: bitmaps,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 11:42 am BST Jan 24,2016

Here’s one of those odd little tricks that (a) may help in a couple of very special cases and (b) may show up at some future date – or maybe it already does – in the optimizer if it is recognised as a solution to a more popular problem. It’s about an apparent restriction on how the optimizer uses the BITMAP MERGE operation, and to demonstrate a very simple case I’ll start with a data set with just one bitmap index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rpad('x',10,'x')        small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create bitmap index t1_b1 on t1(n1);

select index_name, leaf_blocks, num_rows from user_indexes;

/*
INDEX_NAME           LEAF_BLOCKS   NUM_ROWS
-------------------- ----------- ----------
T1_B1                        500       1000
*/

Realistically we don’t expect to use a single bitmap index to access data from a large table, usually we expect to have queries that give the optimizer the option to choose and combine several bitmap indexes (possibly driving through dimension tables first) to reduce the target row set in the table to a cost-effective level.

In this example, though, I’ve created a column data set that many people might view as “inappropriate” as the target for a bitmap index – in one million rows I have one thousand distinct values, it’s not a “low cardinality” column – but, as Richard Foote (among others) has often had to point out, it’s wrong to think that bitmap indexes are only suitable for columns with a very small number of distinct values. Moreover, it’s the only index on the table, so no chance of combining bitmaps.

Another thing to notice about my data set is that the n1 column has been generated by the mod() function; because of this the column cycles through the 1,000 values I’ve allowed for it, and this means that the rows for any given value are scattered widely across the table, but it also means that if I find a row with the value X in it then there could well be a row with the value X+4 (say) in the same block.

I’ve reported the statistics from user_indexes at the end of the sample code. This shows you that the index holds 1,000 “rows” – i.e. each key value requires only one bitmap entry to cover the whole table, with two rows per leaf block.  (By comparison, a B-tree index oon the column was 2,077 leaf block uncompressed, or 1,538 leaf blocks when compressed).

So here’s the query I want to play with, followed by the run-time execution plan with stats (in this case from a 12.1.0.2 instance):


alter session set statistics_level = all;

select
        /*+
                qb_name(main)
        */
        max(small_vc)
from
        t1
where
        n1 in (1,5)
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last outline'));

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |       |      1 |        |      1 |00:00:00.03 |    2006 |      4 |
|   1 |  SORT AGGREGATE                       |       |      1 |      1 |      1 |00:00:00.03 |    2006 |      4 |
|   2 |   INLIST ITERATOR                     |       |      1 |        |   2000 |00:00:00.03 |    2006 |      4 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      2 |   2000 |   2000 |00:00:00.02 |    2006 |      4 |
|   4 |     BITMAP CONVERSION TO ROWIDS       |       |      2 |        |   2000 |00:00:00.01 |       6 |      4 |
|*  5 |      BITMAP INDEX SINGLE VALUE        | T1_B1 |      2 |        |      2 |00:00:00.01 |       6 |      4 |
------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access(("N1"=1 OR "N1"=5))

The query is selecting 2,000 rows from the table, for n1 = 1 and n1 = 5, and the plan shows us that the optimizer probes the bitmap index twice (operation 5), once for each value, fetching all the rows for n1 = 1, then fetching all the rows for n1 = 5. This entails 2,000 buffer gets. However, we know that for every row where n1 = 1 there is another row nearby (probably in the same block) where n1 = 5 – it would be nice if we could pick up the 1 and the 5 at the same time and do less work.

Technically the optimizer has the necessary facility to do this – it’s known as the BITMAP MERGE – Oracle can read two or more entries from a bitmap index, superimpose the bits (effectively a BITMAP OR), then convert to rowids and visit the table. Unfortunately there are cases (and it seems to be only the simple cases) where this doesn’t appear to be allowed even when we – the users – can see that it might be a very effective strategy. So can we make it happen – and since I’ve asked the question you know that the answer is almost sure to be yes.

Here’s an alternate (messier) SQL statement that achieves the same result:


select
        /*+
                qb_name(main)
                semijoin_driver(@subq)
        */
        max(small_vc)
from
        t1
where
        n1 in (
                select /*+ qb_name(subq) */
                        *
                from    (
                        select 1 from dual
                        union all
                        select 5 from dual
                        )
        )
;

-----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |      1 |00:00:00.02 |    1074 |       |       |          |
|   1 |  SORT AGGREGATE                      |       |      1 |      1 |      1 |00:00:00.02 |    1074 |       |       |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |   2000 |   2000 |00:00:00.02 |    1074 |       |       |          |
|   3 |    BITMAP CONVERSION TO ROWIDS       |       |      1 |        |   2000 |00:00:00.01 |       6 |       |       |          |
|   4 |     BITMAP MERGE                     |       |      1 |        |      1 |00:00:00.01 |       6 |  1024K|   512K| 8192  (0)|
|   5 |      BITMAP KEY ITERATION            |       |      1 |        |      2 |00:00:00.01 |       6 |       |       |          |
|   6 |       VIEW                           |       |      1 |      2 |      2 |00:00:00.01 |       0 |       |       |          |
|   7 |        UNION-ALL                     |       |      1 |        |      2 |00:00:00.01 |       0 |       |       |          |
|   8 |         FAST DUAL                    |       |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |         FAST DUAL                    |       |      1 |      1 |      1 |00:00:00.01 |       0 |       |       |          |
|* 10 |       BITMAP INDEX RANGE SCAN        | T1_B1 |      2 |        |      2 |00:00:00.01 |       6 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  10 - access("N1"="from$_subquery$_002"."1")

Key points from this plan – and I’ll comment on the SQL in a moment: The number of buffer visits is roughly halved (In many cases we picked up two rows as we visited each buffer); operation 4 shows us that we did a BITMAP MERGE, and we can see in operations 5 to 10 that we did a BITMAP KEY ITERATION (which is a bit like a nested loop join – “for each row returned by child 1 (operation 6) we executed child 2 (operation 10)”) to probe the index twice and get two strings of bits that operation 4 could merge before operation 3 converted to rowids.

For a clearer picture of how we visit the table, here are the first few rows and last few rows from a version of the two queries where we simply select the ID column rather than aggregating on the small_vc column:

select  id from ...

Original query structure
         1
      1001
      2001
      3001
...
    997005
    998005
    999005

2000 rows selected.

Modified query structure:

         1
         5
      1001
      1005
      2001
      2005
...
    998001
    998005
    999001
    999005
    
2000 rows selected.

As you can see, one query returns all the n1 = 1 rows then all the n1 = 5 rows while the other query alternates as it walks through the merged bitmap. You may recall the Exadata indexing problem (now addressed, of course) from a few years back where the order in which rows were visited after a (B-tree) index range scan made a big difference to performance. This is the same type of issue – when the optimizer’s default plan gets the right data in the wrong order we may be able to find ways of modifying the SQL to visit the data in a more efficient order. In this case we save only fractions of a second because all the data is buffered, but it’s possible that in a production environment with much larger tables many, or all, of the re-visits could turn into physical reads.

Coming back to the SQL, the key to the re-write is to turn my IN-list into a subquery, and then tell the optimizer to use that subquery as a “semijoin driver”. This is essentially the mechanism used by the Star Tranformation, where the optimizer rewrites a simple join so that each dimension table (typically) appears twice, first as an IN subquery driving the bitmap selection then as a “joinback”. But (according to the manuals) a star transformation requires at least two dimension tables to be involved in a join to the central fact table – and that may be why the semi-join approach is not considered in this (and slightly more complex) cases.

 

 

My reference: bitmap_merge.sql, star_hack3.sql

January 18, 2016

Drop Column

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 8:14 am BST Jan 18,2016

I published a note on AllthingsOracle a few days ago discussing the options for dropping a column from an existing table. In a little teaser to a future article I pointed out that dropping columns DOESN’T reclaim space; or rather, probably doesn’t, and even if it did you probably won’t like the way it does it.

I will  be writing about “massive deletes” for AllthingsOracle in the near future, but I thought I’d expand on the comment about not reclaiming space straight away. The key point is this – when you drop a column you are probably dropping a small fraction of each row. (Obviously there are some extreme variants on the idea – for example, you might have decided to move a large varchar2() to a separate table with shared primary key).

If you’ve dropped a small fraction of each row you’ve freed up a small fraction of each block, which probably means the block hasn’t been identified as having available free space for inserts. In many cases this is probably  a good thing – because it’s quite likely the if every block in your table is suddenly labelled as having sufficient free space for new row then you could end up with a difficult and ongoing performance problem.

Many large tables have a “time-based” component to their usage – as time passes the most recently entered rows are the ones that get most usage, and older rows are no longer accessed; this means you get a performance benefit from caching because the most useful fractions of such tables are often well cached and the “interesting” data is fairly well clustered.

In a case like this, imagine what will happen if EVERY block in your table suddenly acquires enough free space to accept a couple of new rows – over the next few days the incoming data will be spread across the entire length of the table, and for the next couple of months, or years, you will have to keep the entire table cached in memory if the performance is to stay constant; moreover the clustering_factor of the most useful indexes is likely to jump from “quite small” to “absolutely massive”, and the optimizer will start changing lots of plans because it will decide that your favourite indexes are probably much to expensive to user.

I am, of course, painting a very grim picture – but it is a possible scenario that should be considered before you drop a column from a table. Combined with my observations about the locking and overheads of dropping a column you might (probably ought to) decide that you should never drop a column you should only mark it as unused or (better still if you’re on 12c) mark it invisible for a while before marking it unused. You can worry about space reclamation at a later date when you considered all the ramifications of how it might impact on performance.

Footnote: If you’re still using freelist management then dropping a column won’t put a block on the freelist until the total used space in the block falls below the value dictated by pctused (default 40%); if you’re using ASSM then the block doesn’t become available for reuse until (by default) the free space exceeds 25% of the block’s usable space.

 

 

January 11, 2016

Subquery Effects

Filed under: Bugs,CBO,Execution plans,Oracle,subqueries — Jonathan Lewis @ 12:50 pm BST Jan 11,2016

Towards the end of last year I used a query with a couple of “constant” subqueries as a focal point for a blog note on reading parallel execution plans. One of the comments on that note raised a question about cardinality estimates and, coincidentally, I received an email about the cost calculations for a similar query a few days later.

Unfortunately there are all sorts of anomalies, special cases, and changes that show up across versions when subqueries come into play – it’s only in recent versions of 11.2, for example, that a very simple example I’ve got of three equivalent statements that produce the same execution plan report the same costs and cardinality. (The queries are:  table with IN subquery, table with EXISTS subquery, table joined to “manually unnested” subquery – the three plans take the unnested subquery shape.)

I’m just going to pick out one particular anomaly, which is a costing error with multiple subqueries when “OR-ed”. Here’s my sample data set:


create table t1
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 20000
;


create table t2
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 25000
;

create table t3
nologging
as
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from dual
connect by
        level <= 30000
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t1',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t2',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'t3',
                method_opt       => 'for all columns size 1'
        );
end;
/

The three tables are slightly different sizes so that it will be easy to see different costs of tablescans, and there are no indexes to everything I do in the queries will be tablescans. Here are six queries I’m going to test – they all scan t1, with “constant” subqueries against t2 and/or t3. The first pair is just to show you the basic cost of the query with a single subquery, the second pair shows you the default action with two subqueries in two different orders, the final pair shows you what happens with two subqueries when you block subquery pushing.


select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n2 > (select avg(t2.n2) from t2)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     t1.n3 > (select avg(t3.n3) from t3)
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select avg(t2.n2) from t2)
         or t1.n3 > (select avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select avg(t3.n3) from t3)
         or t1.n2 > (select avg(t2.n2) from t2)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
         or t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
        )
;

select
        max(t1.n1)
from
        t1
where
        t1.n1 > 10000
and     (
            t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
         or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
        )
;

Here are the first two plans, pulled from memory (which you might have guessed thanks to the “disappearing subquery predicate” in the predicate section. These examples came from 12.1.0.2, but the same happens in 11.2.0.4:


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N2">))

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   123 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    10 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   500 |  5000 |    49   (3)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND "T1"."N3">))

As you can see, the cost of the query is the cost of the t1 tablescan plus the cost of running the t2 or t3 subquery once: 111 = 49 + 62, and 123 = 49 + 74.

(As a general guideline, recent versions of the optimizer tend to allow for subqueries by including “cost of subquery” * “number of times the optimizer thinks it will execute” – in this case the optimizer knows that the subquery will run exactly once).

But what happens when we test the query that applies BOTH subqueries to the tablescan ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N2"> OR "T1"."N3">)))


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |    50 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   TABLE ACCESS FULL  | T1   |   975 | 14625 |    50   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">10000 AND ("T1"."N3"> OR "T1"."N2">)))

The cost of the query in both cases is just the cost of the tablescan of t1 – the subqueries are, apparently, free. You can check from the predicate section, by the way, that the subqueries are applied in the order they appear in original statement.

Does anything change if the subqueries are not pushed ?


-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   111 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N2"> OR "T1"."N3">))
   3 - filter("T1"."N1">10000)

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   124 (100)|          |
|   1 |  SORT AGGREGATE      |      |     1 |    15 |            |          |
|*  2 |   FILTER             |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   4 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   5 |     TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   6 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |     TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N3"> OR "T1"."N2">))
   3 - filter("T1"."N1">10000)

The two plans have different costs – and the cost is the cost of the tablescan of t1 plus the cost of just the first subquery in the filter predciate list.

The non-pushed subqueries show up another anomaly: you’ll notice that the t1 tablescan reports 10,001 rows cardinality, but the FILTER operation doesn’t have an associated cardinality so we can’t see how many rows the optimizer thinks will survive the subqueries. So let’s run a query that allows us to see the surviving row estimate:


select
        max(n1)
from
        (
        select
                /*+ no_eliminate_oby */
                t1.n1
        from
                t1
        where
                t1.n1 > 10000
        and     (
                   t1.n3 > (select /*+ no_push_subq */ avg(t3.n3) from t3)
                or t1.n2 > (select /*+ no_push_subq */ avg(t2.n2) from t2)
                )
        order by
                n1
        )
;

-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |       |       |   126 (100)|          |
|   1 |  SORT AGGREGATE        |      |     1 |    13 |            |          |
|   2 |   VIEW                 |      | 10001 |   126K|   126   (5)| 00:00:01 |
|   3 |    SORT ORDER BY       |      | 10001 |   146K|   126   (5)| 00:00:01 |
|*  4 |     FILTER             |      |       |       |            |          |
|*  5 |      TABLE ACCESS FULL | T1   | 10001 |   146K|    50   (4)| 00:00:01 |
|   6 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   7 |       TABLE ACCESS FULL| T3   | 30000 |   146K|    74   (3)| 00:00:01 |
|   8 |      SORT AGGREGATE    |      |     1 |     5 |            |          |
|   9 |       TABLE ACCESS FULL| T2   | 25000 |   122K|    62   (4)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter(("T1"."N3"> OR "T1"."N2">))
   5 - filter("T1"."N1">10000)

As you can see, the SORT ORDER BY operation thinks it’s going to handle 10,0001 rows – it looks as if the optimizer arithmetic hasn’t applied the usual subquery guess of 5% for the two subqueries. (When the subqueries were automatically pushed you saw a cardinality of 975 – which is 5% for subquery t2 plus (due to OR) 5% for subquery t3 minus 5% of 5% (=25) for the overlap – this is the standard OR arithmetic)

tl;dr

Although the optimizer code has been enhanced in many places for dealing with subquery estimates, but there are still some odd errors and inconsistencies that you need to be aware of. The examples I’ve shown may not be particularly significant in terms of what they do, but the pattern is one that you may recognise in more complex queries.

 

Reference script: subq_cost_anomaly_2.sql

 

January 8, 2016

CTEs and Updates

Filed under: Execution plans,Oracle,Subquery Factoring,Tuning — Jonathan Lewis @ 1:01 pm BST Jan 8,2016

An important target of trouble-shooting, particularly when addressing performance problems, is to minimise the time and effort you have to spend to get a “good enough” result. A recent question on the OTN database forum struck me as a good demonstration of following this strategy; the problem featured a correlated update that had to access a view 84 times to update a small table; but the view was a complex view (apparently non-mergeable) and the update took several hours to complete even though the view, when instantiated, held only 63 rows.

The OP told us that the query “select * from view” took seven minutes to return those 63 rows, and wanted to know if we could find a nice way to perform the update in (approximately) that seven minutes, rather than using the correlated update approach that seemed to take something in the ballpark of 7 minutes per row updated.

Of course the OP could have given us all the details of the view definition, all the table and index definitions, with stats etc. and asked us if we could make the update run faster – but that could lead to a long and frustrating period of experimentation and testing, and a solution that might increase the general maintenance costs of the system (because a subsequent modification to the view might then have to be echoed into the code that did the update). Setting a strictly limited target that clearly ought to be achievable is (if nothing else) a very good starting point for improving the current situation.

I don’t know (as at the time of writing) if the OP implemented the strategy I suggested, but from his description it looked as if it should have been simple to use subquery factoring with materialization to achieve the required result in the most elegant way possible (meaning, in this case, simple SQL and no change to any surrounding code).

The OP has responded to my suggestion with a comment that “it didn’t work”, but it appeared to me that they were looking at and mis-interpreting the output from a call to “Explain Plan” rather than testing the query and pulling the plan from memory – so I thought I’d build a simple model to demonstrate the principle and show you how you could confirm (beyond just checking the clock) that the strategy had worked.

We start with a table to update, a non-mergeable view, and two tables to make up the non-mergeable view:


create table t1
as
select
        trunc((rownum-1)/15)    n1,
        trunc((rownum-1)/15)    n2,
        rpad(rownum,180)        v1
from
        dual
connect by
        level <= 3000
;


create table t2
as
select
        mod(rownum,200)         n1,
        mod(rownum,200)         n2,
        rpad(rownum,180)        v1
from
        dual
connect by
        level <= 3000;
create index t1_i1 on t1(n1);
create index t2_i1 on t2(n1);

begin
        dbms_stats.gather_table_stats(
                user,
                't1',
                method_opt => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                user,
                't2',
                method_opt => 'for all columns size 1'
        );
end;
/

create or replace view v1
as
select distinct
        t1.n1 t1n1, t1.n2 t1n2, t2.n2 t2n2
from
        t1, t2
where
        t1.n1 = t2.n1
;

create table t3
as
select * from v1
;

begin
        dbms_stats.gather_table_stats(
                user,
                't3',
                method_opt => 'for all columns size 1'
        );
end;
/

I’ve created the table t3 by copying the content of the view v1 and I’m going to update every row in t3 from v1; I gathered stats on t1 and t2 before creating the view and table simply to avoid the need for Oracle to do dynamic sampling as it created t3. Depending on your version of Oracle, of course, the stats collections might be redundant.

Having set the scene with the data, here’s the “original” code for doing the required update, followed by its execution plan (pulled from the memory of a 12.1.0.2 instance):


set serveroutput off
set linesize 180
set trimspool on

alter session set statistics_level = all;

spool cte_update

update t3
        set t2n2 = (
                select  v1.t2n2
                from    v1
                where   v1.t1n1 = t3.t1n1
                and     v1.t1n2 = t3.t1n2
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));

---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT                         |       |      1 |        |      0 |00:00:01.22 |   46745 |       |       |          |
|   1 |  UPDATE                                  | T3    |      1 |        |      0 |00:00:01.22 |   46745 |       |       |          |
|   2 |   TABLE ACCESS FULL                      | T3    |      1 |    200 |    200 |00:00:00.01 |       3 |       |       |          |
|   3 |   VIEW                                   | V1    |    200 |      1 |    200 |00:00:01.22 |   46332 |       |       |          |
|   4 |    SORT UNIQUE                           |       |    200 |      1 |    200 |00:00:01.21 |   46332 |  2048 |  2048 | 2048  (0)|
|   5 |     NESTED LOOPS                         |       |    200 |      1 |  45000 |00:00:01.11 |   46332 |       |       |          |
|   6 |      NESTED LOOPS                        |       |    200 |      1 |  45000 |00:00:00.34 |    1332 |       |       |          |
|*  7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T1    |    200 |      1 |   3000 |00:00:00.02 |     684 |       |       |          |
|*  8 |        INDEX RANGE SCAN                  | T1_I1 |    200 |     15 |   3000 |00:00:00.01 |     408 |       |       |          |
|*  9 |       INDEX RANGE SCAN                   | T2_I1 |   3000 |      1 |  45000 |00:00:00.11 |     648 |       |       |          |
|  10 |      TABLE ACCESS BY INDEX ROWID         | T2    |  45000 |      1 |  45000 |00:00:00.31 |   45000 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - filter("T1"."N2"=:B1)
   8 - access("T1"."N1"=:B1)
   9 - access("T2"."N1"=:B1)
       filter("T1"."N1"="T2"."N1")

Points to note from this execution plan: the VIEW operation at line 3 has started 200 times (there are 200 rows in table t3, the subquery runs once per row); and a simple measure of work done is the 46,745 buffer visits (of which, I can tell you, roughly 400 are current block gets) reported under Buffers in the top line of the plan.

It’s an interesting detail that although Oracle has pushed the correlation predicates inside the view (as shown by the predicate section for operations 7,8 and 9) it doesn’t report the operation at line 3 as “VIEW PUSHED PREDICATE”. It would be nice to see the explicit announcement of predicate pushing here, but that seems to be an expression reserved for pushing join predicates into views – fortunately we always check the predicate section, don’t we!

Now let’s see what the SQL and plan look like if we want Oracle to create the entire v1 result set and use that to update the t3 table.

update t3 
        set t2n2 = (
                with v0 as (
                        select
                                /*+ materialize */
                                t1n1, t1n2, t2n2
                        from v1
                )
                select
                        t2n2
                from
                        v0
                where   v0.t1n1 = t3.t1n1
                and     v0.t1n2 = t3.t1n2
        )
;

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | UPDATE STATEMENT            |                            |      1 |        |      0 |00:00:00.19 |    1185 |      1 |      1 |       |       |          |
|   1 |  UPDATE                     | T3                         |      1 |        |      0 |00:00:00.19 |    1185 |      1 |      1 |       |       |          |
|   2 |   TABLE ACCESS FULL         | T3                         |      1 |    200 |    200 |00:00:00.01 |       3 |      0 |      0 |       |       |          |
|   3 |   TEMP TABLE TRANSFORMATION |                            |    200 |        |    200 |00:00:00.18 |     778 |      1 |      1 |       |       |          |
|   4 |    LOAD AS SELECT           |                            |      1 |        |      0 |00:00:00.01 |     171 |      0 |      1 |  1040K|  1040K|          |
|   5 |     VIEW                    | V1                         |      1 |  45000 |    200 |00:00:00.01 |     168 |      0 |      0 |       |       |          |
|   6 |      HASH UNIQUE            |                            |      1 |  45000 |    200 |00:00:00.01 |     168 |      0 |      0 |  1558K|  1558K| 3034K (0)|
|*  7 |       HASH JOIN             |                            |      1 |  45000 |  45000 |00:00:00.01 |     168 |      0 |      0 |  1969K|  1969K| 1642K (0)|
|   8 |        TABLE ACCESS FULL    | T1                         |      1 |   3000 |   3000 |00:00:00.01 |      84 |      0 |      0 |       |       |          |
|   9 |        TABLE ACCESS FULL    | T2                         |      1 |   3000 |   3000 |00:00:00.01 |      84 |      0 |      0 |       |       |          |
|* 10 |    VIEW                     |                            |    200 |  45000 |    200 |00:00:00.17 |     603 |      1 |      0 |       |       |          |
|  11 |     TABLE ACCESS FULL       | SYS_TEMP_0FD9D6618_911FB4C |    200 |  45000 |  40000 |00:00:00.08 |     603 |      1 |      0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   7 - access("T1"."N1"="T2"."N1")
  10 - filter(("V0"."T1N1"=:B1 AND "V0"."T1N2"=:B2))

The headline figure to note is that 1,185 Buffer visits – clearly we’ve done something very different (and possibly cheaper and faster, even in this tiny demonstration). Looking at operation 3 we see the “TEMP TABLE TRANSFORMATION”, which tells us that we’ve materialized our factored subquery. There is scope, though, for a little ambiguity and uncertainty – the Starts column for this operation says we started it 200 times, once for each row in t3. We might worry that we’ve actually recreated the result and written it to disc 200 times even though we might then notice that lines 4 – 9 tell us that we loaded the temporary table just once (Starts = 1).

You could take my word for it that we didn’t “do” the temp table transformation 200 time, we merely used the result of the temp table transformation 200 times; but I wasn’t prepared to make this assumption until I had done a little more checking, so there’s no reason why you shouldn’t still be a little suspicious. Lines 4 – 9 do seem to tell us (consistently) that we only load the data once, but there have been occasional bugs where counters have been reset to zero when they shouldn’t have been, so the fact that we see (for example, at operation 8) “1 full tablescan of t1 returning 3,000 rows after visiting 84 buffers” may mean that Oracle counted the work once and “forgot” to count it the other 199 times.

It’s easy enough to do a quick cross-check. Take a snapshot of v$mystat joined to v$statname before and after runnning the query, and check the difference in buffer visits, tablescans, and tablescan rows gotten – if those figures are broadly consistent with the figures in the execution plan I think we can be reasonably confident that the plan is telling us the truth.

Here’s what we get for a few key figures:

Name                                       Value
----                                       -----
session logical reads                      1,472
db block gets                                412
consistent gets                            1,060
consistent gets from cache                 1,060
db block changes                             410
table scans (short tables)                   205
table scan rows gotten                    46,213
table scan blocks gotten                     366

There are a number of oddities – not to mention version and feature dependent variations – in the numbers and a couple of discrepancies introduced by the code I was using to take the snapshot, but the “table scan rows gotten” figure is particularly easy to see in the execution plan:

46,213 = 3000 (t1) + 3000 (t2) + 200 (t3) + 200 * 200 (temp table)

With a small error the number of “table scans (short tables)” is also consistent with the plan Starts – and that’s perhaps the most important indicator, we scan t1 and t2 just once, and the temp table result 200 times. If we were creating the temp table 200 times we’d have to have done over 400 table scans (200 each for t1 and t2).

I won’t go into the details of how to compare the session logical I/O to the total Buffer gets for the plan – but the figures are in the right ballpark as far as matching is concerned – if the plan was deceiving us about the number of times the temporary table was created (rather than used) the session stats would have to report a figure more like 33,600 (200 * (84 + 84)) consistent gets.

Conclusion

We have managed to reduce the workload from “one view instantiation per row” to “one view instantiation” with a very small change to the SQL. In the case of the OP this should result in a small, easily comprehensible, change in the SQL statement leading to a drop in run-time from several hours to seven minutes – and maybe that’s good enough for the present.

Reference Script: cte_update.sql

 

January 6, 2016

NLS Mess

Filed under: Bugs,CBO,Execution plans,Function based indexes,Indexing,Oracle — Jonathan Lewis @ 1:18 pm BST Jan 6,2016

The Oracle database has all sorts of little details built into it to help it deal with multi-national companies, but since they’re not commonly used you can find all sorts of odd “buggy” bits of behaviour when you start to look closely. I have to put “buggy” in quotes because some of the reported oddities are the inevitable consequences of (for example) how multi-byte character sets have to work; but some of the oddities look as if they simply wouldn’t be there if the programmer writing the relevant bit of code had remembered that they also had to cater for some NLS feature.

Here’s an example of the type of unexpected behaviour that can appear. There probably are some bugs in the area I’m going to demonstrate but, at first glance, I thought I was looking at an acceptable limitation imposed by a generic requirement. The example came from AskTom. which is why the data set isn’t my usual “t1” generation (and the formatting and capitalisation isn’t according to my usual standards).

The problem involves Case Insensitive indexing.


ALTER session SET nls_sort=binary_ci;
ALTER session SET nls_comp=linguistic;

CREATE TABLE log_data(
  account_id NUMBER,
  log_type NUMBER,
  sys_name VARCHAR2(30),
  log_time TIMESTAMP,
  msg varchar2(4000)
)
nologging
;

insert /*+ append */ into log_data(
  account_id,
  log_type,
  sys_name,
  log_time,
  msg
)
select
        5,
        2,
        dbms_random.string('a',1),
        sysdate + dbms_random.value,
        rpad('x',200)
from
        dual
connect by
        level <= 26000
;


create index log_date on log_data (
        account_id, 
        log_type, 
--      sys_name,
        NLSSORT(sys_name,'NLS_SORT=BINARY_CI'),
        log_time
)
nologging
;
  
rem     ======================================================================
rem     Need to gather stats AFTER index creation because of the hidden column
rem     ======================================================================
  
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'LOG_DATA',
                method_opt       => 'for all columns size 1'
        );
end;
/

And here’s the query I want to optimize:


SELECT 
        *
FROM
  (
    SELECT
        sys_name, log_time,  substr(msg,1,40) msg
    FROM log_data
    WHERE
      account_id=5
      AND log_type=2
      AND sys_name='a'
    ORDER BY
      log_time  desc
  )
WHERE
  rownum <= 10
;

The requirement of the query is that we see the ten most recent entries for a given combination of account_id, log_type and sys_name (ignoring case in sys_name). The orginal table has tens of millions of rows, of course, with many combinations, and some of the combinations have a very large number of entries hence the desire to find an access path that gets just the 10 rows we want without getting all the rows for a combination and sorting them before returning the ten.

Normally we would just create an index that started with the 3 columns used in the equality and ending with the column in the order by clause, and that would be enough for the optimizer to see the option for a “sort order by nosort” operation to get the required data through an index range scan; so that’s the index the code sample creates, except that since we’ve enabled case insensitive sorting we need to use a function-based index to hold the case-insensitive version of sys_name.

Here’s the execution plan we would get if we DIDN’T use the nlssort() function in the index – I’ve run the query in 11.2.0.4 and pulled the plan from memory with rowsource execution stats enabled:


---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |   605 (100)|     10 |00:00:00.02 |    1065 |       |       |          |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.02 |    1065 |       |       |          |
|   2 |   VIEW                         |          |      1 |    500 |   605   (1)|     10 |00:00:00.02 |    1065 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY       |          |      1 |    500 |   605   (1)|     10 |00:00:00.02 |    1065 |  2048 |  2048 | 2048  (0)|
|   4 |     TABLE ACCESS BY INDEX ROWID| LOG_DATA |      1 |    500 |   603   (1)|    966 |00:00:00.01 |    1065 |       |       |          |
|*  5 |      INDEX RANGE SCAN          | LOG_DATE |      1 |    500 |   103   (3)|    966 |00:00:00.01 |     100 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   3 - filter(ROWNUM<=10)
   5 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2)
       filter(NLSSORT("SYS_NAME",'nls_sort=''BINARY_CI''')=HEXTORAW('6100') )

Notice particularly the filter predicate at operation 5: that’s the thing we need to get into the index before we can avoid picking up excess data and sorting it. Notice also in the A-Rows column that we acquired 966 rows from the table before sorting and discarding all but 10 of them at operation 3.

Notice especially how important it is to look at the predicate section of an execution plan to gain a full understanding of what’s happening.

So here’s the execution plan we get by default with the function-based index in place:


----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |      1 |        |    13 (100)|     10 |00:00:00.01 |     969 |       |       |          |
|*  1 |  COUNT STOPKEY                  |          |      1 |        |            |     10 |00:00:00.01 |     969 |       |       |          |
|   2 |   VIEW                          |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |     969 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY        |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |     969 |  2048 |  2048 | 2048  (0)|
|   4 |     TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|    966 |00:00:00.01 |     969 |       |       |          |
|*  5 |      INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|    966 |00:00:00.01 |       5 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   3 - filter(ROWNUM<=10)
   5 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

It didn’t work ! (Check the A-Rows at operations 4 and 5, and the sort that we didn’t want at operation 3 where the data is finally reduced to 10 rows.

But there’s something odd going on here – look at the predicate section: our three predicates are all access predicates for the index range scan descending. We are doing exactly what we want to do with the index, but we’re not stopping after the 10 rows that we need, we’re getting all of them (in the order we want) and then doing a trivial sort and discard. Look at the Cost column – the cost at operation 4 is exactly what we might expect for the 10 rows we want to see, and the E-rows at line 5 is clearly based on our “first 10 rows” requirement.

This raises two questions:

  1. What’s gone wrong ?
  2. Can we work around the problem ?

The answer to (1) is, I think, that there’s a bug in the code. Looking at the 10053 trace file I can see the optimizer correctly handling the arithmetic of the virtual column (the sys_nc000006$) representing the function in the index and then getting to the point where it goes into a code section relating to “Recost for ORDER BY”, and brings back the original function as a filter predicate – I think that in the recosting it may be losing track of the fact that sys_nc000006$ and nlssort(sys_name, ‘nls_sort=binary_ci’) are the same thing and therefore can’t apply the rule about “Equality on 1st N columns, order by on the remainder”.

There are several answers to (2).

Workarounds

The honest hack

The first one is simply to fall back to the old (probably version 7, possibly version 8) requirement for getting the “sort order by nosort” operation – put all the index columns into the order by clause. Unfortunately the optimizer then did a tablescan rather than an index range scan because my data set was so small, so I had to hack the system stats temporarily to make the tablescan very expensive:


begin
        dbms_stats.set_system_stats('MBRC',2);
        dbms_stats.set_system_stats('MREADTIM',20); 
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',1000); 
end;
/

... order by account_id desc, log_type desc, sys_name desc, lot_time desc

Unfortunately the optimizer still went wrong – it did an ASCENDING index range scan sorting all the data. I actually had to hint the code to use the index in descending order to get the following execution plan:


------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |  1215 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |   1000 |  1215   (1)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |  1006   (1)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |   1000 |     5   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

The A-Rows tells us we’ve accessed the minimum data set, and the absence of the SORT ORDER BY STOPKEY operation tells us that we’ve avoided doing the sort. Notice, though that the cost is the cost that would have been appropriate if we have accessed all 1,000 rows that matched the equality predicates. This is an example of a plan that you couldn’t really trust if all you had done was an “explain plan” rather than running the query and checking the rowsource execution stats. If you ignore the A-Rows it looks as if the plan WOULD get all the data in order and only eliminate the redundant rows at operation 1.

The silly surprise

The original author of the problem came up with this one. Put in two predicates which, between them are equivalent to the original requirement:


where ...
and     sys_name >= 'a'
and     sys_name <= 'a'

Clearly this is totally silly – the optimizer can fold this pair of predicates into the single predicate “sys_name = ‘a'”, so it shouldn’t make any difference. But here’s the execution plan:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |    13 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "LOG_DATA"."SYS_NC00006$"=HEXTORAW('6100') )

Yes, it’s (structurally) exactly the same plan, with exactly the same predicate section except that (a) it gets there without being hinted, (b) the Cost column looks appropriate all down the line, and (c) the E-Rows value for the VIEW operator would have helped us appreciate that the correct elimination was (probably) going to happen if all we had done was the Explain Plan.

The dirty hack

I know the name of the hidden column that’s causing the problem, and I know how to generate the value it has to be – so let’s give Oracle exactly what it needs to see rather than allowing its internal transformation to rewrite the SQL:

...
AND sys_nc00006$ = nlssort('a','nls_sort=binary_ci')
...


------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |      1 |        |    13 (100)|     10 |00:00:00.01 |      13 |
|*  1 |  COUNT STOPKEY                 |          |      1 |        |            |     10 |00:00:00.01 |      13 |
|   2 |   VIEW                         |          |      1 |     11 |    13   (0)|     10 |00:00:00.01 |      13 |
|   3 |    TABLE ACCESS BY INDEX ROWID | LOG_DATA |      1 |   1000 |    13   (0)|     10 |00:00:00.01 |      13 |
|*  4 |     INDEX RANGE SCAN DESCENDING| LOG_DATE |      1 |     11 |     2   (0)|     10 |00:00:00.01 |       3 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM<=10)
   4 - access("ACCOUNT_ID"=5 AND "LOG_TYPE"=2 AND "SYS_NC00006$"=HEXTORAW('6100') )

We get exactly the plan we need – and the silly thing about this example is that it’s a case where we get the plan we want by EXPLICITLY transforming the SQL to reproduce the transformation that Oracle had done IMPLICITLY and then messed up !

Final Choice
Of the three options – the dirty hack is definitely a no-no in production; the “double the predicate” trock is undesirable because it may depend in some unexpected way on a particular optimizer bug or on some statistical detail that could change; so I’d choose the hinted path with the (nominally) redundant columns.

One final point about this solution, we actually needed to include only the sys_name in the order by clause to use the descending range scan and early stop – which is basically another indication that it’s something about the function-based column that is breaking the normal code path.

Reference Script: nls_sort_anomaly.sql

January 4, 2016

ANSI bug

Filed under: ANSI Standard,Bugs,Oracle — Jonathan Lewis @ 1:12 pm BST Jan 4,2016

In almost all cases the SQL you write using the ANSI (SQL-92) standard syntax is tranformed into a statement using Oracle’s original syntax before being optimised – and there are still odd cases where the translation is not ideal.  This can result in poor performance, it can result in wrong results. The following examples arrived in my in-tray a couple of weeks ago:

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
            from table1
            left join table2 on table1.my_number = table2.my_number
            group by table1.my_number
        );


with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select sum(table3.table2.my_number) the_answer
    from table1
    left join table2 on table1.my_number = table2.my_number
    group by table1.my_number;

Notice the reference to table3.table2.my_number in the select list of both queries – where does the “table3” bit come from ? These queries should result in Oracle error ORA-00904: “TABLE3″.”TABLE2″.”MY_NUMBER”: invalid identifier.

If you’re running 11.2.0.4 (and, probably, earlier versions) both queries produce the following result:


THE_ANSWER
----------
         1

1 row selected.

If you’re running 12.1.0.2 the first query produces the ORA-00904 error that it should do, but the second query still survives to produce the same result as 11.2.0.4.

Update (8th July 2016)

I’ve just seen a reference to this blog note on the OTN database forum that prompted me to point out that the problem doesn’t occur if you switch to traditional Oracle syntax – the expected Oracle error message appears in 11.2.0.4:

SQL> ed temp

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
/*
            from table1
            left join table2 on table1.my_number = table2.my_number
*/
            from table1, table2
            where table2.my_number(+) = table1.my_number
            group by table1.my_number
        )
 15  ;
        select sum(table3.table2.my_number) the_answer
                   *
ERROR at line 6:
ORA-00904: "TABLE3"."TABLE2"."MY_NUMBER": invalid identifier


SQL> ed temp3

with
    table1 as ( select 1 my_number from dual ),
    table2 as ( select 1 my_number from dual )
select *
    from (
        select sum(table3.table2.my_number) the_answer
--            from table1
--            left join table2 on table1.my_number = table2.my_number
            from table1, table2
            where table2.my_number(+) = table1.my_number
            group by table1.my_number
        )
 13  ;
        select sum(table3.table2.my_number) the_answer
                   *
ERROR at line 6:
ORA-00904: "TABLE3"."TABLE2"."MY_NUMBER": invalid identifier

January 1, 2016

Expert

Filed under: Oracle,Philosophy — Jonathan Lewis @ 1:02 pm BST Jan 1,2016

I was sent the following email a few years ago. It’s a question that comes up fairly frequently and there’s no good answer to it but, unusually, I made an attempt to produce a response; and I’ve decided that I’d start this year by presenting the question and quoting the answer I gave so here, with no editing is the question:

I’m disturbing you for some help about becoming an Oracle master expert. Probably you are getting this kind of emails a lot but I would be appreciate if you give a small answer to me at least.

First, shortly I want to introduce my self. I’m an *Oracle Trainer* in Turkey Oracle University for 2 years. Almost for 4 years, I worked as software engineer and meet with Oracle on these days. After a while I decided to develop myself in Oracle database technologies and become trainer as i said. I also give consultancy services about SQL / PLSQL development and especially* SQL / PLSQL tuning*. I really dedicate myself to these subjects. As a trainer I also give DBA workshop lectures but in fact I didnt actually did dba job in a production system. I have the concept and even read everything I found about it but always feel inadequate because didnt worked as a DBA on a production system. So many DBA’s has taken my class and they were really satisfied (they have got all answers for their questions) but I did not. I’m a good trainger (with more that 97 average points in oracle evaluations) but I want to be best.

Even in SQL / PLSQL tuning, I know that I am really good at it but I also aware that there are some levels and I can not pass through the next level. for ex: I can examine execution plan (index structures, access paths etc), find cpu and io consumption using hierarchical profiler and solve the problem but can’t understand yet how to understand how much IO consumed by query and understand slow segments. if you remember, for a few days ago, on OTN you answered a question that I involved about sequence caching and Log file sync event. There, I said that sequence can cause to log file sync event (and as you said that was true) but when someone else write a simple code and couldnt see this event, I couldnt answer to him, you did (you said that it was because optimizing).

that is the level what i want to be. I am really working on this and age on 29. but whatever I do I cant get higher. I need a guideness about that. I even worked free for a while (extra times after my job here). I need your guideness, as I said I can work with you if you want to test and I want to learn more advanced topics while working. In Turkey, I couldn’t find people who can answer my questions so I can not ask for guideness to them.

And my (impromptu, and unedited) reply:

Thank you for your email. You are correct, I do get a lot of email like this, and most of it gets a stock response; but yours was one of the most intelligently written so I’ve decided to spend a little time giving you a personal answer.

Even if you were to spend a few years as a DBA, you would probably not become the sort of expert you want to be. Most DBAs end up dealing with databases that, for want of a better word, we could call “boring”; for a database to be interesting and show you the sorts of problems where you have to be able to answer the types of question I regularly answer you probably need to be the DBA for a large banking or telecoms system – preferably one that hasn’t been designed very well – that has to handle a very large volume of data very quickly. On these extreme systems you might find that you keep running into boundary conditions in Oracle that force you to investigate problems in great detail and learn all sorts of strange things very quickly. On most other systems you might run into a strange problem very occasionally and spend several years on the job without once being forced to solve any difficult problems very quickly.

If you want to become an expert, you need to be a consultant so you get to see a lot of problems on lots of different systems in a very short time; but you can’t really become a consultant until you’re an expert. As a substitute, then, you need to take advantage of the problems that people report on the OTN database forum – but that doesn’t mean just answering questions on OTN. Look for the problems which people have described reasonably well that make you think “why would that happen”, then try to build a model of the problem that has been described and look very closely at all the statistics and wait events that change as you modify the model. Creating models, and experimenting with models, is how you learn more.

Take, for example, the business of the sequences and pl/sql – you might run the test as supplied with SQL_trace enabled to see what that showed you, you could look very carefully at the session stats for the test and note the number of redo entries, user commits, and transactions reported; you could look at the statistics of enqueue gets and enqueue releases, ultimately you might dump the redo log file to see what’s going into it. Many of the tiny little details I casually report come from one or two days of intense effort studying an unexpected phenomenon.  (The log file sync one was the result of such a study about 15 years ago.)

 Happen new year to all my readers.

December 29, 2015

Column Groups

Filed under: extended stats,Oracle,Statistics,Tuning — Jonathan Lewis @ 1:13 pm BST Dec 29,2015

I think the “column group” variant of extended stats is a wonderful addition to the Oracle code base, but there’s a very important detail about using the feature that I hadn’t really noticed until a question came up on the OTN database forum recently about a very bad join cardinality estimate.

The point is this: if you have a multi-column equality join and the optimizer needs some help to get a better estimate of join cardinality then column group statistics may help if you create matching stats at both ends of the join. There is a variation on this directive that helps to explain why I hadn’t noticed it before – multi-column indexes (with exactly the correct columns) have the same effect and, most significantly, the combination of  one column group and a matching multi-column index will do the trick.

Here’s some code to demonstrate the effect:

create table t8
as
select
        trunc((rownum-1)/125)   n1,
        trunc((rownum-1)/125)   n2,
        rpad(rownum,180)        v1
from
        all_objects
where
        rownum <= 1000
;

create table t10
as
select
        trunc((rownum-1)/100)   n1,
        trunc((rownum-1)/100)   n2,
        rpad(rownum,180)        v1
from
        all_objects
where
        rownum <= 1000
;
begin
        dbms_stats.gather_table_stats(
                user,
                't8',
                method_opt => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                user,
                't10',
                method_opt => 'for all columns size 1'
        );
end;
/

set autotrace traceonly

select
        t8.v1, t10.v1
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

set autotrace off

Table t8 has eight distinct values for n1 and n2, and 8 combinations (though the optimizer will assume there are 64 combinations); table t10 has ten distinct values for n1 and n2, and ten combinations (though the optimizer will assume there are 100 combinations). In the absence of any column group stats (or histograms, or indexes) and with no filter predicates on either table, the join cardinality will be “{Cartesian Join cardinality} * {join selectivity}”, and in the absence of any nulls the join selectivity – thanks to the “multi-column sanity check” – will be 1/(greater number of distinct combinations). So we get 1,000,000 / 100 = 10,000.

Here’s the output from autotrace in 11.2.0.4 to prove the point:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 10000 |  3652K|    11  (10)| 00:00:01 |
|*  1 |  HASH JOIN         |      | 10000 |  3652K|    11  (10)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        835  consistent gets
          0  physical reads
          0  redo size
   19965481  bytes sent via SQL*Net to client
      73849  bytes received via SQL*Net from client
       6668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     100000  rows processed

As you can see, the query actually returns 100,000 rows. The estimate of 10,000 is badly wrong thanks to the correlation between the n1 and n2 columns. So let’s check the effect of creating a column group on t10:


begin
        dbms_stats.gather_table_stats(
                user,
                't10',
                method_opt => 'for all columns size 1 for columns (n1,n2) size 1'
        );
end;
/

At this point you might think that the optimizer’s sanity check might say something like: t8 table: 64 combinations, t10 table column group 10 combinations so use the 64 which is now the greater num_distinct. It doesn’t – maybe it will in some future version, but at present the optimizer code doesn’t seem to recognize this as a possibility. (I won’t bother to reprint the unchanged execution plan.)

But, at this point, I could create an index on t8(n1,n2) and run the query again:


create index t8_i1 on t8(n1, n2);

select
        t8.v1, t10.v1
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

Index created.


100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 216880280

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|    35M|    12  (17)| 00:00:01 |
|*  1 |  HASH JOIN         |      |   100K|    35M|    12  (17)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")

Alternatively I could create a column group at the t8 table:



drop index t8_i1;

begin
        dbms_stats.gather_table_stats(
                user,
                't8',
                method_opt => 'for all columns size 1 for columns (n1,n2) size 1'
        );
end;
/

select  
        t8.v1, t10.v1 
from
        t8,t10
where
        t10.n1 = t8.n1
and     t10.n2 = t8.n2
/

Index dropped.


PL/SQL procedure successfully completed.


100000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 216880280

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |   100K|    35M|    12  (17)| 00:00:01 |
|*  1 |  HASH JOIN         |      |   100K|    35M|    12  (17)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T8   |  1000 |   182K|     5   (0)| 00:00:01 |
|   3 |   TABLE ACCESS FULL| T10  |  1000 |   182K|     5   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T10"."N1"="T8"."N1" AND "T10"."N2"="T8"."N2")


If you’re wondering why I’ve not picked up this “both ends” detail in the past – it’s because I’ve usually been talking about replacing indexes with column groups and my examples have probably started with indexes at both end of the join before I replaced one index with a column group. (The other examples I’ve given of column groups are typically about single-table access rather than joins.)

 

December 23, 2015

Testing

Filed under: Oracle,Philosophy — Jonathan Lewis @ 12:56 pm BST Dec 23,2015

This post is a 100% copy of a message that Tanel Poder sent to the Oracle-L mailing list in response to a thread about the performance of SSD. It’s not just a good answer to the question, it’s a wonderfully succinct insight into how to think about what you’re really testing and it displays the mind-set that should be adopted by everyone.

If you measure write performance on an idle Exadata machine without any other load going on, you are not comparing flash vs disk, you are comparing flash vs the battery-backed 512MB RAM cache in the “RAID” controllers within each storage cell!

This is how the “disk” that’s supposed to have a couple of milliseconds of average latency (it still rotates and needs to seek + calibrate to next track even in sequential writes) gives you sub-millisecond write latencies… it’s not the disk write, it’s the controller’s RAM write that gets acknowledged.

And now when you run a real workload on the machine (lots of random IOs on the disk and Smart Scans hammering them too), your disk writes won’t be always acknowledged by the controller RAM cache. When comparing *busy* flash disks to *busy* spinning disks vs. *idle* flash disks vs *idle* spinning disks (with non-dirty write cache) you will get different results.

So, I’m not arguing here that flash is somehow faster for sequential writes than a bunch of disks when talking about throughput. But if you care about latency (of your commits) you need to be aware of everything else that will be going on on these disks (and account for this in your benchmarks).

Without queueing time included, a busy flash device will “seek” where needed and perform the write in under a millisecond, a busy disk device in 6-10 milliseconds. So your commits will end up having to wait for longer (yes, your throughput will be ok due to the LGWR writing multiple transactions redo out in a single write, but this doesn’t change the fact that individual commit latency suffers).

This latency issue of course will be mitigated when you are using a decent storage array with enough (well-managed) write cache.

So I’d say there are the following things you can compare (and need to be aware of which hardware are you really benchmarking):

1) Flash storage
2) Disk storage without (write) cache
3) Disk storage with crappy (write) cache
4) Disk storage with lots of well-managed & isolated (write) cache

And the second thing to be aware of:

1) Are you the single user on an idle storage array
2) Are you just one of the many users in a heavily utilized (and randomly seeking) storage array

So, as usual, run a realistic workload and test it out yourself (if you have the hardware :)

December 22, 2015

Predicates

Filed under: Execution plans,Oracle,Tuning — Jonathan Lewis @ 12:58 pm BST Dec 22,2015

I received an email recently that started with the sort of opening sentence that I see far more often than I want to:

I have come across an interesting scenario that I would like to run by you, for your opinion.

It’s not that I object to being sent interesting scenarios, it’s just that they are rarely interesting – and this wasn’t one of those rare interesting ones. On the plus side it reminded me that I hadn’t vented one of my popular rants for some time.

Here’s the problem – see if you can work out the error before you get to the rant:

“I’ve got a table and a view on that table; and I’ve got a query that is supposed to use the view. Whether I use the table or the view in query the optimizer uses the primary key on the table to access the table – but when I use the table the query takes about 30 ms, when I use the view the query takes about 903 ms”.

The email included a stripped-down version of the problem (which I’ve stripped even further) – so score some brownie points on that one.  Here, in order, are the table, the view, and two variations of the query:


create table table_a (
	col_1  varchar2(20)	not null,
	col_2  number(10)	not null,
	col_3  varchar2(20)	not null,
	col_4  varchar2(100)
);

insert /*+ append */ into table_a
select
	lpad(mod(rownum-1,1000),10), mod(rownum-1,1000), lpad(rownum,20), rpad(rownum,100)
from
	all_objects
where
	rownum <= 10000
;
commit; 

alter table table_a add constraint ta_pk primary key(col_1, col_2, col_3); 
execute dbms_stats.gather_table_stats(user,'table_a',method_opt=>'for all columns size 1')

create or replace view view_a (
	col1,
	col2,
	col3,
	col4
)
as
select 
	col_1 as col1,
	cast(col_2 as number(9)) as col2,
	col_3 as col3,
	col_4 as col4
from
	table_a
;


variable b1 varchar2(10)
variable b2 number

exec :b1 := lpad(0,10)
exec :b2 := 0

select /*+ index(table_a) tracking_t2 */
	 *
from	table_a
where 
	col_1 = :b1
and	col_2 = :b2
;

select /*+ index(view_a.table_a) tracking_v2 */
	*
from	view_a
where 
	col1 = :b1
and	col2 = :b2
;

Question 1 (for no points): Why would there be a difference (though very small in this example) in performance ?

Question 2 (for a virtual pat on the head): What did the author of the email not do that made him think this was an interesting problem ?

Just to muddy the water for those who need a hint (that’s a hint hint, not an Oracle hint) – here are the two execution plans reprted from v$sql in version 12.1.0.2:


SQL_ID  514syc2mcb1wp, child number 0
-------------------------------------
select /*+ index(table_a) tracking_t2 */   * from table_a where  col_1
= :b1 and col_2 = :b2

Plan hash value: 3313752691

---------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |         |      1 |        |     10 |00:00:00.01 |      13 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TABLE_A |      1 |      1 |     10 |00:00:00.01 |      13 |
|*  2 |   INDEX RANGE SCAN                  | TA_PK   |      1 |      1 |     10 |00:00:00.01 |       3 |
---------------------------------------------------------------------------------------------------------


SQL_ID  ck0y3v9833wrh, child number 0
-------------------------------------
select /*+ index(view_a.table_a) tracking_v2 */  * from view_a where
col1 = :b1 and col2 = :b2

Plan hash value: 3313752691

---------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |         |      1 |        |     10 |00:00:00.01 |      13 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| TABLE_A |      1 |      1 |     10 |00:00:00.01 |      13 |
|*  2 |   INDEX RANGE SCAN                  | TA_PK   |      1 |      1 |     10 |00:00:00.01 |       3 |
---------------------------------------------------------------------------------------------------------

I’ve even shown you the Plan Hash Values for the two queries so you can check that the execution plans were the same.

So what have I just NOT done in my attempt to make it harder for you to understand what is going on ?

Give yourself a pat on the head if you’ve been thinking “Where’s the predicate section for these plans ?”  (9 years old today).

Here are the two predicate sections (in the same order as the plans above):


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("COL_1"=:B1 AND "COL_2"=:B2)


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("COL_1"=:B1)
       filter(CAST("COL_2" AS number(9))=:B2)

Notice how the optimizer can use both predicates to probe the index when we query the table but, thanks to the function applied to the column in the view, can only probe the index on the first column of the view and has to check every index entry for the first input value to see of the result of the cast matches the second input value. The size of the range scan in the second case could be much larger than the size of the range scan in the first case – the difference in performance could simply be a reflection that col_1 is very repetitive with many different values of col_2 for every value of col_1.

Interesting

While the problem itself isn’t interesting – it does raise a couple of points worth mentioning (and I’m not going to ask why the view has that surprising cast() in it – but if pushed I could invent a reason)

First, what steps have been taken to ensure that a query against the view won’t crash with Oracle error 1438:

SQL> insert into table_a values(:b1, 1e9,'x','x');

1 row created.

SQL> select * from view_a where col1 = :b1;
ERROR:
ORA-01438: value larger than specified precision allowed for this column

Possibly there’s a check constraint on the column restricting it to values that can survive the cast to number(9).

Secondly, it’s often possible to use constraints or virtual columns (or both together) that allow the optimizer to get clever with expression substitution and come up with optimal execution plans even when there are traps like this put in the way. In this case I couldn’t manage to make the usual tricks work. Possibly the only way to get the hoped-for performance is to create a second index on (col_1, cast(col_2) as number(9), col_3).

December 21, 2015

Parallel Plans

Filed under: 12c,Oracle,Parallel Execution — Jonathan Lewis @ 8:40 am BST Dec 21,2015

This article was prompted by a pair of articles by Yasin Baskan of Oracle Corporation: PX Server Sets etc. and Multiple Parallelizers, plus a little extra prompting from a mistake that I made when reading the second of those two articles. The fact that I made a mistake is significant because, without it, I wouldn’t have created a model to check Yasin’s description of the parallel activity.

I want to examine the following query to find out the order of activity:


select
        distinct t1.n1
from
        t1
where 
        t1.n1 >= 500
and     t1.n2 > (select avg(t2.n2) from t2)
and     t1.n3 > (select avg(t3.n3) from t3)
;

I’m going to be using 12.1.0.2 for this demonstration (lots of changes appear in parallel execution as you move from 11g to 12c, so the version is most significant), with all default settings for the parallel-related parameters.

To get things started, here’s the code to generate the data that I used, and the serial execution plan:


create table t1 nologging as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t2 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t3 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4 ;

-- don't need to gather stats as they are generated on CTAS in 12c.
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   150 (100)|          |
|   1 |  HASH UNIQUE         |      |    24 |   336 |   150   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T1   |    25 |   350 |    51   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 20000 |    97K|    49   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 20000 |    97K|    49   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">=500 AND "T1"."N2"> AND "T1"."N3">))

I’ve pulled the execution plan from memory using dbms_xplan.display_cursor(), which is why the predicate section shows the odd comparison of “t2.n2 > {nothing there}” and “t3.n3 > {nothing there}”. The predicate section produced by a simple call to explain plan would help us to fill in the details:


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."N1">=500 AND "T1"."N2"> (SELECT AVG("T2"."N2") FROM
              "T2" "T2") AND "T1"."N3"> (SELECT AVG("T3"."N3") FROM "T3" "T3"))

We can infer from the predicate section that Oracle executes the t2 subquery first and the t3 subquery second; but there is a slight oddity in the shape of the execution plan if we apply the “first child first / recursive descent” approach to reading the plan. It looks as if operations 3 and 5 are children of line 2 and should be run before child 2. This is the consequence of “subquery pushing”. In fact the subqueries are both filter subqueries and, in principle (and ignoring scalar subquery caching for the moment), they will be run once for each row produced from the tablescan of operation 2.

The activity is as follows:

  • start scanning table t1
  • when the first row appears where n1 >= 500 run the subquery against t2 to derive the average of t2.n2 (10000.5) and check to see if the n2 value of the current t1 row exceeds that value (it doesn’t);
  • continue scanning t1 checking for rows where n1 >= 500 and n2 >= 10000.5 (no need to rerun the subquery thanks to scalar subquery caching),
  • on finding the first row matching both predicates run the subquery against t3 to derive the average of t3.n3 (10000.5) and check to see if the n3 value of the current t1 row exceeds that value.
  • continue the scan of t1 checking for rows where n1 >= 500 and n2 >= 10000.5 and n3 >= 10000.5 (again scalar subquery caching means no further executions of either subquery).

We can add weight to the argument that this is the order of activity by flushing the buffer cache and enabled the extended trace before running the query.  Here’s a section of a trace file that demonstrates the point – with no edits beyond the insertion of a few blank lines:


PARSE #140345652175312:c=1000,e=1813,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=115251305,tim=134216063372
EXEC #140345652175312:c=0,e=62,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=115251305,tim=134216063544
WAIT #140345652175312: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=134216063569
WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=131 blocks=1 obj#=133974 tim=134216063627
WAIT #140345652175312: nam='db file scattered read' ela= 315 file#=5 block#=132 blocks=112 obj#=133974 tim=134216064187

WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=515 blocks=1 obj#=133975 tim=134216064438
WAIT #140345652175312: nam='db file scattered read' ela= 115 file#=5 block#=516 blocks=112 obj#=133975 tim=134216064732
WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=628 blocks=12 obj#=133975 tim=134216066046
WAIT #140345652175312: nam='db file scattered read' ela= 123 file#=5 block#=642 blocks=112 obj#=133975 tim=134216066548
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=754 blocks=14 obj#=133975 tim=134216067760
WAIT #140345652175312: nam='db file scattered read' ela= 124 file#=5 block#=770 blocks=112 obj#=133975 tim=134216068153
WAIT #140345652175312: nam='db file scattered read' ela= 8 file#=5 block#=882 blocks=2 obj#=133975 tim=134216069677

WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=244 blocks=12 obj#=133974 tim=134216070049
WAIT #140345652175312: nam='db file scattered read' ela= 127 file#=5 block#=258 blocks=112 obj#=133974 tim=134216070382

WAIT #140345652175312: nam='db file sequential read' ela= 6 file#=5 block#=899 blocks=1 obj#=133976 tim=134216070812
WAIT #140345652175312: nam='db file scattered read' ela= 41 file#=5 block#=900 blocks=112 obj#=133976 tim=134216071028
WAIT #140345652175312: nam='db file scattered read' ela= 42 file#=5 block#=1012 blocks=12 obj#=133976 tim=134216072956
WAIT #140345652175312: nam='db file scattered read' ela= 215 file#=5 block#=1026 blocks=112 obj#=133976 tim=134216073642
WAIT #140345652175312: nam='db file scattered read' ela= 29 file#=5 block#=1138 blocks=14 obj#=133976 tim=134216076334
WAIT #140345652175312: nam='db file scattered read' ela= 154 file#=5 block#=1154 blocks=112 obj#=133976 tim=134216076825
WAIT #140345652175312: nam='db file scattered read' ela= 14 file#=5 block#=1266 blocks=2 obj#=133976 tim=134216081673

WAIT #140345652175312: nam='db file scattered read' ela= 24 file#=5 block#=370 blocks=14 obj#=133974 tim=134216082815
WAIT #140345652175312: nam='db file scattered read' ela= 144 file#=5 block#=386 blocks=112 obj#=133974 tim=134216083232
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=498 blocks=12 obj#=133974 tim=134216084494

FETCH #140345652175312:c=16998,e=21096,p=1105,cr=1111,cu=0,mis=0,r=1,dep=0,og=1,plh=115251305,tim=134216084683

Object 133974 is table t1, object 133975 is table t2, and object 133976 is table t3.

As you can see, we start by scanning t1, then we scan t2 once, then we scan more of t1, then we scan t3 once, then we finish off scanning t1.

We could chase the detail a little further, of course – we could check the session statistics to make sure that we haven’t been using the buffer cache to scan t2 and t3 repeatedly, or we could enable rowsource execution statistics to report the number of times each operation started, or we could check the STAT# lines dumped a little later in the trace files and infer that the t2 and t3 tablescans happened just once each.

At this point I’ll mention my mistake: when I originally read Yasin’s version of the parallel equivalent of this plan my first thought was that the subquery sections of the plan operate from the bottom upwards (i.e. another one of those examples where “first child first” doesn’t apply) whereas Yasin described them as running top down. I was wrong – I was thinking of the case where the optimizer analyses the query from the bottom up, but it still produces a plan that then runs from the top down – but I didn’t realize my mistake until after I’d been poking around with the parallel version of the query and had made a surprising discovery.

Parallel Execution

So what happens when we manipulate this query into parallelism ? To make it a little easier to see some of the details I’ve added hints to each query block to use a different degree of parallelism; and then, because it all happens so fast, I’ve introduced a function call to slow down the progress of the query. Here’s the function I’ve created (you’ll find it elsewhere on blog) and the final query I started testing with:


create or replace function wait_row (
        i_secs  number,
        i_return        number
) return number
parallel_enable
is
begin
        dbms_lock.sleep(i_secs);
        return i_return;
end;
/


select
        /*+ parallel(t1,5) tracking */ 
        distinct t1.n1
from
        t1
where   
        wait_row(0.01,t1.n2) > (select /*+ parallel(t2,3) */ avg(wait_row(0.01,t2.n2)) from t2)
and     wait_row(0.01,t1.n3) > (select /*+ parallel(t3,2) */ avg(wait_row(0.01,t3.n3)) from t3)
and     t1.n1 >= 1000
--      and     wait_row(0.01, t1.n1) >= 500
;

The wait_row() function simply waits for the number of seconds given in the first parameter then returns the (numeric) value of the second parameter. It calls the dbms_lock.sleep() function which has a granularity of 1/100th of a second, and since I’m planning to use it in a parallel query I’ve included the parallel_enable declaration.

Since the function has not been declared deterministic, or assigned to the pl/sql result cache, and since there are 20,000 rows in the t2 table the time to run the t2 subquery will be roughly 20,000 * 0.01 seconds, which (ignoring parallelism) would be about 200 seconds. I’ve actually hinted DOP = 3 for that subquery so I hope it will complete in about 70 seconds, and the subquery against t3 with hinted DOP = 2 should complete in about 100 seconds.

You’ll notice I have two possible predicates against t1 – the one shown uncommented here doesn’t use call the wait_row() function, the other does. Here’s an interesting point, though, about the example that doesn’t call wait_row() – the simple predicate is going to eliminate ALL the data from t1, there are no rows where n1 >= 1000 and that’s an important point to remember.

Here’s the execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |    57 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ30001 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   4 |     PX RECEIVE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   5 |      PX SEND HASH              | :TQ30000 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | P->P | HASH       |
|   6 |       HASH UNIQUE              |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR       |          |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL      | T1       |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWP |            |
|   9 |          SORT AGGREGATE        |          |     1 |     5 |            |          |  Q3,00 | PCWP |            |
|  10 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  11 |            PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|  12 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|  13 |              PX BLOCK ITERATOR |          | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 14 |               TABLE ACCESS FULL| T2       | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  15 |          SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|  16 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  17 |            PX SEND QC (RANDOM) | :TQ20000 |     1 |     5 |            |          |  Q2,00 | P->S | QC (RAND)  |
|  18 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q2,00 | PCWP |            |
|  19 |              PX BLOCK ITERATOR |          | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 20 |               TABLE ACCESS FULL| T3       | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N1">=1000 AND "WAIT_ROW"(.01,"T1"."N2")> AND "WAIT_ROW"(.01,"T1"."N3")>))
  14 - access(:Z>=:Z AND :Z<=:Z) 20 - access(:Z>=:Z AND :Z<=:Z)

The odd “:Z>=:Z AND :Z <=:Z” predicate is an artefact of rowid-range scans in parallel tablescans, and we’ve seen the “disappearing subquery” problem with predicates pulled from memory so that’s ignorable. The interesting point is that the “T1.N1 >= 1000” predicate will eliminate all the data from t1 very quickly – so why did it take roughly 3 minutes and 30 seconds to run the query ?

Quick Answer – Oracle ran the two subqueries before it started the t1 tablescan !

Slow Answer – I need to start by filling in a lot of background information.

Notice how we have multiple DFO Trees in this execution plan. In the TQ column these are captured as (Q1,xx) (Q2,xx) and (Q3,xx).  In previous blogs I’ve explained how you can follow the order of activity by following the table queues (effectively the “xx” – also known as virtual tables, also known as Data Flow Operations) in order, but I haven’t explained how you deal with multiple DFO trees. It would be nice to think that DFO Trees implicitly told you the order of operation – unfortunately that’s not true (at least in current versions of Oracle), and this query demonstrates the point.

Table Queues, DFOs (data flow operations), and DFO Trees

The first of Yasin Baskan’s two articles describes the terminology to use when describing the movement of data through a parallel execution plan, but for the sake of familiarity I’ll walk through the plan above using the appropriate terms.  (Note: once upon a time the term “DFO tree” didn’t exist and the term DFO was used to describe what is now called a DFO tree, and a number of articles I’ve written in the past echo this out-dated usage).

Lines 1 – 8 show a DFO tree consisting of two DFOs – the tree is labelled as Q3. The first DFO in the tree (DFO 0) is described in lines 5 – 8 where one set of parallel execution slaves scans real table t1 and sends a data set to the other set parallel execution slaves by writing into a virtual table (:TQ30000); the second DFO in the tree (DFO 1) is described in lines 1 – 4 where a set of parallel execution slaves reads from virtual table :TQ30000, does some work and sends a data set to the query co-ordinator by writing to virtual table :TQ30001.

Lines 9 – 14 show a second DFO tree consisting of a single DFO – this tree is labelled Q1;

Lines 15 – 20 show us a third DFO tree, labelled Q2, also consisting of a single DFO.

Essentially (as described by Yasin Baskan) the focal point of a single DFO (data flow operation) is a PX SEND, and the top of a DFO tree is a PX Coordinator.

The key to performance problems and interpreting tricky parallel execution plans is the fact that in the extreme case every DFO tree may operate at a different degree of parallelism, and all the DFO trees may be operating concurrently – an unlucky execution plan may acquire far more parallel execution slaves than you expect.

So what’s the order of execution in this query, how do we find out, how many parallel query slaves will we start (and stop) as the query runs, and what’s the maximum number of parallel query slaves that we will be holding concurrently ? There are lots of places you could look:

  • v$pq_tqstat might give you some idea of processes used and data moved – after you’ve run the query, but it’s a bit broken for multiple DFO trees.
  • v$px_sesstat will give you some information about the work done by currently live slave processes if you can query it while your query is still running (“set pause on” may help).
  • v$px_session joined to v$session_event will give you some information about the time lost to waits for currently live slave processes if you can query them while your query is still running (it would be nice if Oracle were to introduce a “v$px_sesevent”)
  • The SQL Monitor screen in OEM (or the dbms_sql_tune.report_sql_monitor() procedure is fantastic (but a little broken) if you’ve paid the licences for the diagnostic and performance packs.

Here’s part of the text output from a call to dbms_sql_tune.report_sql_monitor() for a slightly different query – one that doesn’t include any predicate at all on the t1 table:


SQL Plan Monitoring Details (Plan Hash Value=3828285674)
==================================================================================================================================================
| Id |            Operation             |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                  |          | (Estim) |      | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  1 |   PX COORDINATOR                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  2 |    PX SEND QC (RANDOM)           | :TQ30001 |      49 |   60 |         1 |   +256 |     5 |     1000 |       |          |                 |
|  3 |     HASH UNIQUE                  |          |      49 |   60 |         1 |   +256 |     5 |     1000 |    7M |          |                 |
|  4 |      PX RECEIVE                  |          |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  5 |       PX SEND HASH               | :TQ30000 |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  6 |        HASH UNIQUE               |          |      49 |   60 |        68 |   +189 |     5 |     4506 |    7M |          |                 |
|  7 |         PX BLOCK ITERATOR        |          |      50 |   14 |        68 |   +189 |     5 |    10000 |       |          |                 |
|  8 |          TABLE ACCESS FULL       | T1       |      50 |   14 |        68 |   +189 |    63 |    10000 |       |          |                 |
|  9 |           SORT AGGREGATE         |          |       1 |      |         1 |   +187 |     1 |        1 |       |          |                 |
| 10 |            PX COORDINATOR        |          |         |      |         1 |   +187 |     4 |        3 |       |          |                 |
| 11 |             PX SEND QC (RANDOM)  | :TQ10000 |       1 |      |         3 |   +185 |     3 |        3 |       |          |                 |
| 12 |              SORT AGGREGATE      |          |       1 |      |        74 |   +114 |     3 |        3 |       |          |                 |
| 13 |               PX BLOCK ITERATOR  |          |   20000 |   18 |        74 |   +114 |     3 |    20000 |       |          |                 |
| 14 |                TABLE ACCESS FULL | T2       |   20000 |   18 |        74 |   +114 |    37 |    20000 |       |          |                 |
| 15 |           SORT AGGREGATE         |          |       1 |      |         1 |   +118 |     1 |        1 |       |          |                 |
| 16 |            PX COORDINATOR        |          |         |      |         1 |   +118 |     3 |        2 |       |          |                 |
| 17 |             PX SEND QC (RANDOM)  | :TQ20000 |       1 |      |         2 |   +111 |     2 |        2 |       |          |                 |
| 18 |              SORT AGGREGATE      |          |       1 |      |       111 |     +2 |     2 |        2 |       |          |                 |
| 19 |               PX BLOCK ITERATOR  |          |   20000 |   27 |       111 |     +2 |     2 |    20000 |       |          |                 |
| 20 |                TABLE ACCESS FULL | T3       |   20000 |   27 |       111 |     +2 |    26 |    20000 |       |          |                 |
==================================================================================================================================================

There’s a little inaccuracy in timings – in the order of the odd second or two – probably because of the granularity used, but the columns “Start Active” and “Time Active (s)” are, nevertheless, very revealing. The first one tells you when a line in the plan first did some work, the second one tells you how long that line has been active. As I walk through the plan remember that the timing granularity is fairly crude, so the numbers don’t quite add up as nicely as we might like.

If you check lines 17 – 20 you can see that DFO tree 2 (:TQ2xxxx) was the first thing to go active with its PX slaves starting at time t = 2 seconds and running for 111 seconds to do that scan and initial aggregation; the PX send line started at time t = 111 seconds and ran for 2 seconds to pass the data from the slaves to the coordinator.

There’s a little glitch (or hand waving moment) around lines15 and 16 where the PX Coordinator receives and aggregates data from the PX slaves – the co-ordinator seems to start doing this several seconds later than it should.

In lines 11 – 14 you can see that DFO tree 1 (:TQ1xxxx) was the second thing to go active with its PX slaves starting at time t = 114 (i.e. just after DFO tree 2 completes);  the full scan and initial aggregate ran for 74 seconds (taking us to time t = 188). The PX send started (a little early, this time) at t = 185 and ran for 3 seconds, with the co-ordinator starting at t = 187 and taking one second to receive and do the final aggregation.

In lines 2 – 8 we can see DFO tree 3 (:TQ3xxxx) starting up, but this DFO tree includes two DFOs. One set of slaves scans table t1 applying the predicates to filter the data then does the initial hash distinct before hash distributing the intermediate result to the next set of slaves that finish off the distinct aggregation.

When DFO :TQ30000 starts at t = 189, the previous DFO trees have completed and the PX coordinator has the results of the two subqueries which it passes to the slave set which can now scan and check for: “n1 >= 500 and  n2 > {derived constant} and n3 > {derived constant}”. The scan and initial aggregation takes 68 seconds completing at t = 255, and at t = 256 the results are distributed to the second set of PX slaves. The second set of slaves has virtually nothing to do and reports only 1 second of execution time before passing the data (DFO :TQ30001) to the query coordinator which does the final aggregation and report.

A key point to note is that this timing information tells us that (a) the subqueries both completed before the tablescan of t1 started, and – perhaps more importantly – that we cannot rely on the TQXnnnn numbers to tell us the order of operation of the DFO trees. In this example DFO tree 2 ran first, DFO tree 1 ran second, and DFO tree 3 ran third – some queries might happen to show the names and the order in synch, that would be a coincidence, not a deliberate design feature.

A little reminder, though – even though we cannot rely on the DFO tree names to tell us the ordering in which the DFO trees operate, when we look at the data flow operations inside the trees (i.e. the last few digits of a TQ name) the order of operation withing a given tree matches the sequence indicated by the TQ name.

Counting Slaves.

There’s more in the Monitor report that can help us understand the workload – and in this case I’ve got a lucky coincidence to help, and a little surprise to go with it.  For parallel queries the report includes a summary of parallel execution activity, and this is what I got from my example:

Parallel Execution Details (DOP=5 , Servers Allocated=15)
===========================================================================================================================
|      Name      | Type  | Group# | Server# | Elapsed |   Cpu   | Concurrency | PL/SQL  |  Other   | Buffer | Wait Events |
|                |       |        |         | Time(s) | Time(s) |  Waits(s)   | Time(s) | Waits(s) |  Gets  | (sample #)  |
===========================================================================================================================
| PX Coordinator | QC    |        |         |    0.03 |    0.02 |             |         |     0.02 |     15 |             |
| p00a           | Set 1 |      1 |       1 |    0.27 |    0.27 |             |    0.10 |          |    163 |             |
| p00b           | Set 1 |      1 |       2 |    0.25 |    0.25 |             |    0.09 |          |    156 |             |
| p00c           | Set 1 |      1 |       3 |    0.26 |    0.26 |             |    0.09 |          |    156 |             |

| p00a           | Set 1 |      2 |       1 |    0.43 |    0.43 |             |    0.15 |          |    221 |             |
| p00b           | Set 1 |      2 |       2 |    0.44 |    0.44 |             |    0.15 |          |    221 |             |

| p000           | Set 1 |      3 |       1 |    0.00 |    0.00 |             |         |          |        |             |
| p001           | Set 1 |      3 |       2 |    0.00 |    0.00 |             |         |          |        |             |
| p002           | Set 1 |      3 |       3 |    0.01 |    0.01 |             |         |          |        |             |
| p003           | Set 1 |      3 |       4 |    0.01 |    0.01 |             |         |          |        |             |
| p004           | Set 1 |      3 |       5 |    0.01 |    0.01 |             |         |          |        |             |
| p005           | Set 2 |      3 |       1 |    0.26 |    0.26 |             |    0.09 |          |    113 |             |
| p006           | Set 2 |      3 |       2 |    0.24 |    0.24 |        0.00 |    0.08 |          |    108 |             |
| p007           | Set 2 |      3 |       3 |    0.25 |    0.25 |             |    0.09 |          |    117 |             |
| p008           | Set 2 |      3 |       4 |    0.25 |    0.25 |             |    0.09 |          |    108 |             |
| p009           | Set 2 |      3 |       5 |    0.25 |    0.25 |        0.00 |    0.09 |          |    117 |             |
===========================================================================================================================

I’ve inserted two blank lines breaking the summary down into three separate groups, which you can identify by the heading “Group#”. In this table we see Group 1 has one slave set of three slaves – which corresponds to the tablescan of t2; Group 2 consists of one slave set of two slaves – which corresponds to the tablescan of t3; and Group 1 has two slave sets of 5 slaves each – which correspond to the tablescan and aggregation of t1. The Group numbers appear to align correctly with the DFO tree numbers.

Another detail that stands out from this list is that slaves p00a and p00b are used in the tablescan of t3, and in the tablescan of t2 (where they are joined by slave p00c). In this example Oracle has reused the slaves from one subquery to run the next. However we can also see that the slaves p000p009 that are used for the tablescan of t1 don’t have names that overlap with the slaves used to scan t2 and t3 – which tells use that we have some (in this case two) DFO trees running concurrently.

Another key point in this article is that not only does a DOP of N mean we could run 2*N slaves concurrently in a parallel query, if we have a query that breaks into a plan that uses multiple DFO trees we might 2 * N * {number of DFO trees) allocated and running concurrently. (And, as we see here, it’s even possible that different DFO trees run with different DOPs, leaving Oracle very confused about what to report as the DOP of the query – dbms_xplan actually reported this one as DOP = 4 !)

A final implementation detail that is not necessarily obvious from this table, but which I can infer because I know the history of what was going on: although, as described above, the tablescan of t3 was the first activity that generated a row source, Oracle started up the 10 slaves for the parallel tablescan and aggregate of t1 before it started the two slaves it needed to scan t3. I can infer this because there were no slave processes active when I started running the query, so it is reasonable to assume that the slave names tell me something about the order in which they were allocated. In general you would not be able to notice this in a busy system where slaves were pre-allocated (parallel_min_servers != 0) and constantly being acquired and released. (I was also able to corroborate some of the details above by looking at v$px_stat and v$session_event for the relevant slave processes shortly after the query started.)

Summary

There are a number of key points this example shows us about complex parallel queries:

  • A single query can produce multiple “DFO trees”
  • Each tree can run at a different degree of parallelism
  • The “tree number” (the X in TQXnnnn) may not reveal the order in which the trees are operated
  • The order of operation of the DFOs within a tree is revealed by the nnnn in TQXnnnn
  • It is not easy to see in a basic execution plan which DFO tree might be executing concurrently and which consecutively
  • PX server processes may be acquired by the query co-ordinator a long time before they actually become active

Footnote:

If you happen to have looked closely at the time information in the Parallel Execution Details you’ll note that the times are total rubbish when compared to the monitoring times. It looks as if the time spent in dbms_lock.sleep() has been ignored (perhaps as an “idle” wait).

 

December 10, 2015

Upgrades

Filed under: 12c,Oracle,Upgrades — Jonathan Lewis @ 8:42 am BST Dec 10,2015

One of the questions that came up at the Optimizer Round Table this year was about minimizing the performance-related** hassle of upgrading from 11g to 12c. Dealing with changes in the optimizer is always an an interesting problem but in 12c this is made more challenging because of the automatic dynamic sampling that can introduce a significant amount of extra work at (hard) parse time, then generate SQL Directives, and finally generate extended (column group) statistics the next time you (or the automatic job) collect stats.

Of course one of the traditional strategies for upgrades (especially if you don’t really have a technical need to, but want your Oracle version to be current) is to set the optimizer_features_enable parameter to the older version and hope that this stops most of the new features from appearing, letting other customers identify the problems and raise the SRs that result in the next patch set.

There’s also the option for reading the manuals and white-papers carefully and identifying the new and enhanced optimizer features that might cause you problems if only you can spot them in time in your code. (The enhancement to subquery unnesting I described in my previous post is such an example – do you know of any code in your application that was carefully engineered to maximise the benefit of scalar subquery caching, if so you should check very carefully that 12c isn’t going to transform away your subquery and wreck your performance.)

As a general strategy, though, it’s worth reviewing your code for four generic features – histograms, multi-column joins, predicates that apply a function to a column, and “out-of-range” issues. The first three are features that are likely to make the 12c optimizer decide to do something “dynamic” that may ultimately give you a better execution plan, but may cost you extra resources or result in increased contention at just the wrong moment; the fourth is one that can disrupt the solution to the multi-column problem.

For the purposes of this note I am going to assume that you don’t use partitioned table or that you have already devised a programmatic method for minimising the work while maximising the effectiveness of their stats (and that your programs unlock and relock the stats so that they aren’t re-gathered by accident).

Histograms

I’m assuming by now that you’re using the approximate_ndv method with auto_sample_size to gather object statistics. In 11g this still means that any histograms gathered by Oracle will use a (surprisingly small) sample and even if you are gathering the histograms at the right time they could introduce a lot of instability to execution plans.

In 12c, Oracle can use an approximate method to create a frequency or Top-N frequency histogram while doing the scan that generates all the other stats – so if you have columns where you know the number of distinct values is relatively small (default up to 254, though technically up to 2048) or that almost all the data (except, broadly speaking, one bucket’s worth) comes from a relatively small number of distinct values then, on the upgrade, you should be safe creating a frequency or Top-N histogram once on those columns and setting the method_opt to ‘for all columns size repeat’.

Oracle 12c still uses a small sample size, though, when it detects the need for a “hybrid” histogram – which is the thing that replaces the height-balanced histogram. This means you may still need to write code to generate the histogram rather than allowing Oracle to collect it. It’s the classic compromise problem – a good histogram will be expensive to gather, a cheaply gathered histogram could easily be inaccurate and unstable. In this case you may want to create a Top-N frequency histogram that is a good model of the data, rather than trying to create a correctly structured hybrid. For tables with such columns you need code that can detect changes to the stats and recreates the histogram. In fact, even in cases where Oracle could gather a (frequency) histogram efficiently and accurately, you may still want to detect changes to stats for some columns and write code to create a histogram that describes the data the way it will look when the users are accessing it rather than the way the data looked when Oracle created the histogram (at 2:00 a.m. or whenever the automatic stats collection job hit it).

Broad strategy: set the global preference to method_opt => ‘for all columns size 1’, set a few table preferences to ‘for all columns size repeat’, and have code that checked the last_analyzed date on a few tables and recreates the histogram you want immediately after the stats have been gathered.

Note: as a general guideline, if you know the database needs histograms to produce the most appropriate execution plans this means the front-end code has to co-operate with the database and not conceal useful information in SQL that uses bind variables that hide the significance of special values or ranges – but that’s a topic for another blog note (if I haven’t written anything about it before).

Virtual Columns and Column Groups

In 12c column expressions (where the optimizer guesses) and multi-column predicates (where the optimizer assumes independence) are key triggers to dynamic sampling and SQL Directives. To engage in a pre-emptive strike the ideal is to know the code and know the data. Search for “important” SQL that applies functions (like trunc(date_col)) to columns in predicates or does multi-column (equality) joins or uses filter predicates that reference multiple columns from the same table.

If the execution plans you find for these statements produce estimates which are clearly far from reality then you may need to take pre-emptive action, even (or, perhaps, especially) if the resulting plans look good. The optimizer may (for example) know that it is guessing when it says to itself: “trunc(delivery_date) = trunc(sysdate) will return 1% of the data” and do several things such as create an adaptive execution plan and switch plans in mid-execution, do a load of dynamic sampling to discover the guess was wrong, dump an SQL Plan Directive into the data dictionary that triggers subsequent dynamic sampling and then create a virtual column through the extended stats mechanism.

In cases like this you may want to create and document virtual columns explicitly, and create column groups explicitly before Oracle does its dynamic thing. In the case of the column groups, you only have to worry about equality predicates, a column group cannot be used with range-based predicates. Remember you are only allowed a maximum of 20 column groups per table (although there is, effectively, no limit on the number of virtual columns) so you need to make your choices of columns groups before Oracle goes onto auto-pilot.

Remember, there’s a trap waiting here if you don’t make this effort.  You might run a test suite a couple of times and fix everything that seems to be necessary without realising that in the course of your testing Oracle has created a couple of hundred column groups and virtual columns. Then, when everything is working nicely, you upgrade the production system and all hell breaks loose until the production system has generated the same (we hope) set of virtual columns and column groups. Since there is a limit of 20 column groups per table (or column groups / 10 if you have more than 200 columns) you could get unlucky and find that Oracle wants to create 30 columns groups and the 20 that happen to be created first on production may not be the ones it created on the test system.

Out of Range

When a predicate specifies a value that is above the known high value or below the known low value for a column, the optimizer uses a “linear decay” method for cardinality calculations that is based on how far out of the range the value is (as a percentage of the range). In many cases this makes introduces a slowly increasing error (though for some predicates it can create an immediate catastrophe). There is a special consideration, though,  that means you need to be very careful about time-based or sequence-based columns that can go out of range: if you have a column group that includes the column then the optimizer stops using the column group the moment you go out of range. This is probably a rare condition to check for because you probably use range-based predicates on such columns (e.g. “order_date < trunc(sysdate) – 7”) and column groups are not applicable to range-based predicates anyway, but you do need to be aware that in special cases an execution plan can go bad the moment you overshoot the high value.

For columns like these you need a strategy that allows you to set a high (or low) value that reflects a future high value for the column – even to the extent (for example) of running a scheduler task that calls dbms_stats.set_column_stats() every hour to push the high value forward a little bit.

Update [11th Dec 2015]

Since this note is about preemptive strategies for reducing the number of problems you run into with 12c, it’s worth reminding you about the new “table preference” which you can use to get more realistic values for the clustering_factor on indexes. For example:


begin
        dbms_stats.set_table_prefs(user, 't1', 'table_cached_blocks', 16);
        dbms_stats.gather_index_stats(user, 't1_i1');
end;
/

Setting the table_cached_blocks preference for a table means Oracle will remember a history of recently “visited” table blocks as it walks an index to generate the clustering_factor. Although there is no official word, I think that 16 is a good default value for this setting in single instance Oracle, and 16 * {number of instances} might be appropriate for RAC.

The fragment above is just for demo purposes – you wouldn’t set the value every time you gather stats on an index, it’s a one-off exercise for each table – though you could change it for all existing tables in a schema with a single call to dbms_stats.set_schema_prefs().

The reason why this preference becomes more important in 12c is that the more expensive an execution path is the longer the optimizer will work to find a better path – and bad settings for the clustering_factor make otherwise good paths looks expensive and encourage the optimizer to try all sorts of before deciding on the final path.

Wrap-up

This is just a brief note to pick up a few key points that you can identify as potential threats before you start upgrading. Obviously it will also be of some help after you’ve upgraded (and are still in test mode) to help you explain some of the changes to execution plans and increases in parse times that you are likely to see. The benefit of pre-emptive action, though, is that you may be able to minimise the number of hidden actions (creation of directives and extended stats) that Oracle takes during your testing. It used to be hard enough doing an upgrade in the past, going up to 12c it can be even harder because Oracle may have changed your test database behind your back when you weren’t looking leaving you running a production system that doesn’t match the system you’ve tested.

** Footnote: if you went to Tim Hall’s presentation you’ll know that part of your upgrade process should deal with the move to the multi-tenant architecture, even if you adopt just the single-PDB licence.

 

December 9, 2015

12c Scalar Subquery

Filed under: 12c,Execution plans,Oracle — Jonathan Lewis @ 2:25 pm BST Dec 9,2015

Every version of the optimizer enhances existing mechanisms and introduces new features and 12c has introduced some of the most sophisticated transformation to date; in this note I want to demonstrate an enhancement to subquery unnesting that could give a significant performance boost to a certain query pattern but which might, unfortunately, result in worse performance.

Historically subquery unnesting turned subqueries (correlated or not) in the where clause into joins. In 12c subquery unnesting can also turn scalar subqueries in the select list into joins – we’ll discuss why this could be a good thing but might occasionally be a bad thing later on in the article, but let’s start with a test case.

Sample data.

In my demonstration I’m going to use three tables which, for convenience, are three clones of the same data.

create table t1
as
with generator as (
	select
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum				id,	
	mod(rownum-1,100)		mod100,
	trunc((rownum - 1)/100)		trunc100,
	rownum				n1,
	rownum				n2,
	lpad(rownum,6,'0')		vc1,
	rpad('x',100)			padding
from
	generator
where
	rownum <= 10000
;

create table t2 as select * from t1;
create table t3 as select * from t1;

create index t1_i1 on t1(id);
create index t2_i1 on t2(id,mod100);
create index t3_i1 on t3(id,trunc100);

begin
	dbms_stats.gather_table_stats(user,'t1');
	dbms_stats.gather_table_stats(user,'t2');
	dbms_stats.gather_table_stats(user,'t3');
end;
/

I’ll be examining a query against t1 that includes two correlated scalar subqueries in the select list that reference one each of t2 and t3:


explain plan for
select
	/*+
		qb_name(main)
	*/
	n1, n2,
	(
		select	/*+ qb_name(sq1) */
			max(mod100)
		from	t2
		where	t2.id = t1.id
	) new_n1,
	(
		select	/*+ qb_name(sq2) */
			max(trunc100)
		from	t3
		where	t3.id = t1.id
	) new_n2
from
	t1
where
	t1.id between 101 and 200
;

select * from table(dbms_xplan.display);

11g Plan

This is the execution plan you might expect to see from 11g – in my case, with my system stats etc. and running 11.2.0.4:


--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |   101 |  1212 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)| T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE              |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                  |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)| T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID | T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN           | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

As you can see, the operations for the subqueries appear first in the plan (lines 1-3, and 4-6), with the operations for the main query appearing as the last section of the plan (lines 7-8). You might note that the total cost of the plan doesn’t cater for the cost of the subqueries – technically we might expect to see the optimizer producing a cost of something like 408 on the basis that it’s going to run each subquery an estimated 101 times and each subquery has a cost of 2, and the 101 rows are generated from a query with a cost of 4 giving: 4 + 101 * (2 + 2) = 408.

12c Plan

On the upgrade to 12c, the same code produces the following plan:


--------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  1 |  HASH JOIN OUTER                      |          |   101 |  6464 |     8   (0)| 00:00:01 |
|*  2 |   HASH JOIN OUTER                     |          |   101 |  3838 |     6   (0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |   101 |  1212 |     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_I1    |   101 |       |     2   (0)| 00:00:01 |
|   5 |    VIEW                               | VW_SSQ_2 |   101 |  2626 |     2   (0)| 00:00:01 |
|   6 |     HASH GROUP BY                     |          |   101 |   707 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                 | T2_I1    |   101 |   707 |     2   (0)| 00:00:01 |
|   8 |   VIEW                                | VW_SSQ_1 |   101 |  2626 |     2   (0)| 00:00:01 |
|   9 |    HASH GROUP BY                      |          |   101 |   707 |     2   (0)| 00:00:01 |
|* 10 |     INDEX RANGE SCAN                  | T3_I1    |   101 |   707 |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("ITEM_1"(+)="T1"."ID")
   2 - access("ITEM_2"(+)="T1"."ID")
   4 - access("T1"."ID">=101 AND "T1"."ID"<=200)
   7 - access("T2"."ID">=101 AND "T2"."ID"<=200)
  10 - access("T3"."ID">=101 AND "T3"."ID"<=200)

As you can see, the separate plans for the subqueries have disappeared and the plan is showing a three-table join (with two outer joins, and two of the “tables” being the non-mergeable view vw_ssq_2 and vw_ssq_1).

There are several details to pick up in this plan (apart from the unnesting). The cost is only 8 – but in this case it isn’t the effect of the optimizer “ignoring” the cost of the subqueries, it’s the optimizer correctly working out the cost of the unnested subqueries with joins. The cost happens to be low in this case because the optimizer has used transitive closure to pass the predicate from the driving query into the subqueries – so we need only do a couple of short index range scans to get all the data we need in the unnested subqueries.

The outer joins are needed because it is valid for the original scalar subquery mechanism to return no data for a subquery and still report a row (with nulls) for t1. If the rewrite didn’t introduce the outer join then t1 rows for which there were no matching t2 or t3 rows would disappear from the result set.

Threats and workarounds

In this (lightweight) example it looks as if this transformation is a good idea, but it’s always possible that the optimizer might choose to do this when it’s a bad idea. In fact, a quick check of the optimizer trace (10053) suggests that this is an uncosted transformation that will take place “because it can”. Here are six highly suggestive consecutive lines from the trace file:


SU: Unnesting query blocks in query block MAIN (#1) that are valid to unnest.
Subquery Unnesting on query block MAIN (#1)
SU: Performing unnesting that does not require costing.
SU: Considering subquery unnest on query block MAIN (#1).
SU:   Unnesting  scalar subquery query block SQ2 (#2)
Registered qb: SEL$2E540226 0x50b1a950 (SUBQ INTO VIEW FOR COMPLEX UNNEST SQ2)

Even if this transformation is cost-based rather than heuristic it’s always possible for the optimizer to make a very poor estimate of cost and do the wrong thing. Fortunately it’s possible to block the unnesting with the “traditional” /*+ no_unnest */ hint:


select
	/*+
		qb_name(main)
		no_unnest(@sq1)
		no_unnest(@sq2)
	*/
	n1, n2, 
...

With these hints in place the execution plan changes back to the 11g form – though there is a notable change in the estimated final cost of the query:


---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |   101 |  1212 |   206   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   2 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN (MIN/MAX)       | T2_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   4 |  SORT AGGREGATE                     |       |     1 |     7 |            |          |
|   5 |   FIRST ROW                         |       |     1 |     7 |     2   (0)| 00:00:01 |
|*  6 |    INDEX RANGE SCAN (MIN/MAX)       | T3_I1 |     1 |     7 |     2   (0)| 00:00:01 |
|   7 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |   101 |  1212 |     4   (0)| 00:00:01 |
|*  8 |   INDEX RANGE SCAN                  | T1_I1 |   101 |       |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T2"."ID"=:B1)
   6 - access("T3"."ID"=:B1)
   8 - access("T1"."ID">=101 AND "T1"."ID"<=200)

It’s a little surprising that the final cost is (4 + 202) rather than the  (4 + 404) that we calculated earlier, but a few variations on the test suggest that the optimizer is using half the cost of each of the scalar subqueries in the final cost estimate – perhaps as a nod in the direction of scalar subquery caching.

As always it is important to remember that you cannot look at a query like this and know which plan is going to perform better unless you know the data content, the pattern of data distribution, and the probably effect of scalar subquery caching. In some cases it may be that an extremely “lucky” data pattern will mean that a scalar subquery will run a very small number of times while, with the same data arranged in a different order the benefit of scalar subquery caching will disappear and the unnesting approach may be much better.

It is a convenience to be able to recognize that there is a new optimizer feature in 12c that might give you a performance boost but might hit you with an unlucky performance penalty (especially in cases, perhaps, where you have previously engineered your code very carefully to take advantage of scalar subquery caching) and know that there is a workaround available. In your search for pre-empting production performance problems, this code structure is another pattern to scan for in your source code.

December 3, 2015

Five Hints

Filed under: Hints,Oracle,Uncategorized — Jonathan Lewis @ 7:40 am BST Dec 3,2015

This is the content of a “whitepaper” I wrote for my presentation “Five Hints for Optimising SQL” at the recent DOAG conference.

Introduction

Adding hints to production code is a practice to be avoided if possible, though it’s easy to make the case for emergency patching, and hinting is also useful as the basis of a method of generating SQL Plan Baselines. However, notwithstanding (and sometimes because of) the continuing enhancements to the optimizer, there are cases where the only sensible option for dealing with a problem statement is to constrain the broad brush strategy that the optimizer can take in a way that allows it to find a reasonable execution plan in a reasonable time.

This note describes in some detail the use and effects of five of the “classic” hints that I believe are reasonable strategic options to redirect the optimizer when it doesn’t choose a path that you consider to be the most appropriate choice.

The Big Five

At the time of writing, a query against the view v$sql_hint on Oracle 12.1.0.2 reports 332 hints – but there are very few which we should really consider as safe for production code, and it’s best to view even those as nothing more than a medium-term tool to stabilise performance until the optimizer is able to do a better job with our SQL.

The handful of hints that I tend to rely on for solving problems is basically a set of what I call “structural” queries though in recent years it has become appropriate to label them as “query block” hints. These are hints that give the optimizer some idea of the shape of the best plan without trying to enforce every detail of how it should finalize the plan. The hints (with their negatives where appropriate) are:

  • Unnest / no_unnest — Whether or not to unnest subqueries
  • Push_subq / no_push_subq — When to handle a subquery that has not been unnested
  • Merge / no_merge — Whether to use complex view merging
  • Push_pred / no_push_pred — What to do with join predicates to non-merged views
  • Driving_site — Where to execute a distributed query

Inevitably there are a few other hints that can be very helpful, but a key point I want to stress is that for production code I avoid what I call “micro-management” hints (such as use_nl(), index_rs_asc()) – attempts to control the optimizer’s behaviour to the last little detail; it is very easy to produce massive instability in performance once you start down the path of micro-managing your execution plans, so it’s better not to try.

The rest of this document will be devoted to describing and give examples of these hints.

The Optimizer’s Strategy

You can think of the optimizer as working on a “unit of optimization” which consists of nothing more than a simple statement of the form:

select  list of columns
from    list of tables
where   list of simple predicates

To deal with a more complex query the optimizer stitches together a small number (reduced, if it had its way, to just one) of such simple blocks. So one of the first steps taken by the optimizer aims to transform your initial query into a this simple form. Consider this example:


select
        t1.*,v1.*,t4.*
from
        t1,
        (
        select
                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                select
                        null
                from    t5
                where   t5.id = t2.n1
                )
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        t4
where
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1
;

We have an inline view consisting of a two-table join with a subquery correlated to the first table, and from our perspective we have a “simple join” of three objects – t1, v1, and t4. Before it does anything else the optimizer will try to transform this into a straight-line five-table join so that it can join all the tables in order one after the other. As part of that process it will generally attempt to eliminate subqueries in a processing known as unnesting.

Looking at the query as it has been presented author of the code may have been thinking (symbolically) of the underlying problem as:

  • ( ( t1, ( ( t2, subquery t5 ), t3 ) ), t4 )

Take t1, join to it the result of applying the subquery to t2 and joining t3, then join t4.

The optimizer may decide to transform to produce the following:

  • ( ( ( ( t1, t2 ), t3 ), {unnested t5} ), t4 )

Join t2 to t1, join t3 to the result, join the transformed t5 to the result, then join t4 to the result.

If I decide that the original layout demonstrates the appropriate mechanism, my target is to supply the optimizer with just enough hints to lock it into the order and strategy shown, without trying to dictate every little detail of the plan. My hints would look like this:

select
        /*+
            qb_name(main) push_pred(v1@main)
            no_merge(@inline)
            no_unnest(@subq1) push_subq(@subq1)
        */
        t1.*,v1.*,t4.*
from
        t1,
        (
        select  /*+ qb_name(inline) */
                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                select  /*+ qb_name(subq1) */
                        null
                from    t5
                where   t5.id = t2.n1
                )
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        t4
where
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1
;

I’ve labelled the three separate select clauses with a query block name (qb_name() hint), told the optimizer that the query block named “inline” should be considered as a separately optimized block (no_merge(@inline)), and the subquery inside that block called “subq1” should be treated as a filter subquery (no_unnest(@subq1)) and applied as early as possible (push_subq(@subq1)).

In some circumstances I might use one more hint to tell the optimizer to consider a single join order: t1, v1, t4 using the hint /*+ leading(t1 v1 t4) */; but in this case I’ve told the optimizer to push the join predicate v1.n1 = t1.n1 inside the view (push_pred(@inline)) – which will make the optimizer do a nested loop from table t1 to view v1, resolving the view for each row it selects from t1.

Having captured 4 of the “big 5” hints in one sample statement, I’ll now comments on each of them (and the final driving_site() hint separately).

Merge / No_merge

This pair of hints apply particularly to “complex view merging”, but can be used to “isolate” sections of a query, forcing the optimizer to break one large query into a number of smaller (hence easier) sections. I see two main uses for the hints (and particularly the no_merge option) – one is to help the optimizer get started when handling a query with a large number of table, the other is simply to block a strategy that the optimizer sometimes chooses when it is a bad move.

Consider, in the first case, a query involving 20 tables, with several subqueries. With such a long list it is very easy for the optimizer to pick a very bad starting join order and never reach a good join order; moreover, because of the multiplicative way in which the optimizer estimates selectivity it’s very easy for the optimizer to decide after a few tables that the cardinality of the join so far is so small that it doesn’t really matter which table to access next. In cases like this we might start by writing a simpler query joining the first four of five tables that we know to be the key to the whole query – once we have got the core of the query working efficiently we can “wrap” it into an inline view with a no_merge hint, and then join the rest of the tables to it, with some confidence that the optimizer will start well and that it can’t go far wrong with the remainder of the tables so, for example

select  ...
from    t1, t2, t3, ..., t20
where   {various predicates}
and     exists {correlated subquery1}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}

Might become

with v1 as (
        select  /*+ no_merge cardinality(2000) */ ...
        from    t1, t2, t3, t4, t5
        where   {various predicates{
        and     exists {correlated subquery1}
)
select  ...
from    v1, t6, t7, ..., t20
where   {join conditions to v1}
and     {other join conditions}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}
;

I’ve written the example up using subquery factoring; in earlier versions of Oracle the relevant piece of code would have been written as an inline view, but the “with” clause can help to tidy the SQL up and make it easier to see the logic of what’s being done – provided the practice isn’t taken to such extremes that the final query consists of large number of very small factored subqueries.

I’ve included a cardinality() hint in the factored subquery – it’s not fully documented, and it’s not commonly realised that it can be applied to a query block rather than to a table or list of tables. This query block usage is probably the safest example of using the hint – the table-related usage is badly understood and prone to mis-use.

As an example of blocking a badly selected transformation, consider the following query (where I’ve already included qb_name() hints to name the two separate query blocks):

select  /*+ qb_name(main) */
        t1.vc1, avg_val_t1
from    t1,
        (
        Select  /*+ qb_name(inline) */
                id_parent, avg(val) avg_val_t1 
        from	t2
        group by
                id_parent
        ) v1
where	
        t1.vc2 = 'XYZ'
and     v1.id_parent = t1.id_parent
;

There are two basic strategies the optimizer could use to optimize this query, and the choice would depend on its estimate of how much data it had to handle . Whichever choice it makes we might, depending on our better understanding of the data, want it to choose the alternative (without rewriting the query beyond hinting it).

One option is for Oracle to execute the inline view to generate the aggregate data v1 then join the result to t1; the other is to join t2 (the view’s underlying table) to t1 and then work out an aggregation of the join that would give the same result.

If I want to “join then aggregate” I would use the merge hint, if I wanted to “aggregate then join” I would use the no_merge hint. There are three different ways in which I could introduce the hint:

  • In the inline view itself I could simply add the hint “merge”
  • In the main query I could reference the view by view name “no_merge(v1)”
  • In the main query I could reference the inline query block name “no_merge(@inline)”

Note particularly the “@” symbol that I use to point a hint at a query block; and note that this was not needed when I reference the view name. (The reference by query block name is the more modern, preferred strategy.)

Push_pred / No_push_pred

Once we start dealing with non-mergeable views and have to join to them there are two strategies that we could use for the join; the first is (nominally) to create the entire data set for the view and then use that in a merge join or hash join based on the join predicate, or we could “push a join predicate” into the view definition – in other words for each join value we could add a simple filter predicate to the view definition and derive the view result based on that predicate. For example, if we create a database view called avg_val_view with a definition matching the inline view we used in the previous example, we might see one of two possible execution plans for the following query:

select  t1.vc1, avg_val_t1
from    t1, avg_val_view
where   t1.vc2 = 'XYZ'
and     avg_val_view.id_parent = t1.id_parent
;

First – if the view is non-mergeable and we don’t push the predicate, we can see the join predicate appearing at operation 1, as we do a hash join between table t1 and the entire result set from aggregating t2. This may be sensible, but it may be very expensive to create the entire aggregate:

-------------------------------------------------------------------
| Id | Operation            | Name         | Rows  | Bytes | Cost  |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT     |              |     1 |    95 |    27 |
|* 1 |  HASH JOIN           |              |     1 |    95 |    27 |
|* 2 |   TABLE ACCESS FULL  | T1           |     1 |    69 |     2 |
|  3 |   VIEW               | AVG_VAL_VIEW |    32 |   832 |    24 |
|  4 |    HASH GROUP BY     |              |    32 |   224 |    24 |
|  5 |     TABLE ACCESS FULL| T2           |  1024 |  7168 |     5 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("AVG_VAL_VIEW"."ID_PARENT"="T1"."ID_PARENT")
   2 - filter("T1"."VC2"='XYZ')

So we may decide to add the hint /*+ push_pred(avg_val_view) */ to the query – we have to use the view-name method since we don’t have a query block containing the view; if we were using the inline view from the previous query we could have used the “query block” format /*+ push_pred(@inline) */. The plan from pushing predicates is:

--------------------------------------------------------------------
| Id | Operation               | Name        | Rows | Bytes | Cost |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT        |             |    1 |    82 |    7 |
|  1 |  NESTED LOOPS           |             |    1 |    82 |    7 |
|* 2 |   TABLE ACCESS FULL     | T1          |    1 |    69 |    2 |
|  3 |   VIEW PUSHED PREDICATE | AVG_VAL_VIEW|    1 |    13 |    5 |
|* 4 |    FILTER               |             |      |       |      |
|  5 |     SORT AGGREGATE      |             |    1 |     7 |      |
|* 6 |      TABLE ACCESS FULL  | T2          |   32 |   224 |    5 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."VC2"='XYZ')
   4 - filter(COUNT(*)>0)
   6 - filter("ID_PARENT"="T1"."ID_PARENT")

It would actually be a bad idea in this particular case, but if we could access the rows for a given id_parent in t2 efficiently this query could be much more efficient than the previous plan because it would only aggregate the small number of rows that it was going to need at each point, with the smallest row size.

You might note that Oracle has cleverly introduced a filter as operation 4 to eliminate t1 rows where the aggregate would return a row with a zero when there was no matching data. It’s details like this that typical programmers tend to forget when trying to transform SQL by hand.

Unnest / No_unnest

The optimizer prefers joins to subqueries, and will generally try to transform a query to turn a subquery into a join – which often means a semi-join for existence/in, or an anti-join for not exists/not in). As the optimizer has improved with version many such transformations (or decisions to not transform) changed from being driven by rules to being driven by cost – and sometimes we want to override the optimizer because we know its cost calculation is bad. Most commonly we might want to write a query with a subquery – to show our intentions – but tell the optimizer to unnest the subquery: it’s much safer to take this approach rather than to rewrite the query in unnested form ourselves – I’ve seen people do the rewrite incorrectly too many times to trust a user-created rewrite. For example:

select
        /*+ qb_name(main) unnest(@subq) */
        outer.* 
from 
        emp outer
where   outer.sal > (
                select 
                        /*+ qb_name(subq) unnest */
                        avg(inner.sal) 
                from    emp inner 
                where 
                inner.dept_no = outer.dept_no
        )
;

I’ve show the unnest hint here, and demonstrated the two possible forms – you can either use it in the main query block hint to point it at a give query block name (@subq), or you can use it without a “parameter” in the query block you want unnested. In effect the unnest hint causes Oracle to rewrite the query as:

select
        outer.* 
from
        (
        select 
                dept_no, avg(sal) av_sal
        from    emp 
        group by 
                dept_no
        )               inner,
        emp             outer
where 
        outer.dept_no = inner.dept_no
and     outer.sal > inner.av_sal
;

You’ll notice that this gives us an in-line aggregate view, so the optimizer could take (or be pushed) one more step into doing complex view merging as well, joining emp to itself before aggregating on a very messy set of columns.

Here’s the plan if we unnest:

----------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost  |
----------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |  1000 | 98000 |   114 |
|*  1 |  HASH JOIN           |         |  1000 | 98000 |   114 |
|   2 |   VIEW               | VW_SQ_1 |     6 |   156 |    77 |
|   3 |    HASH GROUP BY     |         |     6 |    48 |    77 |
|   4 |     TABLE ACCESS FULL| EMP     | 20000 |   156K|    36 |
|   5 |   TABLE ACCESS FULL  | EMP     | 20000 |  1406K|    36 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("ITEM_1"="OUTER"."DEPT_NO")
       filter("OUTER"."SAL">"AVG(INNER.SAL)")

Notice the appearance at operation 2 of a “view” names VW_SQ_1: there are a number of internal view names that appear in Oracle as it transforms queries – the fact that a view name starts with VW_ is a good clue that it’s an internal one. Note, in this particular case that the main work done in the query is the two tablescans of EMP.

Here’s the plan if we don’t unnest:

------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost  |
------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |   167 | 12024 |   252 |
|*  1 |  FILTER             |      |       |       |       |
|   2 |   TABLE ACCESS FULL | EMP  | 20000 |  1406K|    36 |
|   3 |   SORT AGGREGATE    |      |     1 |     8 |       |
|*  4 |    TABLE ACCESS FULL| EMP  |  3333 | 26664 |    36 |
------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("OUTER"."SAL"> (SELECT /*+ NO_UNNEST */
              AVG("INNER"."SAL") FROM "EMP" "INNER" 
              WHERE "INNER"."DEPT_NO"=:B1))
   4 - filter("INNER"."DEPT_NO"=:B1)

The FILTER at operation 1 tells us that nominally the optimizer will run the subquery once for every row in the emp table, but the optimizer costing (252) tells us that it thinks that really it will execute the table scan only 7 times in total (7 * 36 = 252): once for the driving scan and six more times because there are only six departments in my emp table.  (This “how many executions” type of estimate appeared in the costing calculations relatively recently.)

Push_subq / No_push_subq

Once we can control whether or not Oracle will unnest a subquery or run it as a filter we can then choose whether the subquery should run early or late. Historically the optimizer would always leave subqueries to the very end of query operation – but recently the choice of timing acquired a costing component. “Pushing” a subquery means pushing it down the execution tree – i.e. running it earlier in the plan. To demonstrate this we need a minimum of a two-table join with subquery:

select
        /*+ leading(t1 t2) push_subq(@subq) */
        t1.v1
from    t1, t3
where   t1.n2 = 15
and     exists (
                select  --+ qb_name(subq) no_unnest push_subq 
                        null
                from    t2
                where   t2.n1 = 15
                and     t2.id = t1.id
        )
and     t3.n1 = t1.n1
and     t3.n2 = 15
;

In this query I have a subquery where I’ve blocked unnesting, so it has to run as a filter subquery (in passing, I’ve use the alternative, less commonly known, format for hinting: the single-line hint/comment that starts with – – for a comment and – – + for a hint).

I’ve shown the push_subq hint (run the subquery early) in two different ways – first at the top of the query referencing the query block that I want pushed, and then in the subquery itself where it doesn’t need a parameter.

As you can see, the subquery is correlated to table t1 and I’ve told Oracle to examine only the join order t1 -> t3. The effect of the push_subq hint, therefore, is to tell Oracle to run the subquery for each row of t1 that it examines and join any survivors to t3. The alternative is for Oracle to join t1 to t3 and then run the subquery for every row in the result. Depending on the data and indexes available either option might be the more efficient.

Here are the two plans – first if I don’t push the subquery (note the FILTER operation):

--------------------------------------------------------------------
| Id | Operation                    | Name  | Rows | Bytes | Cost  |
--------------------------------------------------------------------
|  0 | SELECT STATEMENT             |       |    1 |    28 |   289 |
|* 1 |  FILTER                      |       |      |       |       |
|* 2 |   HASH JOIN                  |       |  173 |  4844 |   116 |
|* 3 |    TABLE ACCESS FULL         | T1    |  157 |  3140 |    57 |
|* 4 |    TABLE ACCESS FULL         | T3    |  157 |  1256 |    57 |
|* 5 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 6 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter( EXISTS (SELECT /*+ QB_NAME ("SUBQ2") NO_UNNEST */ 0
              FROM "T2" "T2" WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   2 - access("T3"."N1"="T1"."N1")
   3 - filter("T1"."N2"=15)
   4 - filter("T3"."N2"=15)
   5 - filter("T2"."N1"=15)
   6 - access("T2"."ID"=:B1)

Then if I push the subquery

--------------------------------------------------------------------
| Id |Operation                     | Name  | Rows | Bytes | Cost  |
--------------------------------------------------------------------
|  0 |SELECT STATEMENT              |       |    9 |   252 |   117 |
|* 1 | HASH JOIN                    |       |    9 |   252 |   115 |
|* 2 |  TABLE ACCESS FULL           | T1    |    8 |   160 |    57 |
|* 3 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 4 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |
|* 5 |  TABLE ACCESS FULL           | T3    |  157 |  1256 |    57 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T3"."N1"="T1"."N1")
   2 - filter("T1"."N2"=15 AND  EXISTS (SELECT /*+ QB_NAME ("SUBQ2")
              PUSH_SUBQ NO_UNNEST */ 0 FROM "T2" "T2" 
              WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   3 - filter("T2"."N1"=15)
   4 - access("T2"."ID"=:B1)
   5 - filter("T3"."N2"=15)

Notice how the access to t2 has squeezed itself between t1 and t3 and is also indented one place as a clue that it is a subordinate action on t1, but the FILTER operation visible in the previous plan has disappeared. This plan is an example of a plan that doesn’t follow the well-known “first child first / recursive descent” guideline – Oracle has hidden the FILTER operation and twisted the plan slightly out of its “tradiational” shape as a consequence.

Driving_site

The final hint is for distributed queries, and has no “negative” version. Sometimes the only way you can “tune” a distributed query is to minimise the time spent on network traffic, and this means dictating WHERE the query executes. The driving_site hint lets you make that choice. (Sometimes, having made that choice you also have to include a leading() hint to tell Oracle about the single join order you want it to consider – it’s possible for the optimizer to do some very strange things with distributed queries, especially if the instances have different NLS settings).

Consider the following query (I’ll fill in the XXXX in the hint shortly):

select  /*+ driving_site (XXXX) */
        dh.small_vc,
        da.large_vc
from
        dist_home               dh,
        dist_away@remote_db     da
where
        dh.small_vc like '1%'
and     da.id = dh.id;

This query extracts a small amount of data from a table called DIST_HOME in the local database, and joins it to some data in a table called DIST_AWAY in a remote database, producing a reasonably large number of medium-sized rows. There are basically two obvious plans:

  • nested loop – for each row in dist_home, query dist_away for matching data
  • hash join – create an in-memory hash table from the dist_home data, and then probe it with data from all the rows in dist_away.

The first plan will produce a large number of network round trips – so that’s not very good; the second plan will pull a very large amount of data from the remote database if the query operates at the local database (it’s only the columns we need, but it will be ALL the rows from the remote database).

Choosing the second plan but executing it at the remote database means we’ll send a small parcel of data to the remote database, do the join there to produce (we hope) a reasonable result set, then send it back to the local database. The network traffic will be minimised without causing an undesirable increase in other resource usage. To make this plan happen all I needed to do in the query was change the XXXX in the driving_site() hint to reference a table alias from a table in the remote database, in this case driving_site(da).

Here’s the execution plan:

-----------------------------------------------------------------------
| Id | Operation              | Name     | Rows | Bytes | Inst |IN-OUT|
-----------------------------------------------------------------------
|  0 | SELECT STATEMENT REMOTE|          |  216 | 48600 |      |      |
|* 1 |  HASH JOIN             |          |  216 | 48600 |      |      |
|  2 |   REMOTE               | DIST_HOME|  216 |  4320 |    ! | R->S |
|  3 |   TABLE ACCESS FULL    | DIST_AWAY| 2000 |   400K| TEST |      |
-----------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  1 - access("A1"."ID"="A2"."ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
  2 - SELECT "ID","SMALL_VC" FROM "DIST_HOME" "A2" WHERE "SMALL_VC" 
      LIKE '1%' (accessing '!')

Notice how the top line (id 0) includes the keyword REMOTE – this tells you that this is the plan from the viewpoint of the remote database/instance that will be executing it. Remember that from its viewpoint the database that we think is the local database is one that it thinks is remote – hence the REMOTE operation 2 which is addressing (our) local table DIST_HOME.

Other key points to note are the appearance of the Inst (instance) and IN-OUT columns. These tell you where each table is located – when a query executes remotely “our” database is tagged only by the name “!”.

A nice feature of the execution plan for a distributed query is that you can see how the query has been decomposed for execution at the remote site. In this case the other database will be sending our database the query at operation 2 to pull the rows it wants from small_vc so that it can do the join at its site and send the result back to us.

The thing you generally don’t want to see in more complex distributed queries is a separate query being generated for each remote table involved in the join – tables that live remotely should be joined remotely with just the join result being pulled back to the local database.

There is a special warning that goes with this hint – it isn’t valid for the select statements in “create as select” and “insert as select”. There seems to be no good reason for this limitation, but for CTAS and “insert as select” the query has to operate at the site of the table that is receiving the data. This means that you may be able to tune a naked SELECT to perform very well and then find that you can’t get the CTAS to use the same execution plan. A typical workaround to this problem is to wrap the select statement into a pipelined function and do a select from table(pipelined_function).

Conclusion

There are literally hundreds of hints available but, as a general guideline, there are only a few that are particularly useful and strategically sound. In this article I’ve listed the five hints that I’ve long considered to be the ones that are of most help and least risk. I have mentioned a couple of other hints in passing, and know that there are a couple of hints in the newer versions of Oracle that should eventually be added to the list; but the five I’ve mentioned give a sound basis to work from in understanding the benefits of using hints that shape the optimizer’s strategy for a query without trying to micro-manage it.

« Previous PageNext Page »

Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 6,668 other followers