Oracle Scratchpad

August 31, 2016

Parallel_index hint

Filed under: Bugs,CBO,Oracle,Parallel Execution — Jonathan Lewis @ 1:28 pm GMT Aug 31,2016

Prompted by a recent OTN posting I’ve dug out from my library the following demonstration of an anomaly with the parallel_index() hint. This note is a warning about how little we understand hints and what they’re supposed to mean, and how we can be caught out by an upgrade. We’ll start with a data set which, to match a comment made in the original posting rather than being a necessity for the demonstration, has an index that I’ve manipulated to be larger than the underlying table:


rem
rem     Script:         parallel_index_hint_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          December 2005
rem

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1e4)         modded,
        lpad(rownum,10,'0')     v1,
        lpad('x',30,'x')        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(modded) pctfree 75;
alter table t1 modify modded not null;
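
If you want to check the relative sizes for yourself once the data and index are in place, a couple of simple queries against the data dictionary will do (a sketch – adjust for owner and case as needed):

select  blocks          from    user_tables     where   table_name = 'T1';
select  leaf_blocks     from    user_indexes    where   index_name = 'T1_I1';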

Your figures may vary slightly if you try to recreate this model, but according to my stats blocks=7876 for the table and leaf_blocks=8054 for the index. Now I’m going to generate the execution plans for a simple query – count(modded) from the table – with a varying selection of hints.


select /*+ index_ffs(t1 t1_i1) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ index_ffs(t1 t1_i1) parallel_index(t1 t1_i1 20) */ count(modded) from t1;

select /*+ parallel_index(t1 t1_i1 20) parallel(t1 10) */ count(modded) from t1;

Pause for thought

  • which of the four statements will have a plan that uses an index fast full scan ?
  • which of the four statements will have a plan that indicates parallel execution ?
  • which of the four statements will indicate a parallel index fast full scan ?
  • why are two of the plans going to be identical but with different costs ?

The most interesting bit of this note is in the last question because it’s also the answer to a more subtle “why didn’t Oracle do what I thought it should” question. Here are the four plans I got from an instance of 11.2.0.4 (with a little labelling to remind us about the hinting):


==================================================
index_ffs() on its own - get index fast full scan
==================================================

-----------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Cost (%CPU)| Time     |
-----------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |  1043   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE       |       |     1 |            |          |
|   2 |   INDEX FAST FULL SCAN| T1_I1 |  1000K|  1043   (4)| 00:00:06 |
-----------------------------------------------------------------------

=====================================================
parallel_index() on its own
Get serial tablescan which is cheaper than serial FFS
=====================================================

-------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Cost (%CPU)| Time     |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |  1027   (4)| 00:00:06 |
|   1 |  SORT AGGREGATE    |      |     1 |            |          |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|  1027   (4)| 00:00:06 |
-------------------------------------------------------------------

========================================================
parallel_index() with index_ffs()
Get parallel fast full scan - at same cost as SERIAL FFS
========================================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |  1043   (4)| 00:00:06 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|  1043   (4)| 00:00:06 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

===============================================
parallel_index() with parallel()
Get parallel fast full scan - costed correctly.
(Not costed at the degree given for table).
===============================================

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |    58   (4)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10000 |     1 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR    |          |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       INDEX FAST FULL SCAN| T1_I1    |  1000K|    58   (4)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------

First plan – we supplied the index_ffs() hint, it’s valid: so Oracle did an index fast full scan. The serial cost was 1,043.

Second plan – we supplied the parallel_index() hint: but the optimizer apparently didn’t consider the cost of a parallel index fast full scan (which ought to have cost something like (1043/20)/0.9 = 58); instead it chose the serial tablescan at a cost of 1027 (cheaper than a serial index fast full scan because of the way I had defined a very large index).

Third plan – we supplied the parallel_index() hint with an explicit index_ffs() hint: both were legal so the optimizer obeyed the hints and produced a parallel index fast full scan (which is what we might have hoped would happen automatically for the second query). But the cost of the query is 1,043 – the cost of the serial index fast full scan.

Final plan – we didn’t hint an index_ffs(); we hinted parallel() and parallel_index(): “strangely” Oracle has selected the parallel index fast full scan – and this time the cost is 58 (which, as I pointed out above, is (1043/20)/0.9, the value it’s “supposed” to be). It looks as if Oracle uses the parallel scaling factor on the index only if the table is ALSO hinted (or defined) for parallel execution.

Note, by the way, that I’ve hinted the table parallel 10 so that the cost of the parallel tablescan (ca. (1027/10)/0.9 = 114) will be larger than the cost of the parallel index fast full scan. Normally you’re likely to have the same degree hinted for table and index, and the index is likely to be smaller than the table, which means that if you see this anomaly in production it’s more likely to be a case of Oracle using a parallel tablescan when you’re expecting a parallel index fast full scan – Oracle will have compared the serial cost of an index fast full scan against the parallel cost of the table.
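
As a quick sanity check of that arithmetic (the 0.9 is the scaling factor that normally shows up in parallel costing):

select
        round((1043/20)/0.9)    parallel_ffs_cost,      -- serial index FFS cost / degree 20 / 0.9 = 58
        round((1027/10)/0.9)    parallel_ts_cost        -- serial tablescan cost / degree 10 / 0.9 = 114
from
        dual
;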

Footnote:

When I ran the script on 12c there was one important change. The cost of the “correctly costed” index fast full scan came out at 103 rather than 58. Oracle has used the degree from the parallel() hint for the tablescan to calculate the cost of the parallel index fast full scan. If you are unlucky enough to have some code that has conflicting degrees hinted (or defined) across tables and indexes then you may see some plans change because parallel index fast full scans suddenly change their cost.

Addendum (prompted by Randolf Geist’s comment): The plan in 12c also had a Note: “Degree of Parallelism is 4 because of table property”. This was a little odd because the degree of the table was 1 and the hint said 10, but the arithmetic of the tablescan had (as indicated) used a DOP of 4 – despite reporting Degree=0 in the 10053 trace file. The only way I could get the number 4 was to look at the CPU parameters – bearing in mind the critical parallel parameters:

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cpu_count                            integer     2
parallel_threads_per_cpu             integer     2

parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_io_cap_enabled              boolean     FALSE

With cpu_count = 2 and parallel_threads_per_cpu = 2 the default DOP works out as 2 * 2 = 4, which is where the Note’s figure comes from. The anomaly of “degree 10” appearing in the costing for the index fast full scan was echoed in the 10053 trace: the file reported “Degree: 20”, then showed the cost being divided by 10.

At run-time, though, the query DID run parallel 20 when hinted with both the parallel() and parallel_index() hints, and ran parallel 10 (while costing and repeating the note about parallel 4) when modified to have only the /*+ parallel(t1 10) */ hint.
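
If you want to check the run-time degree for yourself rather than infer it from the plan, a quick look at the PX session statistics immediately after running the query is the easiest option – a sketch:

select
        statistic, last_query, session_total
from
        v$pq_sesstat
;

The rows reporting server allocation (e.g. “Server Threads”) show what the last query actually used.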

June 10, 2016

Uniquely parallel

Filed under: 12c,distributed,Execution plans,Oracle,Parallel Execution,Performance,Upgrades — Jonathan Lewis @ 7:36 am GMT Jun 10,2016

Here’s a surprising (to me) execution plan from 12.1.0.2 – parallel execution to find one row in a table using a unique scan of a unique index – produced by running the following script (data creation SQL to follow):


set serveroutput off
set linesize 180
set trimspool on
set pagesize 60

alter session set statistics_level = all;

variable b1 number
exec :b1 := 50000

select /*+ parallel (3) */ id, v1 from t2 where id=:b1;

select * from table(dbms_xplan.display_cursor(null,null,'allstats parallel'));

break on dfo_number skip 1 on tq_id skip 1 on server_type

select
        dfo_number, tq_id, server_type, instance, process, num_rows
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;

All I’ve done is enable rowsource execution statistics, set a bind variable to a value, query a table with a /*+ parallel(3) */ hint to find the one row that will be identified by primary key, and then reported the actual execution plan. When I first ran the test Oracle didn’t report the execution statistics correctly so I’ve also queried v$pq_tqstat to show the PX servers used and the flow of data through the plan. Here’s the plan, followed by the  results from v$pq_tqstat:


SQL_ID  0dzynh9d29pt9, child number 0
-------------------------------------
select /*+ parallel (3) */ id,v1 from t2 where id=:b1

Plan hash value: 247082613

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |          |      1 |        |        |      |            |      1 |00:00:00.02 |
|   1 |  PX COORDINATOR                   |          |      1 |        |        |      |            |      1 |00:00:00.02 |
|   2 |   PX SEND QC (RANDOM)             | :TQ10001 |      0 |      1 |  Q1,01 | P->S | QC (RAND)  |      0 |00:00:00.01 |
|   3 |    TABLE ACCESS BY INDEX ROWID    | T2       |      0 |      1 |  Q1,01 | PCWP |            |      0 |00:00:00.01 |
|   4 |     BUFFER SORT                   |          |      0 |        |  Q1,01 | PCWC |            |      0 |00:00:00.01 |
|   5 |      PX RECEIVE                   |          |      0 |      1 |  Q1,01 | PCWP |            |      0 |00:00:00.01 |
|   6 |       PX SEND HASH (BLOCK ADDRESS)| :TQ10000 |      0 |      1 |  Q1,00 | S->P | HASH (BLOCK|      0 |00:00:00.01 |
|   7 |        PX SELECTOR                |          |      0 |        |  Q1,00 | SCWC |            |      0 |00:00:00.01 |
|*  8 |         INDEX UNIQUE SCAN         | T2_PK    |      0 |      1 |  Q1,00 | SCWP |            |      0 |00:00:00.01 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access("ID"=:B1)

Note
-----
   - Degree of Parallelism is 3 because of hint

DFO_NUMBER      TQ_ID SERVER_TYP   INSTANCE PROCES   NUM_ROWS
---------- ---------- ---------- ---------- ------ ----------
         1          0 Producer            1 P003            0
                                          1 P004            1
                                          1 P005            0
                      Consumer            1 P000            0
                                          1 P001            1
                                          1 P002            0

                    1 Producer            1 P000            0
                                          1 P001            1
                                          1 P002            0
                      Consumer            1 QC              1

As you can see the table access follows a unique scan of an index and, although the rowsource execution stats report zero starts for the unique scan, we can see from v$pq_tqstat that slave P004 acquired a “row” (actually a rowid) and passed it to slave P001 which then acquired a row from the table and passed that row to the query coordinator. Oracle really did execute a parallel query, starting and stopping a total of 6 sessions to perform a single unique index access.

You’ll notice operation 7 is one you’ve only seen in the latest version of Oracle. The PX SELECTOR was introduced in 12c to reduce the number of times a complex parallel query would funnel into the query coordinator (parallel to serial) and then fan out again (serial to parallel) generating a new data flow operation tree (DFO tree) spawning one or two new parallel server groups as it did so. To stop this happening a step that needs to serialise in a 12c parallel plan can nominate one of the existing PX server processes (from each set, if necessary) to do the job so that the same set of PX servers can carry on running the query without the need for a new DFO tree to appear.

This enhancement to parallel execution plans is a good idea – except when it appears in my silly little query and turns something that ought to be quick and cheap into a job that is far more resource-intensive than it should be.

At this point, of course, you’re probably wondering what kind of idiot would put a parallel() hint into a query that was doing nothing but selecting one row by primary key – the answer is: “the Oracle optimizer in 12c”. I discovered this anomaly while creating a demonstration of the way that a distributed parallel query has to serialise through a single database link even if the operations at the two ends of the link run parallel. Here’s the SQL I wrote for the full demonstration:


rem     Script:         distributed_pq.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2016

define m_remote='test@loopback'
define m_remote='orcl@loopback'

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        cast(lpad(rownum,8,'0') as varchar2(8))         v1,
        cast(rpad('x',100) as varchar2(100))            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5
;

create table t2
as
select  *
from    t1
where   mod(id,100) = 0
;

alter table t2 add constraint t2_pk primary key(id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

set serveroutput off

select
        /*+ parallel(3) */
        t1.v1, t2.v1
from
        t1,
        t2@&m_remote
where
        mod(t1.id,10) = 0
and     t2.id = t1.id
and     mod(to_number(t2.v1),10) = 1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats parallel'));

If you want to run this demo you’ll need to do something about formatting the output; more importantly you’ll have to create a loopback database link and set up a define identifying it at the lines where I’ve got orcl@loopback and test@loopback (which are my 12c and 11g loopback links respectively).
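
For reference, the loopback link itself needs nothing special – something like the following, where the user, password and service name are just placeholders for your own:

create database link orcl@loopback
connect to      my_user
identified by   my_password
using           'orcl'
;

define m_remote = 'orcl@loopback'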

Here’s the plan (with rowsource stats) I got from the 12c test:


----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |      1 |        |        |      |            |      0 |00:01:14.67 |       7 |      0 |
|   1 |  NESTED LOOPS         |          |      1 |     10 |        |      |            |      0 |00:01:14.67 |       7 |      0 |
|   2 |   PX COORDINATOR      |          |      1 |        |        |      |            |  10000 |00:00:00.11 |       7 |      0 |
|   3 |    PX SEND QC (RANDOM)| :TQ10000 |      0 |   1000 |  Q1,00 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |
|   4 |     PX BLOCK ITERATOR |          |      3 |   1000 |  Q1,00 | PCWC |            |  10000 |00:03:17.72 |    1745 |   1667 |
|*  5 |      TABLE ACCESS FULL| T1       |     39 |   1000 |  Q1,00 | PCWP |            |  10000 |00:00:00.06 |    1745 |   1667 |
|   6 |   REMOTE              | T2       |  10000 |      1 |        |      |            |      0 |00:01:14.44 |       0 |      0 |
----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access(:Z>=:Z AND :Z<=:Z)
       filter(MOD("T1"."ID",10)=0)

Remote SQL Information (identified by operation id):
----------------------------------------------------
   6 - SELECT /*+ SHARED (3) */ "ID","V1" FROM "T2" "T2" WHERE "ID"=:1 AND MOD(TO_NUMBER("V1"),10)=1
       (accessing 'ORCL@LOOPBACK' )

Note
-----
   - Degree of Parallelism is 3 because of hint

I have hacked this output a little – the “Remote SQL” section didn’t get reported by display_cursor(), so I’ve inserted the remote sql I got from a call to dbms_xplan.display() after using explain plan to generate a plan. Note the /*+ shared(3) */ hint that appears in the remote SQL – that’s the internal version of a parallel(3) hint.

In 11g the query completed in 2.4 seconds; in 12c it took nearly 75 seconds to run, thanks to the 12c enhancement that allowed it to obey the hint! Looking at the time column (and ignoring the anomalous 3:17 at operation 4 – which might roughly be echoing 3 * 1:14) we can see that the time goes on the calls to the remote database (and a check of v$session_event shows this time spent in “SQL*Net message from db link”), so the obvious thing to do is check what actually happened at the remote database. We can do that by searching the library cache for a recognizable piece of the remote SQL – here’s the SQL to do that, with the results from 11g followed by the results from 12c:


SQL> select sql_id, child_number, executions, px_servers_executions, sql_text from v$sql
  2  where sql_text like '%SHARED%' and sql_text not like 'select sql_id%';

11g results
SQL_ID        CHILD_NUMBER EXECUTIONS PX_SERVERS_EXECUTIONS
------------- ------------ ---------- ---------------------
SQL_TEXT
------------------------------------------------------------------------------------------------------------------------------------
c0f292z5czhwk            0      10000                     0
SELECT /*+ SHARED (3) */ "ID","V1" FROM "T2" "T2" WHERE MOD(TO_NUMBER("V1"),10)=1 AND "ID"=:1


12c results
SQL_ID        CHILD_NUMBER EXECUTIONS PX_SERVERS_EXECUTIONS
------------- ------------ ---------- ---------------------
SQL_TEXT
------------------------------------------------------------------------------------------------------------------------------------
7bk51w7vtagwd            0      10000                     0
SELECT /*+ SHARED (3) */ "ID","V1" FROM "T2" "T2" WHERE "ID"=:1 AND MOD(TO_NUMBER("V1"),10)=1

7bk51w7vtagwd            1          0                 59995
SELECT /*+ SHARED (3) */ "ID","V1" FROM "T2" "T2" WHERE "ID"=:1 AND MOD(TO_NUMBER("V1"),10)=1

It’s not surprising to see that the query has executed 10,000 times – that’s what we were told by the Starts statistic from dbms_xplan.display_cursor() – but 12c shows 60,000 (give or take a small error) PX server executions of the statement. That’s 10,000 executions * degree 3 * the 2 slave sets we saw in my original execution plan. (It’s an odd little quirk of the two versions of Oracle that the order of predicates in the remote SQL was reversed between 11g and 12c – leading to two different SQL_IDs.)

By enabling rowsource execution stats at the system level I was able to capture the remote execution plan with its stats:


SQL_ID  7bk51w7vtagwd, child number 0
-------------------------------------
SELECT /*+ SHARED (3) */ "ID","V1" FROM "T2" "T2" WHERE "ID"=:1 AND
MOD(TO_NUMBER("V1"),10)=1

--------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name     | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |          |      0 |        |      0 |00:00:00.01 |       0 |
|   1 |  PX COORDINATOR                   |          |      0 |        |      0 |00:00:00.01 |       0 |
|   2 |   PX SEND QC (RANDOM)             | :TQ10001 |      0 |      1 |      0 |00:00:00.01 |       0 |
|*  3 |    TABLE ACCESS BY INDEX ROWID    | T2       |  29983 |      1 |      0 |00:00:22.21 |    1000 |
|   4 |     BUFFER SORT                   |          |  29995 |        |    999 |00:00:21.78 |       0 |
|   5 |      PX RECEIVE                   |          |  29924 |      1 |    994 |00:00:21.21 |       0 |
|   6 |       PX SEND HASH (BLOCK ADDRESS)| :TQ10000 |      0 |      1 |      0 |00:00:00.01 |       0 |
|   7 |        PX SELECTOR                |          |  29993 |        |    999 |00:00:06.08 |   19992 |
|*  8 |         INDEX UNIQUE SCAN         | T2_PK    |  29999 |      1 |   1000 |00:00:00.24 |   20000 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(MOD(TO_NUMBER("V1"),10)=1)
   8 - access("ID"=:1)

Unlike the test case I started with, this output did show the number of starts (with a few missing) and the work done across the slaves. Our index probe had to do two buffer gets on every execution, and we have 10,000 executions of the query so 20,000 buffer gets on the index unique scan. Even though only one slave actually does any work with the PX Selector, all three slaves in that set seem to “start” the relevant operations. The definition of the data meant that only one index probe in 10 was successful so we only visited 1,000 buffers from the table. If you’re familiar with reading plans with rowsource execution stats you’ll appreciate that something has gone wrong in the reporting here – that 1,000 at operation 3 should read 21,000 because it ought to include the 20,000 from the index scan (at least, that’s what a serial plan would do).
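
In case it saves someone a little effort, the “system level” capture I mentioned needs nothing more exotic than the following at the remote database (remember to set the parameter back afterwards – the overhead can be significant):

alter system set statistics_level = all;

-- run the driving query from the local database, then:

select * from table(dbms_xplan.display_cursor('7bk51w7vtagwd', 0, 'allstats parallel'));

alter system set statistics_level = typical;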

If you’re still wondering why running this query as a parallel query should take so long – after all it’s only 10,000 executions in 70 seconds – bear in mind that Oracle has to allocate and deallocate 6 PX servers to new sessions each time the remote statement starts; the instance activity stats showed “logons cumulative” going up by 60,000 each time I ran the driving query: that’s about 850 logons (and logoffs) per second. I don’t think my test machine would give a realistic impression of the impact of a couple of copies of this query running simultaneously, but when I tried it the contention that was introduced pushed the run time up to 93 seconds.

tl;dr

Watch out for poor performance becoming disastrous for distributed parallel queries when you upgrade from 11g to 12c.

 

 


May 11, 2016

dbms_xplan

Filed under: dbms_xplan,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 12:22 pm GMT May 11,2016

My favourite format options for dbms_xplan.display_cursor().

This is another of those posts where I tell you about something that I’ve frequently mentioned but never documented explicitly as a good (or, at least, convenient) idea. It’s also another example of how easy it is to tell only half the story when someone asks a “simple” question.

You’re probably familiar with the idea of “tuning by cardinality feedback” – comparing the predicted data volumes with the actual data volumes from an execution plan – and I wrote a short note last week about how to make that comparison. You’re probably also familiar with calling dbms_xplan.display_cursor() after enabling the capture of rowsource execution statistics (in one of three ways) for the execution of the query; the format parameter usually suggested for the call is ‘allstats last’, which reports the execution stats for the most recent execution of the query. I actually like to see the Cost column of the execution plan as well, so I usually add that to the format (with all three capture strategies shown here for an SQL*Plus environment):

set linesize 180
set trimspool on
set pagesize 60
set serveroutput off

alter session set "_rowsource_execution_statistics"=true;
alter session set statistics_level=all;

select /*+ gather_plan_statistics */ * from user_tablespaces;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

So what do we often forget to mention:

  • For SQL*Plus it is important to ensure that serveroutput is off
  • The /*+ gather_plan_statistics */ option uses sampling, so may be a bit inaccurate
  • The two accurate strategies may add a significant, sometimes catastrophic, amount of CPU overhead
  • This isn’t appropriate if the query runs parallel

For a parallel query the “last” execution of a query is typically carried out by the query co-ordinator, so the rowsource execution stats of many (or all) of the parallel execution slaves are likely to disappear from the output. If you’re testing with parallel queries you need to add some “tag” text to the query to make it unique and omit the ‘last’ option from the format string.
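
The “tag” can be as simple as an ordinary comment embedded in the statement – anything that changes the SQL text (and hence the sql_id) will do – for example:

select  /*+ parallel(2) */  /* tag: run_042 */
        object_type, sum(object_id)
from    t1
group by
        object_type
order by
        object_type
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats parallel'));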

Now, a common suggestion is that you need to add the ‘all’ format option instead – but this doesn’t mean “all executions”; it means (though it doesn’t actually deliver) all the data that’s available about the plan. So here’s an execution plan produced after running a parallel query and using ‘allstats all’ as the format option (t1 is a copy of all_objects, and this demo is running on 12.1.0.2).

SQL_ID  51u5j42rvnnfg, child number 1
-------------------------------------
select  /*+   parallel(2)  */  object_type,  sum(object_id) from t1
group by object_type order by object_type

Plan hash value: 2919148568

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |       |   113 (100)|          |        |      |            |     30 |00:00:00.04 |       5 |      0 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |       |            |          |        |      |            |     30 |00:00:00.04 |       5 |      0 |       |       |          |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |      0 |     30 |   420 |   113   (9)| 00:00:01 |  Q1,01 | P->S | QC (ORDER) |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   3 |    SORT GROUP BY         |          |      2 |     30 |   420 |   113   (9)| 00:00:01 |  Q1,01 | PCWP |            |     30 |00:00:00.01 |       0 |      0 |  2048 |  2048 |     2/0/0|
|   4 |     PX RECEIVE           |          |      2 |     30 |   420 |   113   (9)| 00:00:01 |  Q1,01 | PCWP |            |     50 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX SEND RANGE       | :TQ10000 |      0 |     30 |   420 |   113   (9)| 00:00:01 |  Q1,00 | P->P | RANGE      |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       HASH GROUP BY      |          |      2 |     30 |   420 |   113   (9)| 00:00:01 |  Q1,00 | PCWP |            |     50 |00:00:00.05 |    1492 |   1440 |  1048K|  1048K|     2/0/0|
|   7 |        PX BLOCK ITERATOR |          |      2 |  85330 |  1166K|   105   (2)| 00:00:01 |  Q1,00 | PCWC |            |  85330 |00:00:00.03 |    1492 |   1440 |       |       |          |
|*  8 |         TABLE ACCESS FULL| T1       |     26 |  85330 |  1166K|   105   (2)| 00:00:01 |  Q1,00 | PCWP |            |  85330 |00:00:00.01 |    1492 |   1440 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$1
   8 - SEL$1 / T1@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access(:Z>=:Z AND :Z<=:Z)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - "OBJECT_TYPE"[VARCHAR2,23], SUM()[22]
   2 - (#keys=0) "OBJECT_TYPE"[VARCHAR2,23], SUM()[22]
   3 - (#keys=1; rowset=200) "OBJECT_TYPE"[VARCHAR2,23], SUM()[22]
   4 - (rowset=200) "OBJECT_TYPE"[VARCHAR2,23], SYS_OP_MSR()[25]
   5 - (#keys=1) "OBJECT_TYPE"[VARCHAR2,23], SYS_OP_MSR()[25]
   6 - (rowset=200) "OBJECT_TYPE"[VARCHAR2,23], SYS_OP_MSR()[25]
   7 - (rowset=200) "OBJECT_ID"[NUMBER,22], "OBJECT_TYPE"[VARCHAR2,23]
   8 - (rowset=200) "OBJECT_ID"[NUMBER,22], "OBJECT_TYPE"[VARCHAR2,23]

Note
-----
   - Degree of Parallelism is 2 because of hint


48 rows selected.

You’ll notice we’ve reported the “alias” and “projection” information – those are two of the format options that you can use with a + or – to include or exclude if you want. We’ve also got E-Bytes and E-time columns in the body of the plan. In other words (at least in my opinion) we’ve got extra information that makes the output longer and wider and therefore harder to read.

The format string I tend to use for parallel query is ‘allstats parallel cost’ – which (typically) gives something like the following:

SQL_ID  51u5j42rvnnfg, child number 1
-------------------------------------
select  /*+   parallel(2)  */  object_type,  sum(object_id) from t1
group by object_type order by object_type

Plan hash value: 2919148568

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |   113 (100)|        |      |            |     30 |00:00:00.04 |       5 |      0 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |            |        |      |            |     30 |00:00:00.04 |       5 |      0 |       |       |          |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |      0 |     30 |   113   (9)|  Q1,01 | P->S | QC (ORDER) |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   3 |    SORT GROUP BY         |          |      2 |     30 |   113   (9)|  Q1,01 | PCWP |            |     30 |00:00:00.01 |       0 |      0 |  2048 |  2048 |     2/0/0|
|   4 |     PX RECEIVE           |          |      2 |     30 |   113   (9)|  Q1,01 | PCWP |            |     50 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX SEND RANGE       | :TQ10000 |      0 |     30 |   113   (9)|  Q1,00 | P->P | RANGE      |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       HASH GROUP BY      |          |      2 |     30 |   113   (9)|  Q1,00 | PCWP |            |     50 |00:00:00.05 |    1492 |   1440 |  1048K|  1048K|     2/0/0|
|   7 |        PX BLOCK ITERATOR |          |      2 |  85330 |   105   (2)|  Q1,00 | PCWC |            |  85330 |00:00:00.03 |    1492 |   1440 |       |       |          |
|*  8 |         TABLE ACCESS FULL| T1       |     26 |  85330 |   105   (2)|  Q1,00 | PCWP |            |  85330 |00:00:00.01 |    1492 |   1440 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 2 because of hint


30 rows selected.

Of course you may prefer ‘allstats all’ – and sometimes I do actually want to see the alias or projection information – but I think there’s so much information available on the execution plan output that anything that makes it a little shorter, cleaner and tidier is a good thing.

You might have noticed, by the way, that the Buffers, Reads, and A-Time columns have still managed to lose information on the way up from operation 6; values that should have been carried up the plan have simply disappeared. Make sure you do a sanity check for disappearing numbers when you’re looking at more complex plans.

 

February 5, 2016

Parallel DML

Filed under: Execution plans,Oracle,Parallel Execution,Performance — Jonathan Lewis @ 1:02 pm GMT Feb 5,2016

A recent posting on OTN presented a performance anomaly when comparing a parallel “insert /*+ append */” with a parallel “create table as select”.  The CTAS statement took about 4 minutes, the insert about 45 minutes. Since the process of getting the data into the data blocks would be the same in both cases something was clearly not working properly. Following Occam’s razor, the first check had to be the execution plans – when two statements that “ought” to do the same amount of work take very different times it’s probably something to do with the execution plans – so here are the two statements with their plans:

First the insert, which took 45 minutes:

insert  /*+ append parallel(a,16) */ into    
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a, 
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |                  |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|*  4 |     HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

And here’s the ‘create table’ at 4:00 minutes:

create table dg.tiz_irdm_g02_cc 
nologging 
parallel 16 
compress for query high 
as
select
        /*+ parallel (a,16) parallel (b,16) */ 
        *
from    tgarstg.tst_irdm_g02_f01 a , 
        tgarstg.tst_irdm_g02_f02 b 
where
        a.ip_id = b.ip_id

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | CREATE TABLE STATEMENT           |                  |    13M|    36G|       |   397K  (1)| 00:00:14 |        |      |            |
|   1 |  PX COORDINATOR                  |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10002         |    13M|    36G|       |   255K  (1)| 00:00:09 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                    |                  |    13M|    36G|  1842M|   255K  (1)| 00:00:09 |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000         |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |                  |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       | 11465   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,02 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001         |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |                  |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 36706   (3)| 00:00:02 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

As you can see, the statements are supposed to operate with degree of parallelism 16, and we were assured that the pre-existing table had been declared as nologging with the same level of compression as that given in the CTAS so, assuming the queries did run with the degree expected, they should take virtually the same amount of time.

But there’s an important clue in the plan about why there was a difference, and why the difference could be so great. The first statement is DML, the second is DDL. Parallel DDL is enabled automatically; parallel DML has to be enabled explicitly, otherwise the select will run in parallel but the insert will be serialized. Look at operations 1 – 4 of the insert – the query co-ordinator does the “load as select” of the rowsource sent to it by the parallel execution slaves. Not only does this mean that one process (rather than 16) does the insert, you also incur all the extra time for the messaging, and the hash join (at operation 4) has to be buffered – which means a HUGE amount of data could have been dumped to disc by each slave before the join actually took place, then read back from disc, joined, and forwarded.
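
For reference, the enabling step is just a session-level command – a sketch based on the original insert:

alter session enable parallel dml;

insert  /*+ append parallel(a,16) */ into
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */
        *
from    tgarstg.tst_irdm_g02_f01 a,
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;

commit;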

Note that the hash join in the CTAS is not buffered – each slave does the join as the data arrives and writes the result directly to its local segment. Basically the insert could be doing something like twice the I/O of the CTAS (and this is Exadata, so reads from temp can be MUCH slower than the tablescans that supply the data to be joined).

So the OP checked, and found that (although he thought he had enabled parallel DML) he hadn’t actually done so. And after enabling parallel DML the timing was … just as bad. Ooops!! Something else must have gone wrong. Here’s the plan after enabling parallel DML:


--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10003         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                     |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,03 | PCWP |            |
|   5 |      PX SEND RANDOM LOCAL          | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->P | RANDOM LOCA|
|*  6 |       HASH JOIN BUFFERED           |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE                  |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH               | :TQ10000         |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR         |                  |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F02 |    13M|    14G|       |  5732   (5)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE                  |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH               | :TQ10001         |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR         |                  |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS STORAGE FULL| TST_IRDM_G02_F01 |    13M|    21G|       | 18353   (3)| 00:00:01 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------------

As you can see, line 3 has the LOAD AS SELECT after which the slaves message the query co-ordinator – so the DML certainly was parallel, even though it wasn’t any faster. But why is the hash join (line 6) still buffered, and why is there an extra data flow (lines 5 and 4 – PX SEND RANDOM LOCAL / PX RECEIVE)? The hash join has to be buffered because of that extra data flow (which suggests that the buffering and messaging could still be the big problem) – but WHY is the data flow there at all? It shouldn’t be.

At this point I remembered that the first message in the thread had mentioned testing partitioned tables as well as non-partitioned tables – and if you do a parallel insert to a partitioned table, the data is going to be spread across several partitions, and the number of partitions is not a good match for the degree of parallelism, then you’re likely to see an extra stage of data distribution as Oracle tries to share the data and the partitions as efficiently as possible across the slaves. One of the possible distribution methods is “local random” – which is fairly likely to appear if the number of slaves is larger than the number of partitions. This behaviour can be modified with the newer “single distribution” version of the pq_distribute hint. So I asked the OP if their latest test was on a partitioned table, and suggested they insert the hint /*+ pq_distribute(a none) */ just after the parallel hint.
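
In other words the suggestion was to change the insert to something like this (a sketch based on the original statement):

insert  /*+ append parallel(a,16) pq_distribute(a none) */ into
        dg.tiz_irdm_g02_cc  a
select
        /*+ parallel (a,16) parallel (b,16) */
        *
from    tgarstg.tst_irdm_g02_f01 a,
        tgarstg.tst_irdm_g02_f02 b
where   a.ip_id = b.ip_id
;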

The answer was yes, and the hint had the effect of dropping the run time down to 7 minutes – still not as good as the CTAS, but then the CTAS wasn’t creating a partitioned table so it’s still not a completely fair test. Here’s the (start of the) final plan:

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name             | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                   |                  |    13M|    36G|       |   127K  (1)| 00:00:05 |        |      |            |
|   1 |  PX COORDINATOR                    |                  |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10002         |    13M|    36G|       |   127K  (1)| 00:00:05 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT                  | TIZ_IRDM_G02_CC  |       |       |       |            |          |  Q1,02 | PCWP |            |
|*  4 |     HASH JOIN                      |                  |    13M|    36G|   921M|   127K  (1)| 00:00:05 |  Q1,02 | PCWP |            |

As you can see, we have a hash join that is NOT buffered; we don’t have a third distribution, and the slaves do the data load and then message the query co-ordinator.

It would be interesting to know if there was a significant skew in the data volumes that went into each partition of the partitioned table, and check where the time was spent for both the partitioned insert and the non-partitioned CTAS (and compare with a non-partitioned insert) – but real-world DBAs don’t necessarily have all the time for investigations that I do.

My reference: parallel_dml.sql

December 21, 2015

Parallel Plans

Filed under: 12c,Oracle,Parallel Execution — Jonathan Lewis @ 8:40 am GMT Dec 21,2015

This article was prompted by a pair of articles by Yasin Baskan of Oracle Corporation: PX Server Sets etc. and Multiple Parallelizers, plus a little extra prompting from a mistake that I made when reading the second of those two articles. The fact that I made a mistake is significant because, without it, I wouldn’t have created a model to check Yasin’s description of the parallel activity.

I want to examine the following query to find out the order of activity:


select
        distinct t1.n1
from
        t1
where 
        t1.n1 >= 500
and     t1.n2 > (select avg(t2.n2) from t2)
and     t1.n3 > (select avg(t3.n3) from t3)
;

I’m going to be using 12.1.0.2 for this demonstration (lots of changes appear in parallel execution as you move from 11g to 12c, so the version is most significant), with all default settings for the parallel-related parameters.

To get things started, here’s the code to generate the data that I used, and the serial execution plan:


create table t1 nologging as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum,1000)        n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t2 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4
;

create table t3 nologging as
with generator as ( 
        select  --+ materialize
                rownum id
        from dual 
        connect by 
                level <= 1e4
)
select
        rownum                  n1,
        rownum                  n2,
        rownum                  n3,
        lpad(rownum,10)         small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 2e4 ;

-- don't need to gather stats as they are generated on CTAS in 12c.
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |       |       |   150 (100)|          |
|   1 |  HASH UNIQUE         |      |    24 |   336 |   150   (4)| 00:00:01 |
|*  2 |   TABLE ACCESS FULL  | T1   |    25 |   350 |    51   (4)| 00:00:01 |
|   3 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   4 |     TABLE ACCESS FULL| T2   | 20000 |    97K|    49   (3)| 00:00:01 |
|   5 |    SORT AGGREGATE    |      |     1 |     5 |            |          |
|   6 |     TABLE ACCESS FULL| T3   | 20000 |    97K|    49   (3)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("T1"."N1">=500 AND "T1"."N2"> AND "T1"."N3">))

I’ve pulled the execution plan from memory using dbms_xplan.display_cursor(), which is why the predicate section shows the odd comparison of “t2.n2 > {nothing there}” and “t3.n3 > {nothing there}”. The predicate section produced by a simple call to explain plan would help us to fill in the details:


Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("T1"."N1">=500 AND "T1"."N2"> (SELECT AVG("T2"."N2") FROM
              "T2" "T2") AND "T1"."N3"> (SELECT AVG("T3"."N3") FROM "T3" "T3"))

We can infer from the predicate section that Oracle executes the t2 subquery first and the t3 subquery second; but there is a slight oddity in the shape of the execution plan if we apply the “first child first / recursive descent” approach to reading the plan. It looks as if operations 3 and 5 are children of operation 2 and should therefore run before operation 2. This is the consequence of “subquery pushing”. In fact the subqueries are both filter subqueries and, in principle (and ignoring scalar subquery caching for the moment), they will be run once for each row produced from the tablescan of operation 2.
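
For completeness, the fuller predicate section above comes from nothing more than the traditional explain plan / dbms_xplan.display() pairing:

explain plan for
select
        distinct t1.n1
from
        t1
where
        t1.n1 >= 500
and     t1.n2 > (select avg(t2.n2) from t2)
and     t1.n3 > (select avg(t3.n3) from t3)
;

select * from table(dbms_xplan.display);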

The activity is as follows:

  • start scanning table t1
  • when the first row appears where n1 >= 500 run the subquery against t2 to derive the average of t2.n2 (10000.5) and check to see if the n2 value of the current t1 row exceeds that value (it doesn’t);
  • continue scanning t1 checking for rows where n1 >= 500 and n2 >= 10000.5 (no need to rerun the subquery thanks to scalar subquery caching),
  • on finding the first row matching both predicates run the subquery against t3 to derive the average of t3.n3 (10000.5) and check to see if the n3 value of the current t1 row exceeds that value.
  • continue the scan of t1 checking for rows where n1 >= 500 and n2 >= 10000.5 and n3 >= 10000.5 (again scalar subquery caching means no further executions of either subquery).
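
For anyone who wants to repeat the exercise, the buffer cache flush and extended trace mentioned in the next paragraph need nothing more than something like this (the 10046 event at level 8 is just one of several ways of getting wait details into the trace file):

alter system flush buffer_cache;

alter session set events '10046 trace name context forever, level 8';

-- run the query, then:

alter session set events '10046 trace name context off';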

We can add weight to the argument that this is the order of activity by flushing the buffer cache and enabling extended tracing before running the query. Here’s a section of a trace file that demonstrates the point – with no edits beyond the insertion of a few blank lines:


PARSE #140345652175312:c=1000,e=1813,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=115251305,tim=134216063372
EXEC #140345652175312:c=0,e=62,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=115251305,tim=134216063544
WAIT #140345652175312: nam='SQL*Net message to client' ela= 3 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=134216063569
WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=131 blocks=1 obj#=133974 tim=134216063627
WAIT #140345652175312: nam='db file scattered read' ela= 315 file#=5 block#=132 blocks=112 obj#=133974 tim=134216064187

WAIT #140345652175312: nam='db file sequential read' ela= 9 file#=5 block#=515 blocks=1 obj#=133975 tim=134216064438
WAIT #140345652175312: nam='db file scattered read' ela= 115 file#=5 block#=516 blocks=112 obj#=133975 tim=134216064732
WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=628 blocks=12 obj#=133975 tim=134216066046
WAIT #140345652175312: nam='db file scattered read' ela= 123 file#=5 block#=642 blocks=112 obj#=133975 tim=134216066548
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=754 blocks=14 obj#=133975 tim=134216067760
WAIT #140345652175312: nam='db file scattered read' ela= 124 file#=5 block#=770 blocks=112 obj#=133975 tim=134216068153
WAIT #140345652175312: nam='db file scattered read' ela= 8 file#=5 block#=882 blocks=2 obj#=133975 tim=134216069677

WAIT #140345652175312: nam='db file scattered read' ela= 18 file#=5 block#=244 blocks=12 obj#=133974 tim=134216070049
WAIT #140345652175312: nam='db file scattered read' ela= 127 file#=5 block#=258 blocks=112 obj#=133974 tim=134216070382

WAIT #140345652175312: nam='db file sequential read' ela= 6 file#=5 block#=899 blocks=1 obj#=133976 tim=134216070812
WAIT #140345652175312: nam='db file scattered read' ela= 41 file#=5 block#=900 blocks=112 obj#=133976 tim=134216071028
WAIT #140345652175312: nam='db file scattered read' ela= 42 file#=5 block#=1012 blocks=12 obj#=133976 tim=134216072956
WAIT #140345652175312: nam='db file scattered read' ela= 215 file#=5 block#=1026 blocks=112 obj#=133976 tim=134216073642
WAIT #140345652175312: nam='db file scattered read' ela= 29 file#=5 block#=1138 blocks=14 obj#=133976 tim=134216076334
WAIT #140345652175312: nam='db file scattered read' ela= 154 file#=5 block#=1154 blocks=112 obj#=133976 tim=134216076825
WAIT #140345652175312: nam='db file scattered read' ela= 14 file#=5 block#=1266 blocks=2 obj#=133976 tim=134216081673

WAIT #140345652175312: nam='db file scattered read' ela= 24 file#=5 block#=370 blocks=14 obj#=133974 tim=134216082815
WAIT #140345652175312: nam='db file scattered read' ela= 144 file#=5 block#=386 blocks=112 obj#=133974 tim=134216083232
WAIT #140345652175312: nam='db file scattered read' ela= 20 file#=5 block#=498 blocks=12 obj#=133974 tim=134216084494

FETCH #140345652175312:c=16998,e=21096,p=1105,cr=1111,cu=0,mis=0,r=1,dep=0,og=1,plh=115251305,tim=134216084683

Object 133974 is table t1, object 133975 is table t2, and object 133976 is table t3.

As you can see, we start by scanning t1, then we scan t2 once, then we scan more of t1, then we scan t3 once, then we finish off scanning t1.

We could chase the detail a little further, of course – we could check the session statistics to make sure that we haven’t been using the buffer cache to scan t2 and t3 repeatedly, or we could enable rowsource execution statistics to report the number of times each operation started, or we could check the STAT# lines dumped a little later in the trace file and infer that the t2 and t3 tablescans happened just once each.
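The rowsource execution statistics option, for example, needs nothing more elaborate than this (the ‘allstats last’ format reports the Starts column that would confirm a single execution of each subquery):


alter session set statistics_level = all;

-- re-run the query, then report the last execution with its Starts column
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));
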

At this point I’ll mention my mistake: when I originally read Yasin’s version of the parallel equivalent of this plan my first thought was that the subquery sections of the plan operate from the bottom upwards (i.e. another one of those examples where “first child first” doesn’t apply) whereas Yasin described them as running top down. I was wrong – I was thinking of the case where the optimizer analyses the query from the bottom up, but it still produces a plan that then runs from the top down – but I didn’t realize my mistake until after I’d been poking around with the parallel version of the query and had made a surprising discovery.

Parallel Execution

So what happens when we manipulate this query into parallelism ? To make it a little easier to see some of the details I’ve added hints to each query block to use a different degree of parallelism; and then, because it all happens so fast, I’ve introduced a function call to slow down the progress of the query. Here’s the function I’ve created (you’ll find it elsewhere on the blog) and the final query I started testing with:


create or replace function wait_row (
        i_secs  number,
        i_return        number
) return number
parallel_enable
is
begin
        dbms_lock.sleep(i_secs);
        return i_return;
end;
/


select
        /*+ parallel(t1,5) tracking */ 
        distinct t1.n1
from
        t1
where   
        wait_row(0.01,t1.n2) > (select /*+ parallel(t2,3) */ avg(wait_row(0.01,t2.n2)) from t2)
and     wait_row(0.01,t1.n3) > (select /*+ parallel(t3,2) */ avg(wait_row(0.01,t3.n3)) from t3)
and     t1.n1 >= 1000
--      and     wait_row(0.01, t1.n1) >= 500
;

The wait_row() function simply waits for the number of seconds given in the first parameter then returns the (numeric) value of the second parameter. It calls the dbms_lock.sleep() function which has a granularity of 1/100th of a second, and since I’m planning to use it in a parallel query I’ve included the parallel_enable declaration.

Since the function has not been declared deterministic, or assigned to the pl/sql result cache, and since there are 20,000 rows in the t2 table the time to run the t2 subquery will be roughly 20,000 * 0.01 seconds, which (ignoring parallelism) would be about 200 seconds. I’ve actually hinted DOP = 3 for that subquery so I hope it will complete in about 70 seconds, and the subquery against t3 with hinted DOP = 2 should complete in about 100 seconds.

You’ll notice I have two possible predicates against t1 – the one shown uncommented here doesn’t call the wait_row() function, the other does. Here’s an interesting point, though, about the example that doesn’t call wait_row() – the simple predicate is going to eliminate ALL the data from t1 – there are no rows where n1 >= 1000 – and that’s an important point to remember.

Here’s the execution plan:


------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |       |       |    57 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ30001 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   4 |     PX RECEIVE                 |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,01 | PCWP |            |
|   5 |      PX SEND HASH              | :TQ30000 |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | P->P | HASH       |
|   6 |       HASH UNIQUE              |          |     1 |    14 |    57   (4)| 00:00:01 |  Q3,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR       |          |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL      | T1       |     1 |    14 |    11   (0)| 00:00:01 |  Q3,00 | PCWP |            |
|   9 |          SORT AGGREGATE        |          |     1 |     5 |            |          |  Q3,00 | PCWP |            |
|  10 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  11 |            PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|  12 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|  13 |              PX BLOCK ITERATOR |          | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 14 |               TABLE ACCESS FULL| T2       | 20000 |    97K|    18   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  15 |          SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|  16 |           PX COORDINATOR       |          |       |       |            |          |        |      |            |
|  17 |            PX SEND QC (RANDOM) | :TQ20000 |     1 |     5 |            |          |  Q2,00 | P->S | QC (RAND)  |
|  18 |             SORT AGGREGATE     |          |     1 |     5 |            |          |  Q2,00 | PCWP |            |
|  19 |              PX BLOCK ITERATOR |          | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWC |            |
|* 20 |               TABLE ACCESS FULL| T3       | 20000 |    97K|    27   (0)| 00:00:01 |  Q2,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N1">=1000 AND "WAIT_ROW"(.01,"T1"."N2")> AND "WAIT_ROW"(.01,"T1"."N3")>))
  14 - access(:Z>=:Z AND :Z<=:Z)
  20 - access(:Z>=:Z AND :Z<=:Z)

The odd “:Z>=:Z AND :Z <=:Z” predicate is an artefact of rowid-range scans in parallel tablescans, and we’ve seen the “disappearing subquery” problem with predicates pulled from memory so that’s ignorable. The interesting point is that the “T1.N1 >= 1000” predicate will eliminate all the data from t1 very quickly – so why did it take roughly 3 minutes and 30 seconds to run the query ?

Quick Answer – Oracle ran the two subqueries before it started the t1 tablescan !

Slow Answer – I need to start by filling in a lot of background information.

Notice how we have multiple DFO Trees in this execution plan. In the TQ column these are captured as (Q1,xx) (Q2,xx) and (Q3,xx).  In previous blogs I’ve explained how you can follow the order of activity by following the table queues (effectively the “xx” – also known as virtual tables, also known as Data Flow Operations) in order, but I haven’t explained how you deal with multiple DFO trees. It would be nice to think that DFO Trees implicitly told you the order of operation – unfortunately that’s not true (at least in current versions of Oracle), and this query demonstrates the point.

Table Queues, DFOs (data flow operations), and DFO Trees

The first of Yasin Baskan’s two articles describes the terminology to use when describing the movement of data through a parallel execution plan, but for the sake of familiarity I’ll walk through the plan above using the appropriate terms.  (Note: once upon a time the term “DFO tree” didn’t exist and the term DFO was used to describe what is now called a DFO tree, and a number of articles I’ve written in the past echo this out-dated usage).

Lines 1 – 8 show a DFO tree consisting of two DFOs – the tree is labelled as Q3. The first DFO in the tree (DFO 0) is described in lines 5 – 8 where one set of parallel execution slaves scans real table t1 and sends a data set to the other set of parallel execution slaves by writing into a virtual table (:TQ30000); the second DFO in the tree (DFO 1) is described in lines 1 – 4 where a set of parallel execution slaves reads from virtual table :TQ30000, does some work and sends a data set to the query co-ordinator by writing to virtual table :TQ30001.

Lines 9 – 14 show a second DFO tree consisting of a single DFO – this tree is labelled Q1;

Lines 15 – 20 show us a third DFO tree, labelled Q2, also consisting of a single DFO.

Essentially (as described by Yasin Baskan) the focal point of a single DFO (data flow operation) is a PX SEND, and the top of a DFO tree is a PX Coordinator.

The key to performance problems and interpreting tricky parallel execution plans is the fact that in the extreme case every DFO tree may operate at a different degree of parallelism, and all the DFO trees may be operating concurrently – an unlucky execution plan may acquire far more parallel execution slaves than you expect.

So what’s the order of execution in this query, how do we find out, how many parallel query slaves will we start (and stop) as the query runs, and what’s the maximum number of parallel query slaves that we will be holding concurrently ? There are lots of places you could look:

  • v$pq_tqstat might give you some idea of processes used and data moved – after you’ve run the query, but it’s a bit broken for multiple DFO trees.
  • v$px_sesstat will give you some information about the work done by currently live slave processes if you can query it while your query is still running (“set pause on” may help).
  • v$px_session joined to v$session_event will give you some information about the time lost to waits for currently live slave processes if you can query them while your query is still running (it would be nice if Oracle were to introduce a “v$px_sesevent”)
  • The SQL Monitor screen in OEM (or the dbms_sqltune.report_sql_monitor() procedure – a minimal call is sketched below) is fantastic (but a little broken) if you’ve paid the licences for the diagnostic and performance packs.
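For reference, a minimal sketch of that call (the substitution variable for the sql_id is just a placeholder):


set long 1000000
set longchunksize 1000000
set linesize 254
set pagesize 0

select
        dbms_sqltune.report_sql_monitor(
                sql_id       => '&m_sql_id',
                type         => 'TEXT',
                report_level => 'ALL'
        )
from    dual;
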

Here’s part of the text output from a call to dbms_sqltune.report_sql_monitor() for a slightly different query – one that doesn’t include any predicate at all on the t1 table:


SQL Plan Monitoring Details (Plan Hash Value=3828285674)
==================================================================================================================================================
| Id |            Operation             |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                  |          | (Estim) |      | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==================================================================================================================================================
|  0 | SELECT STATEMENT                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  1 |   PX COORDINATOR                 |          |         |      |         1 |   +256 |    11 |     1000 |       |          |                 |
|  2 |    PX SEND QC (RANDOM)           | :TQ30001 |      49 |   60 |         1 |   +256 |     5 |     1000 |       |          |                 |
|  3 |     HASH UNIQUE                  |          |      49 |   60 |         1 |   +256 |     5 |     1000 |    7M |          |                 |
|  4 |      PX RECEIVE                  |          |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  5 |       PX SEND HASH               | :TQ30000 |      49 |   60 |         1 |   +256 |     5 |     4506 |       |          |                 |
|  6 |        HASH UNIQUE               |          |      49 |   60 |        68 |   +189 |     5 |     4506 |    7M |          |                 |
|  7 |         PX BLOCK ITERATOR        |          |      50 |   14 |        68 |   +189 |     5 |    10000 |       |          |                 |
|  8 |          TABLE ACCESS FULL       | T1       |      50 |   14 |        68 |   +189 |    63 |    10000 |       |          |                 |
|  9 |           SORT AGGREGATE         |          |       1 |      |         1 |   +187 |     1 |        1 |       |          |                 |
| 10 |            PX COORDINATOR        |          |         |      |         1 |   +187 |     4 |        3 |       |          |                 |
| 11 |             PX SEND QC (RANDOM)  | :TQ10000 |       1 |      |         3 |   +185 |     3 |        3 |       |          |                 |
| 12 |              SORT AGGREGATE      |          |       1 |      |        74 |   +114 |     3 |        3 |       |          |                 |
| 13 |               PX BLOCK ITERATOR  |          |   20000 |   18 |        74 |   +114 |     3 |    20000 |       |          |                 |
| 14 |                TABLE ACCESS FULL | T2       |   20000 |   18 |        74 |   +114 |    37 |    20000 |       |          |                 |
| 15 |           SORT AGGREGATE         |          |       1 |      |         1 |   +118 |     1 |        1 |       |          |                 |
| 16 |            PX COORDINATOR        |          |         |      |         1 |   +118 |     3 |        2 |       |          |                 |
| 17 |             PX SEND QC (RANDOM)  | :TQ20000 |       1 |      |         2 |   +111 |     2 |        2 |       |          |                 |
| 18 |              SORT AGGREGATE      |          |       1 |      |       111 |     +2 |     2 |        2 |       |          |                 |
| 19 |               PX BLOCK ITERATOR  |          |   20000 |   27 |       111 |     +2 |     2 |    20000 |       |          |                 |
| 20 |                TABLE ACCESS FULL | T3       |   20000 |   27 |       111 |     +2 |    26 |    20000 |       |          |                 |
==================================================================================================================================================

There’s a little inaccuracy in timings – in the order of the odd second or two – probably because of the granularity used, but the columns “Start Active” and “Time Active (s)” are, nevertheless, very revealing. The first one tells you when a line in the plan first did some work, the second one tells you how long that line has been active. As I walk through the plan remember that the timing granularity is fairly crude, so the numbers don’t quite add up as nicely as we might like.

If you check lines 17 – 20 you can see that DFO tree 2 (:TQ2xxxx) was the first thing to go active with its PX slaves starting at time t = 2 seconds and running for 111 seconds to do that scan and initial aggregation; the PX send line started at time t = 111 seconds and ran for 2 seconds to pass the data from the slaves to the coordinator.

There’s a little glitch (or hand waving moment) around lines 15 and 16 where the PX Coordinator receives and aggregates data from the PX slaves – the co-ordinator seems to start doing this several seconds later than it should.

In lines 11 – 14 you can see that DFO tree 1 (:TQ1xxxx) was the second thing to go active with its PX slaves starting at time t = 114 (i.e. just after DFO tree 2 completes);  the full scan and initial aggregate ran for 74 seconds (taking us to time t = 188). The PX send started (a little early, this time) at t = 185 and ran for 3 seconds, with the co-ordinator starting at t = 187 and taking one second to receive and do the final aggregation.

In lines 2 – 8 we can see DFO tree 3 (:TQ3xxxx) starting up, but this DFO tree includes two DFOs. One set of slaves scans table t1 applying the predicates to filter the data then does the initial hash distinct before hash distributing the intermediate result to the next set of slaves that finish off the distinct aggregation.

When DFO :TQ30000 starts at t = 189, the previous DFO trees have completed and the PX coordinator has the results of the two subqueries which it passes to the slave set which can now scan and check for: “n1 >= 500 and  n2 > {derived constant} and n3 > {derived constant}”. The scan and initial aggregation takes 68 seconds completing at t = 255, and at t = 256 the results are distributed to the second set of PX slaves. The second set of slaves has virtually nothing to do and reports only 1 second of execution time before passing the data (DFO :TQ30001) to the query coordinator which does the final aggregation and report.

A key point to note is that this timing information tells us (a) that the subqueries both completed before the tablescan of t1 started, and (b) – perhaps more importantly – that we cannot rely on the TQXnnnn numbers to tell us the order of operation of the DFO trees. In this example DFO tree 2 ran first, DFO tree 1 ran second, and DFO tree 3 ran third – some queries might happen to show the names and the order in synch, but that would be a coincidence, not a deliberate design feature.

A little reminder, though – even though we cannot rely on the DFO tree names to tell us the ordering in which the DFO trees operate, when we look at the data flow operations inside the trees (i.e. the last few digits of a TQ name) the order of operation within a given tree matches the sequence indicated by the TQ name.

Counting Slaves.

There’s more in the Monitor report that can help us understand the workload – and in this case I’ve got a lucky coincidence to help, and a little surprise to go with it.  For parallel queries the report includes a summary of parallel execution activity, and this is what I got from my example:

Parallel Execution Details (DOP=5 , Servers Allocated=15)
===========================================================================================================================
|      Name      | Type  | Group# | Server# | Elapsed |   Cpu   | Concurrency | PL/SQL  |  Other   | Buffer | Wait Events |
|                |       |        |         | Time(s) | Time(s) |  Waits(s)   | Time(s) | Waits(s) |  Gets  | (sample #)  |
===========================================================================================================================
| PX Coordinator | QC    |        |         |    0.03 |    0.02 |             |         |     0.02 |     15 |             |
| p00a           | Set 1 |      1 |       1 |    0.27 |    0.27 |             |    0.10 |          |    163 |             |
| p00b           | Set 1 |      1 |       2 |    0.25 |    0.25 |             |    0.09 |          |    156 |             |
| p00c           | Set 1 |      1 |       3 |    0.26 |    0.26 |             |    0.09 |          |    156 |             |

| p00a           | Set 1 |      2 |       1 |    0.43 |    0.43 |             |    0.15 |          |    221 |             |
| p00b           | Set 1 |      2 |       2 |    0.44 |    0.44 |             |    0.15 |          |    221 |             |

| p000           | Set 1 |      3 |       1 |    0.00 |    0.00 |             |         |          |        |             |
| p001           | Set 1 |      3 |       2 |    0.00 |    0.00 |             |         |          |        |             |
| p002           | Set 1 |      3 |       3 |    0.01 |    0.01 |             |         |          |        |             |
| p003           | Set 1 |      3 |       4 |    0.01 |    0.01 |             |         |          |        |             |
| p004           | Set 1 |      3 |       5 |    0.01 |    0.01 |             |         |          |        |             |
| p005           | Set 2 |      3 |       1 |    0.26 |    0.26 |             |    0.09 |          |    113 |             |
| p006           | Set 2 |      3 |       2 |    0.24 |    0.24 |        0.00 |    0.08 |          |    108 |             |
| p007           | Set 2 |      3 |       3 |    0.25 |    0.25 |             |    0.09 |          |    117 |             |
| p008           | Set 2 |      3 |       4 |    0.25 |    0.25 |             |    0.09 |          |    108 |             |
| p009           | Set 2 |      3 |       5 |    0.25 |    0.25 |        0.00 |    0.09 |          |    117 |             |
===========================================================================================================================

I’ve inserted two blank lines breaking the summary down into three separate groups, which you can identify by the heading “Group#”. In this table we see Group 1 has one slave set of three slaves – which corresponds to the tablescan of t2; Group 2 consists of one slave set of two slaves – which corresponds to the tablescan of t3; and Group 3 has two slave sets of 5 slaves each – which correspond to the tablescan and aggregation of t1. The Group numbers appear to align correctly with the DFO tree numbers.

Another detail that stands out from this list is that slaves p00a and p00b are used in the tablescan of t3, and in the tablescan of t2 (where they are joined by slave p00c). In this example Oracle has reused the slaves from one subquery to run the next. However we can also see that the slaves p000 to p009 that are used for the tablescan of t1 don’t have names that overlap with the slaves used to scan t2 and t3 – which tells us that we have some (in this case two) DFO trees running concurrently.

Another key point in this article is that not only does a DOP of N mean we could run 2*N slaves concurrently in a parallel query, but if we have a query that breaks into a plan that uses multiple DFO trees we might have 2 * N * {number of DFO trees} slaves allocated and running concurrently. In this example the report header’s “Servers Allocated=15” is the effect in action: 5 + 5 for the two slave sets of the t1 DFO tree, plus 3 for the t2 scan and 2 for the t3 scan. (And, as we see here, it’s even possible that different DFO trees run with different DOPs, leaving Oracle very confused about what to report as the DOP of the query – dbms_xplan actually reported this one as DOP = 4 !)

A final implementation detail that is not necessarily obvious from this table, but which I can infer because I know the history of what was going on: although, as described above, the tablescan of t3 was the first activity that generated a row source, Oracle started up the 10 slaves for the parallel tablescan and aggregate of t1 before it started the two slaves it needed to scan t3. I can infer this because there were no slave processes active when I started running the query, so it is reasonable to assume that the slave names tell me something about the order in which they were allocated. In general you would not be able to notice this in a busy system where slaves were pre-allocated (parallel_min_servers != 0) and constantly being acquired and released. (I was also able to corroborate some of the details above by looking at v$px_sesstat and v$session_event for the relevant slave processes shortly after the query started.)
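For the record, a sketch of the kind of query that does the v$px_session / v$session_event cross-check, run from a second session while the statement is executing (the column list is my own choice):


select
        pxs.server_group, pxs.server_set, pxs.server#, pxs.sid,
        evt.event, evt.total_waits, evt.time_waited
from
        v$px_session    pxs,
        v$session_event evt
where
        evt.sid = pxs.sid
order by
        pxs.server_group, pxs.server_set, pxs.server#, evt.time_waited desc
;
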

Summary

There are a number of key points this example shows us about complex parallel queries:

  • A single query can produce multiple “DFO trees”
  • Each tree can run at a different degree of parallelism
  • The “tree number” (the X in TQXnnnn) may not reveal the order in which the trees are operated
  • The order of operation of the DFOs within a tree is revealed by the nnnn in TQXnnnn
  • It is not easy to see in a basic execution plan which DFO trees might be executing concurrently and which consecutively
  • PX server processes may be acquired by the query co-ordinator a long time before they actually become active

Footnote:

If you happen to have looked closely at the time information in the Parallel Execution Details you’ll note that the times are total rubbish when compared to the monitoring times. It looks as if the time spent in dbms_lock.sleep() has been ignored (perhaps as an “idle” wait).

 

July 15, 2015

PQ Index anomaly

Filed under: Indexing,Oracle,Parallel Execution — Jonathan Lewis @ 8:42 am GMT Jul 15,2015

Here’s an oddity prompted by a question that appeared on Oracle-L last night. The question was basically – “Why can’t I build an index in parallel when it’s a single column with most of the rows set to null and only a couple of values for the non-null entries”.

That’s an interesting question, since the description of the index shouldn’t produce any reason for anything to go wrong, so I spent a few minutes trying to emulate the problem. I created a table with 10M rows and a column that was 3% ‘Y’ and 0.1% ‘N’, then created and dropped an index in parallel a few times. The report I used to prove that the index build had run parallel showed an interesting waste of resources. Here’s the code to build the table and index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        case
                when mod(rownum,100) < 3 then 'Y'
                when mod(rownum,1000) = 7 then 'N'
        end                     flag,
        rownum                  id,
        rpad('x',30)            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7
;

-- gather stats here

explain plan for
create index t1_i1 on t1(flag) parallel 4 nologging
;

select * from table(dbms_xplan.display);

create index t1_i1 on t1(flag) parallel 4 nologging;

select index_name, degree, leaf_blocks, num_rows from user_indexes;
alter index t1_i1 noparallel;

As you can see, I’ve used explain plan to get Oracle’s prediction of the cost and size, then I’ve created the index, then checked its size (and set it back to serial from its parallel setting). Here are the results of the various queries (from 11.2.0.4) – it’s interesting to note that Oracle thinks there will be 10M index entries when we know that “completely null entries don’t go into the index”:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |    10M|    19M|  3073   (3)| 00:00:16 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |    10M|    19M|            |          |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD NON UNIQUE| T1_I1    |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |    10M|    19M|            |          |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------

Note
-----
   - estimated index size: 243M bytes

INDEX_NAME           DEGREE                                   LEAF_BLOCKS   NUM_ROWS
-------------------- ---------------------------------------- ----------- ----------
T1_I1                4                                                562     310000

Although the plan says it’s going to run parallel, and even though the index says it’s a parallel index, we don’t have to believe that the creation ran as a parallel task – so let’s check v$pq_tqstat, the “parallel query table queue” statistics – and this is the result I got:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12        528          4          0           0
                      Producer               1 P004               2786931   39161903          9          1           0
                                             1 P005               2422798   34045157         11          1           0
                                             1 P006               2359251   33152158         12          1           0
                                             1 P007               2431032   34160854         14          2           0
                      Consumer               1 P000               3153167   44520722          3          0           0
                                             1 P001               1364146   19126604          4          1           0
                                             1 P002               2000281   28045742          3          0           0
                                             1 P003               3482406   48826476          3          0           0

                    1 Producer               1 P000                     1        298          0          0           0
                                             1 P001                     1        298          0          0           0
                                             1 P002                     1        298          0          0           0
                                             1 P003                     1         48          0          0           0
                      Consumer               1 QC                       4       1192          2          0           0

Check the num_rows column – the first set of slaves distributed 10M rows and roughly 140MB of data to the second set of slaves – and we know that most of those rows will hold (null, rowid) which are not going to go into the index. 97% of the data that went through the message queues would have been thrown away by the second set of slaves, and “should” have been discarded by the first set of slaves.

As for the original question about the index not being built in parallel – maybe it was, but not very parallel. You’ll notice that the parallel distribution at operation 6 in the plan is “RANGE”. If 97% of your data is null and only 3% of your data is going to end up in the index then you’d need to run at higher than parallel 33 to see any long lasting executions – because at parallel 33 just one slave in the second set will get all the real data and do all the work of sorting and building the index while the other slaves will (or ought to) be just throwing their data away as it arrives. When you’ve got 500M rows with only 17M non-null entries (as the OP had) to deal with, maybe the only thing happening by the time you get to look might be the one slave that’s building a 17M row index.

Of course, one of the reasons I wanted to look at the row distribution in v$pq_tqstat was that I wanted to check whether I was going to see all the data going to one slave, or a spread across 2 slaves (Noes to the left, Ayes to the right – as they used to say in the UK House of Commons), or whether Oracle had been very clever and decided to distribute the rows by key value combined with rowid to get a nearly even spread. I’ll have to set up a different test case to check whether that last option is possible.
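The v$pq_tqstat report above comes from a query along the following lines – my reporting script adds some formatting, so treat this as a sketch:


select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
;
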

Footnote

There was another little oddity that might be a simpler explanation of why the OP’s index creation might actually have run serially. I dropped and recreated the index in my test case several times and at one point I noticed (from view v$pq_slave) that I had 16 slave processes live (though, at that point, IDLE). Since I was the only user of the instance my session should probably have been re-using the same set of slaves each time I ran the test; instead, at some point, one of my test runs had started up a new set of slaves. Possibly something similar had happened to the OP, and over the course of building several indexes one after the other his session had reached the stage where it tried to start “yet another” set of slaves, failed, and decided to run serially rather than reuse any of the slaves that were nominally available and IDLE.
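That check is nothing more elaborate than a quick look at the slave status between test runs – a sketch:


select
        slave_name, status
from
        v$pq_slave
order by
        slave_name
;
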

Update

It gets worse. I decided to query v$px_sesstat (joined to v$statname) while the query was running, and caught some statistics just before the build completed. Here are a few critical numbers taken from the 4 sessions that received the 10M rows and built the final index:

Coord   Grp Deg    Set  Sno   SID
264/1     1 4/4      1    1   265
---------------------------------
            physical writes direct                            558
            sorts (memory)                                      1
            sorts (rows)                                2,541,146

264/1     1 4/4      1    2    30
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,218,809

264/1     1 4/4      1    3    35
---------------------------------
            physical writes direct                          7,110
            physical writes direct temporary tablespace     7,110
            sorts (disk)                                        1
            sorts (rows)                                2,886,184

264/1     1 4/4      1    4   270
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,353,861

Not only did Oracle pass 10M rows from one slave set to the other, the receiving slave set sorted those rows before discarding them. One of the slaves even ran short of memory and spilled its sort to disc. And we can see (physical writes direct = 558) that just one slave was responsible for handling all the “real” data for that index.
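The figures above come from a formatting script; the underlying data is just v$px_sesstat joined to v$statname – a minimal sketch, restricting the statistic names to the handful reported:


select
        sst.qcsid, sst.server_group, sst.server_set, sst.server#, sst.sid,
        stn.name, sst.value
from
        v$px_sesstat    sst,
        v$statname      stn
where
        stn.statistic# = sst.statistic#
and     stn.name in (
                'physical writes direct',
                'physical writes direct temporary tablespace',
                'sorts (memory)', 'sorts (disk)', 'sorts (rows)'
        )
and     sst.value != 0
order by
        sst.qcsid, sst.server_group, sst.server_set, sst.server#, stn.name
;
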

 

Update 2

A couple of follow-ups on the thread have introduced some other material that’s worth reading.  An item from Mohamed Houri about what happens when a parallel slave is still assigned to an executing statement but isn’t given any work to do for a long time; and an item from Stefan Koehler about _px_trace and tracking down why the degree of parallelism of a statement was downgraded.

May 12, 2015

Parallel Query

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 7:22 pm GMT May 12,2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run parallel. Note, however, where the PX SEND QC and PX COORDINATOR operations appear – lines 7 and 8, and above those lines we see the two scalar subqueries.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list.  Is the manual “right but deceptive” or “wrong but giving the right impression” ?  Serialising on (just) the scalar subqueries can have a huge impact on the performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1, VW_SSQ_2 (SSQ presumably standing for “scalar subquery”). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour in a way that we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set the optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but this isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature so we could, in principle, disable just the one feature by setting that parameter to false; a more appropriate strategy would simply be to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both the subqueries or use the qb_name() hint to give the two subquery blocks names, and then use the /*+ no_unnest() */ hint with the “@my_qb_name” format at the top of the main query.
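A sketch of the two approaches (the query block names ssq1 and ssq2 are my own choice):


alter session set "_optimizer_unnest_scalar_sq" = false;

-- or, leaving the parameter alone, name the subquery blocks and block the unnesting:

select
        /*+ parallel(t1 2) no_unnest(@ssq1) no_unnest(@ssq2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;


Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism: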

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operator EXPRESSION EVALUATION at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at lines 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave is running its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel distribution: the producer is the QC in each of the first three table queues (the full scans and broadcasts of tables t1, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries; you have to look at the output from SQL trace (or something similar) and check the individual Rowsource Execution Statistics for the slaves and the co-ordinator to see which process actually ran the subqueries.
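If you do want to check, one approach (just a sketch – you need suitable privileges, and the tracefile_identifier may not reach the slave trace files in every version) is to enable extended SQL trace at the query co-ordinator; the PX slaves inherit the event, so each process writes its own trace file with its own STAT lines:

alter session set tracefile_identifier = 'px_subq_test';
alter session set events '10046 trace name context forever, level 8';

-- run the parallel query here and fetch all the rows

alter session set events '10046 trace name context off';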


February 12, 2015

Parallel rownum

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:27 am GMT Feb 12,2015

It’s easy to make mistakes, or overlook defects, when constructing parallel queries – especially if you’re a developer who hasn’t been given the right tools to make it easy to test your code. Here’s a little trap I came across recently that’s probably documented somewhere, which could be spotted easily if you had access to the OEM SQL Monitoring screen, but would be very easy to miss if you didn’t check the execution plan very carefully. I’ll start with a little script to generate some data:


create table t1 nologging
as
select * from all_objects where rownum <= 50000
;

insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create table t2 as select * from t1;
alter table t2 add id number(10,0);

All I’ve done is create some data – 800,000 rows – and then create a table to copy it to; and while I copy it I’m going to add a temporary id to the rows, which I’ll do with a call to rownum; and since there’s a lot of data I’ll use parallel execution:


alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.* , rownum from t1;

For the purposes of experiment and entertainment I’ve done something a little odd by supplying two possible degrees of parallelism, but this lets me ask the question: will this statement run parallel 3, parallel 4, both of the above, or neither ? (You may assume that I have enough parallel execution slaves available when the statement runs.)

The answer is both – because that rownum does something nasty to the execution plan (I didn’t include the 50,000 limit when creating t1 in my original test, which is why the plan reports 993K rows instead of 800,000):


--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT           |          |   993K|    92M|  1076   (1)| 00:00:13 |        |      |            |
|   1 |  PX COORDINATOR            |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ20001 |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT          | T2       |       |       |            |          |  Q2,01 | PCWP |            |
|   4 |     PX RECEIVE             |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | PCWP |            |
|   5 |      PX SEND ROUND-ROBIN   | :TQ20000 |   993K|    92M|  1076   (1)| 00:00:13 |        | S->P | RND-ROBIN  |
|   6 |       COUNT                |          |       |       |            |          |        |      |            |
|   7 |        PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |         PX SEND QC (RANDOM)| :TQ10000 |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | P->S | QC (RAND)  |
|   9 |          PX BLOCK ITERATOR |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------

See that “P->S” (parallel to serial) at operation 8. The select statement runs in parallel (degree 4) to scan the data, and then sends it all to the query co-ordinator to supply the rownum; then the query co-ordinator re-distributes the data (including rownum) to another set of slaves (S->P) to do the parallel (degree 3) insert. The P->S at operation 2 shows the parallel execution slaves passing details to the query co-ordinator of the private segments that they have created, so that the query co-ordinator can stitch the segments together into a single data segment for the table. (If you watch the run-time activity closely you’ll see the query co-ordinator doing a few local writes as it tidies up the header blocks of those private segments.)

There are two threats to this rownum detail. The first, of course, is that the operation essentially serialises through the query co-ordinator so it’s going to take longer than you might expect; secondly, an accident of this type is typically going to allocate twice as many parallel execution slaves as you might have expected – the select and the insert are two separate data flow operations (note how the Name column shows :TQ1xxxx and :TQ2xxxx), each gets its own set of slaves, and both sets of slaves are held for the duration of the statement. If this statement is demanding twice the slaves it should be using, then you may find that other statements that start running at the same time get their degree of parallelism downgraded because you’ve run out of PX slaves. Although the rownum solution is nice and clean – it requires no further infrastructure – you probably need to introduce a sequence (with a large cache) to get the same effect without losing parallelism.
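Here’s a minimal sketch of the sequence-based alternative – the sequence name, cache size and degree are my own choices, not part of the original test:

create sequence t2_seq cache 10000;

alter session enable parallel dml;

insert  /*+ parallel(t2 3) */ into t2
select  /*+ parallel(t1 3) */ t1.*, t2_seq.nextval
from    t1;

Each PX slave draws its own range of values from the sequence cache, so the ids will be unique but won’t arrive in any particular order – which is usually all you need from a temporary id.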

If you look at v$pq_tqstat after running this statement the results are a little disappointing – there are a few problems connecting lines from the plan with rows in the view – here’s my original output (and you’ll now see why I chose to have two different degrees of parallelism):


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0

         2          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                                             1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 QC                  993314  113479500        125         65           0

                    1 Consumer               1 QC                       3        393          2          1           0

The first problem is that the DFO_number reported in the view doesn’t match the :TQ1xxxx and :TQ2xxxx reported in the plan – the parallel 4 bit is the select, which is covered by :TQ10000, but it’s listed under DFO_Number = 2 in the view, while the insert is the parallel 3 bit, which is covered by :TQ20000 and :TQ20001 but listed under DFO_Number = 1.

More confusing, potentially, is that all the appearances of the query coordinator have been assigned to DFO_Number = 2. Ignoring the fact that the DFO_Number column switches the 1 and 2 from the plan, what we should see is as follows:

  • The consumer at line 16 is consuming from the 4 producers at lines 11 – 14.
  • The producer at line 15 is producing FOR the 3 consumers at lines 3 – 5
  • The consumer at line 18 is consuming from the producers at lines 7 – 9

Ideally (including the correction for the DFO_Number) I think the view content should be as follows:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                      Consumer               1 QC                  993314  113479500        125         65           0

         2          0 Producer               1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0
                      Consumer               1 QC                       3        393          2          1           0

Just don’t ask me why the waits and timeouts for the QC as producer seem to be counting backwards from 2^32.

November 21, 2014

Parallel Costs

Filed under: CBO,Oracle,Parallel Execution — Jonathan Lewis @ 11:36 am GMT Nov 21,2014

While creating a POC of a SQL rewrite recently I received a little surprise as I switched my query from serial execution to parallel execution and saw the optimizer’s estimated cost increase dramatically. I’ll explain why in a moment, but it made me think it might be worth setting up a very simple demonstration of the anomaly. I created a table t1 by copying view all_source – which happened to give me a table with about 100,000 rows and 1117 blocks – and then ran the query ‘select max(line) from t1;’ repeating the query with a /*+ parallel(t1 2) */ hint. From 11.2.0.4 here are the two execution plans I got:


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   153   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 99173 |   387K|   153   (4)| 00:00:01 |
---------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     4 |  1010   (1)| 00:00:05 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     4 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     4 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     4 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

It seems that the optimizer thinks that running the query parallel 2 will take five times as long as running it serially ! Is it, perhaps, some cunning fix to the optimizer that is trying to allow for process startup time for the parallel execution slaves ? Is it a bug ? No – it’s just that I hadn’t got around to setting my system stats and they were complete garbage thanks to various other tests I had been running over the previous couple of weeks. Critically, I had not adjusted the “parallel-oriented” statistics to be consistent with the “serial-oriented” stats.
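(If you want to make the figures consistent without gathering a fresh set of workload stats you can set them explicitly – this is just a sketch, and the values below are arbitrary examples rather than recommendations:)

begin
        dbms_stats.set_system_stats('CPUSPEED', 1000);
        dbms_stats.set_system_stats('SREADTIM', 5);
        dbms_stats.set_system_stats('MREADTIM', 10);
        dbms_stats.set_system_stats('MBRC',     16);
        dbms_stats.set_system_stats('MAXTHR',   100000000);
        dbms_stats.set_system_stats('SLAVETHR', 50000000);
end;
/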

Here, from the 10053 trace file for the parallel run, is the section on the Single Table Access costing, together with the system stats and base statistical information:


SYSTEM STATISTICS INFORMATION
-----------------------------
  Using WORKLOAD Stats
  CPUSPEED: 1110 millions instructions/sec
  SREADTIM: 4.540000 milliseconds
  MREADTIM: 12.440000 millisecons
  MBRC:     21 blocks
  MAXTHR:   3000000 bytes/sec
  SLAVETHR: 1000000 bytes/sec

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: T1  Alias: T1
    #Rows: 99173  #Blks:  1117  AvgRowLen:  76.00  ChainCnt:  0.00
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  Table: T1  Alias: T1
    Card: Original: 99173.000000  Rounded: 99173  Computed: 99173.00  Non Adjusted: 99173.00
  Access Path: TableScan
    Cost:  151.13  Resp: 1010.06  Degree: 0
      Cost_io: 147.00  Cost_cpu: 20826330
      Resp_io: 1007.76  Resp_cpu: 11570183

I’m going to walk through the optimizer’s calculations that got the serial I/O cost (cost_io: 147.00) and the parallel I/O cost (Resp_io: 1007.76), but before I do that I’ll point out how inconsistent some of the system stats are. The multiblock read time (mreadtim) is 12.44 milliseconds to read an average of 21 blocks (mbrc) which, converting to bytes per second, means (21 * 8192) * 1000/12.44 = 13,828,938 bytes per second; but the I/O rate for a single parallel execution slave (slavethr) is only 1,000,000 bytes per second – which means a single serial session can (apparently) tablescan nearly 14 times as fast as an individual parallel execution slave. It’s hardly surprising, then, that the optimizer thinks a serial tablescan will be faster than running parallel 2 – but let’s check exactly how the arithmetic falls out (there’s a quick SQL sanity check of the numbers after the two lists below).

Serial:

  • #Blks: 1117, MBRC = 21 => read requests = 1117/21 = 53.19
  • sreadtim = 4.54 milliseconds, mreadtim = 12.44 milliseconds = 2.74 * sreadtim
  • Cost = 53.19 * 2.74 + 1 (_tablescan_cost_plus_one = true) = 146.74 — which looks close enough.

Parallel:

  • #Blks: 1117, block size = 8KB => total I/O requirement = 9,150,464 bytes
  • slavethr: 1,000,000 bytes/sec, degree 2 => total throughput 2,000,000 bytes/sec => elapsed I/O time = 4.575232 seconds
  • sreadtim = 4.54 milliseconds => cost = 4.575232 / 0.00454 = 1007.76 QED.
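If you want to check the arithmetic for yourself, here’s a quick sanity test you can run from SQL*Plus – the literals are just the block count and system stats quoted above, the least() anticipates the maxthr limit I mention in the final thoughts below, and any small differences from the figures above are just rounding:

select
        round((1117/21) * (12.44/4.54) + 1, 2)                           serial_io_cost,
        round((1117 * 8192) / least(3000000, 2 * 1000000) / 0.00454, 2)  parallel_2_io_cost,
        round((1117 * 8192) / least(3000000, 3 * 1000000) / 0.00454, 2)  parallel_3_io_cost
from
        dual
;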

Two final thoughts:

First, if I increase the degree of parallelism to 3 the cost drops to 673 (671.84 plus a bit of CPU); if I increase the degree any further the cost doesn’t drop any further – that’s because the maxthr (maximum throughput) is 3,000,000. The optimizer uses the smaller of maxthr and (degree * slavethr) in the parallel arithmetic.

Secondly, and most importantly, the way the optimizer behaves will be affected by the various parameters relating to parallelism, so here’s a list of the relevant settings on the instance I was using when creating this example:


SQL> show parameter parallel

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      LOW
parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_execution_message_size      integer     16384
parallel_force_local                 boolean     FALSE
parallel_instance_group              string
parallel_io_cap_enabled              boolean     FALSE
parallel_max_servers                 integer     80
parallel_min_percent                 integer     0
parallel_min_servers                 integer     0
parallel_min_time_threshold          string      AUTO
parallel_server                      boolean     FALSE
parallel_server_instances            integer     1
parallel_servers_target              integer     32
parallel_threads_per_cpu             integer     2
recovery_parallelism                 integer     0

Note, particularly, that I have not enabled parallel_automatic_tuning.

For further details on parallel costs and the cost/time equivalence that the optimizer uses in recent versions, here’s a link to an article by Randolf Geist.

November 12, 2014

Parallel Fun

Filed under: Execution plans,Oracle,Parallel Execution,subqueries — Jonathan Lewis @ 4:42 pm GMT Nov 12,2014

As I write, there’s an ongoing thread on Oracle-L that started with the (paraphrased) question: “I’ve got this query that returns 7 million rows; when I change it to ‘select count(*)’ it returns in 4 seconds but when we display the full result set on screen it takes hours, and every second or two the screen pauses; how do I make it go faster.”

The general rapid response was: “You shouldn’t be running 7M rows to a screen – the time is the time for the network traffic and display.”

The first part of the statement is right – the second part is quite likely to be wrong, and there’s a very strong hint in the question that makes me say that: the “pauses every second or two”. Of course we don’t know what the OP isn’t telling us, and we don’t know how accurate he is in what he is telling us, so any ideas we have may be completely wrong. For example, we haven’t been given any idea of how long a “pause” is, we don’t really know how accurate that “second or two” might be and whether “every” is an exaggeration, and maybe the query is returning CLOB columns (and that could make a big difference to what you can do to improve performance).

If we take the statement at face value, though, there is one very obvious inference: although some of the time will be due to network traffic time, most of the time is probably due to Oracle doing something expensive for a significant fraction of the rows returned. The pattern of activity probably looks like this:

  • client: call server to fetch next array of rows
  • server: spend some time populating array  — this is where the client sees a pause
  • client: display result array
  • client: call server to fetch next array of rows
  •  etc…

Here’s a trivial example:

connect / as sysdba
set arraysize 500
set pagesize 40

select
        o1.spare1 ,
        (
        select  max((ctime))
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;

On my laptop, running an instance of 11.2.0.4 with about 80,000 rows in obj$ (and a lot of them owned by SYS), I can count seconds and find that (approximately) I alternate between one second watching results scrolling up the screen and one second waiting as the server generates the next 500 rows.

Of course it’s possible to argue that the problem really is the network and nothing but the network struggling to cope with the never-ending stream of little packets produced by 7M rows. Could there be a choke point that causes the data to stop and start with great regularity ? Maybe – but previous experience says probably not. I have experienced bad network problems in the past, but when they’ve occurred I’ve always observed extremely random stop/go behaviour. The regularity implied in the question makes the Oracle-based problem seem far more likely.

Conveniently a couple of people asked for more clues – like the query text and the execution plan; even more conveniently the OP supplied the answers in this response. Since the email format makes them a little hard to read I’ve copied them here:


SELECT  bunch of stuff.....,

        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr1.RELATED_SID
                        ||
                        ',')
                ORDER BY sr1.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr1
                WHERE   sr1.SID                    = slv.SID
                        AND sr1.RELATIONSHIP_LEVEL = '1'
                GROUP BY sr1.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL1,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr2.RELATED_SID
                        ||
                        ',')
                ORDER BY sr2.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr2
                WHERE   sr2.SID                    = slv.SID
                        AND sr2.RELATIONSHIP_LEVEL = '2'
                GROUP BY sr2.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL2,
        (
               SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr3.RELATED_SID
                        ||
                        ',')
                ORDER BY sr3.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr3
                WHERE   sr3.SID                    = slv.SID
                        AND sr3.RELATIONSHIP_LEVEL = '3'
                GROUP BY sr3.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL3,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr4.RELATED_SID
                        ||
                        ',')
                ORDER BY sr4.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr4
                WHERE   sr4.SID                    = slv.SID
                        AND sr4.RELATIONSHIP_LEVEL = '4'
                GROUP BY sr4.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL4,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr5.RELATED_SID
                        ||
                        ',')
                ORDER BY sr5.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr5
                WHERE   sr5.SID                    = slv.SID
                        AND sr5.RELATIONSHIP_LEVEL = '5'
                GROUP BY sr5.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL5
FROM    service_lookup slv
        LEFT JOIN service_location sl
        ON      sl.service_location_id = slv.service_location_id;

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1570133209

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |  7331K|  5593M|  1877   (5)| 00:00:01 |        |      |            |
|   1 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   2 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10000             |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWC |            |
|*  5 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWP |            |
|   6 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   7 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   8 |    PX SEND QC (RANDOM)           | :TQ20000             |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   9 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWC |            |
|* 10 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWP |            |
|  11 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  12 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  13 |    PX SEND QC (RANDOM)           | :TQ30000             |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | P->S | QC (RAND)  |
|  14 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWC |            |
|* 15 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWP |            |
|  16 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  17 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  18 |    PX SEND QC (RANDOM)           | :TQ40000             |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | P->S | QC (RAND)  |
|  19 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWC |            |
|* 20 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWP |            |
|  21 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  22 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  23 |    PX SEND QC (RANDOM)           | :TQ50000             |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | P->S | QC (RAND)  |
|  24 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWC |            |
|* 25 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWP |            |
|  26 |  PX COORDINATOR                  |                      |       |       |            |          |        |      |            |
|  27 |   PX SEND QC (RANDOM)            | :TQ60002             |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | P->S | QC (RAND)  |
|* 28 |    HASH JOIN RIGHT OUTER BUFFERED|                      |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  29 |     PX RECEIVE                   |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,02 | PCWP |            |
|  30 |      PX SEND HASH                | :TQ60000             |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | P->P | HASH       |
|  31 |       PX BLOCK ITERATOR          |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWC |            |
|  32 |        TABLE ACCESS STORAGE FULL | SERVICE_LOCATION     |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWP |            |
|  33 |     PX RECEIVE                   |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  34 |      PX SEND HASH                | :TQ60001             |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | P->P | HASH       |
|  35 |       PX BLOCK ITERATOR          |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWC |            |
|  36 |        TABLE ACCESS STORAGE FULL | SERVICE_LOOKUP       |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------

We have a simple two-table outer join, and five scalar subqueries in the select list. (Not being very familiar with the various XML calls, I had no idea of what the scalar subqueries were doing, or how they produced a result, beyond the fact that they were querying and aggregating multiple rows. In fact the combination of calls does much the same as listagg(), though it allows for a CLOB result (which could be part of the performance problem, of course) rather than being limited to a varchar2() result.)
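For reference, here’s a sketch of what one of those scalar subqueries might look like rewritten with listagg() – the table and column names are the ones in the OP’s query, the bind variable stands in for the correlation to slv.sid, and the separator form of listagg() removes the need for the trailing rtrim():

select
        listagg(sr1.related_sid, ',') within group (order by sr1.related_sid)
from
        service_relationship sr1
where
        sr1.sid                = :b1
and     sr1.relationship_level = '1'
group by
        sr1.sid
;

Remember, though, that listagg() is limited to a varchar2() result, so this version would raise an error if any of the concatenated lists got too long.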

Would you like to guess at this point why I constructed my demonstration query against obj$ the way I did when presenting the idea of high cost-per-row queries as a reason for regular pauses in the output ? The execution plan matched one of my two initial guesses about what the query was going to look like. When you “select count(*) from {this query}”, the optimizer will factor out the scalar subqueries and only have to count the result set from the hash join – and it might even manage to use a couple of parallel index fast full scans to get that result rather than doing the tablescans. When you run the query, though, you have to run the scalar subqueries.

If we trust the statistics, we have 5 subqueries to run for each row of the hash join – and the hash join is predicted to return 7.3 million rows. Given that the subqueries are all going to run parallel tablescans against a fairly large table (note – the cost of the tablescans on SERVICE_RELATIONSHIP is 368, compared to the cost of the tablescan on SERVICE_LOCATION which is 366 to return 3.1M rows) that’s an awful lot of work for each row returned – unless we benefit from an enormous amount of scalar subquery caching.

Here’s another performance threat that the plan shows, though: notice where the PX SEND QC operation appears – that means the PX slaves send their (7M) rows to the Query Co-ordinator and the QC is responsible for doing all the work of running the scalar subqueries. Another interesting little threat visible in the plan shows up in the TQ column – the plan uses six “data flow operations” (to use the original name for them; the terminology changed some time ago but survives in the column names of v$pq_tqstat). In principle each DFO could allocate two sets of PX slaves (and every DFO could have a different degree of parallelism); in this example DFO number 6 (the driving hash join) uses two sets of slaves, and the other five DFOs (the scalar subqueries) use a single set each. The upshot of this is that if the default degree of parallelism in play is N this query will allocate 7N parallel query slaves. It gets a little nastier than that, though (based on checking the output from v$sql_plan_monitor), because each time one of the scalar subqueries runs Oracle seems to allocate and deallocate the slaves that are supposed to run it – which is probably going to cause some contention if there are other parallel queries trying to run at the same time.
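If you want to see how many slaves a statement has actually allocated you could (as a sketch) query v$px_session from another session while the statement is running – the query co-ordinator itself usually shows up with a null server_group and server_set:

select
        qcsid, server_group, server_set, count(*)  slave_count
from
        v$px_session
group by
        qcsid, server_group, server_set
order by
        qcsid, server_group, server_set
;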

Optimisation

So what could you do with this query ? It depends on how much change you want to make to the code.

It’s possible that an index on service_relationship(relationship_level, sid) – with compress 1 – might help if it’s very precise, and if the target table stays in the buffer cache for the duration of the query – but, in the absence of scalar subquery caching, that could still leave the query co-ordinator executing 35 million (5 queries x 7 million rows) subqueries in a serialised process.
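As a sketch (the index name is mine, and you’d want to check the space and maintenance costs before creating anything like it on a production system):

create index sr_level_sid_i1
        on service_relationship(relationship_level, sid)
        compress 1
;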

A better bet may be to convert from subqueries to joins – remembering that the listagg() / xmlserialize() calls will require you to aggregate (which means sorting, in this case) an estimated 25 rows per driving row per relationship_level; in other words you may need to sort 7M * 125 = 875M rows – but at least you could do that in parallel, and there’s always the possibility that the estimated 25 drops off as you work through the different levels. You could choose to do 5 outer hash joins, or (as Iggy Fernandez outlined in the thread) you could do a single outer join with a decode on the relationship_level. Another variation on this theme (which would probably have a plan showing ‘join then aggregate’) would be to ‘aggregate then join’: it’s possible that creating a non-mergeable inline view for the 5 values of relationship_level from a single table access, aggregating it to produce the five required columns, then using the result in an outer join, would be the most efficient option. In the absence of a detailed understanding of the data volume and patterns it’s hard to make any prediction of which strategy would work best.
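As a sketch only (untested, using the OP’s table and column names, with listagg() standing in for the XML calls and only two of the five levels shown) the ‘aggregate then join’ variant might look something like this:

select  /*+ no_merge(agg) */
        slv.sid,
        agg.level1_list,
        agg.level2_list
        -- plus whatever else the original query selected from slv and sl
from
        service_lookup          slv
left join
        service_location        sl
on      sl.service_location_id = slv.service_location_id
left join
        (
        select
                sid,
                listagg(decode(relationship_level, '1', related_sid), ',')
                        within group (order by related_sid)     level1_list,
                listagg(decode(relationship_level, '2', related_sid), ',')
                        within group (order by related_sid)     level2_list
                -- levels 3, 4 and 5 follow the same pattern
        from
                service_relationship
        group by
                sid
        )                       agg
on      agg.sid = slv.sid
;

The decode() means a single pass of service_relationship can produce all five columns, and the aggregation can run in parallel before the outer join.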

Footnote:

I may be wrong in my analysis of this problem. When I first saw the question the reason for the performance pattern suggested an “obvious” design error in either the SQL or the infrastructure, and when I saw that the query and execution plan matched my prediction it became very hard for me to think that there might be some other significant cause.

There were a couple of interesting details in the execution plan that made me pursue the problem a little more. In the first case I built a very simple model to get an estimate of the time needed to display 7M rows of a reasonable width in SQL*Plus running across a typical LAN (my estimate was in the order of 45 minutes – not hours); then I spent a little more time (about 10 minutes) to build a model that reproduced the key features of the execution plan shown.

I then spent two or three hours playing with the model, and I’ll be writing a further blog with some of the results later on. One detail to carry away today, though, is that in 12c Oracle can do a new form of subquery unnesting which transformed the query from its 5 scalar subquery form into the seven table join form that was one of the suggestions made on the thread; even more interestingly, if I blocked the unnesting (to force the subquery execution) Oracle 12.1.0.2 came up with a new operator (EXPRESSION EVALUATION) that allowed it to run the subqueries from the PX slaves before passing the results to the query co-ordinator – in other words eliminating the serialisation point.

To be continued …

July 27, 2014

Parallel Plans

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 8:39 pm GMT Jul 27,2014

This is the directory for a short series I wrote discussing how to interpret parallel execution plans in newer versions of Oracle.

For other aspects of parallel execution the best, and most wide-ranging, source is Randolf Geist’s blog, which contains so much material that I’ve had to create a catalogue for my own convenience. All the following links point to work by Randolf.

Parallel Data Skew

One of the awkward problems you can encounter with parallel execution is data skew – which has the potential to make just one slave in a set do (almost) all the work, reducing the performance to something close to serial execution times. Randolf has written a series of articles on Parallel Skew that has been published by AllthingsOracle over the last few months.

XPLAN_ASH


Parallel Execution New Features (12c)


March 10, 2014

Parallel Execution – 5

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 1:30 pm GMT Mar 10,2014

In the last article (I hope) of this series I want to look at what happens when I change the parallel distribution method on the query that I’ve been using in my previous demonstrations. This was a query first introduced in a note on Bloom Filters (opens in a separate window) where I show two versions of a four-table parallel hash join, one using the broadcast distribution mechanism throughout, the other using the hash distribution method. For reference you can review the table definitions and plan (with execution stats) for the serial join in this posting (also opens in a separate window).
(more…)

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm GMT Mar 5,2014

One of the changes that appeared in 12c was a change to the BROADCAST distribution option for parallel execution. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:
(more…)

February 28, 2014

Empty Hash

Filed under: Bugs,CBO,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:45 pm GMT Feb 28,2014

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would still be executed even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider viewpoint – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago – I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere as I can’t find it on my blog or old website. The problem can be demonstrated by this example:

(more…)

February 26, 2014

Parallel Execution – 4

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 1:58 pm GMT Feb 26,2014

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

(more…)
