Oracle Scratchpad

July 15, 2015

PQ Index anomaly

Filed under: Indexing,Oracle,Parallel Execution — Jonathan Lewis @ 8:42 am BST Jul 15,2015

Here’s an oddity prompted by a question that appeared on Oracle-L last night. The question was basically – “Why can’t I build an index in parallel when it’s a single-column index with most of the rows set to null and only a couple of values for the non-null entries?”

That’s an interesting question, since nothing in the description of the index suggests a reason for anything to go wrong, so I spent a few minutes trying to emulate the problem. I created a table with 10M rows and a column that was 3% ‘Y’ and 0.1% ‘N’, then created and dropped an index on it in parallel a few times. The report I used to check that the index build really had run parallel showed an interesting waste of resources. Here’s the code to build the table and index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        case
                when mod(rownum,100) < 3 then 'Y'
                when mod(rownum,1000) = 7 then 'N'
        end                     flag,
        rownum                  id,
        rpad('x',30)            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7
;

-- gather stats here
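-- e.g. something like this (a sketch of one possibility, not necessarily the exact call used):
-- execute dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size 1')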

explain plan for
create index t1_i1 on t1(flag) parallel 4 nologging
;

select * from table(dbms_xplan.display);

create index t1_i1 on t1(flag) parallel 4 nologging;

select index_name, degree, leaf_blocks, num_rows from user_indexes;
alter index t1_i1 noparallel;

As you can see, I’ve used explain plan to get Oracle’s prediction of the cost and size, then I’ve created the index, then checked its size (and set it back to serial from its parallel setting). Here are the results of the various queries (from 11.2.0.4) – it’s interesting to note that Oracle thinks there will be 10M index entries when we know that “completely null entries don’t go into the index”:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |    10M|    19M|  3073   (3)| 00:00:16 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |    10M|    19M|            |          |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD NON UNIQUE| T1_I1    |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |    10M|    19M|            |          |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------

Note
-----
   - estimated index size: 243M bytes

INDEX_NAME           DEGREE                                   LEAF_BLOCKS   NUM_ROWS
-------------------- ---------------------------------------- ----------- ----------
T1_I1                4                                                562     310000

Although the plan says it’s going to run parallel, and even though the index says it’s a parallel index, we don’t have to believe that the creation ran as a parallel task – so let’s check v$pq_tqstat, the “parallel query table queue” statistics.
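Here’s a minimal sketch of the sort of query I run against the view immediately after the statement completes (my full script adds column and break formatting, which I’ve omitted here):

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
;

And this is the result I got: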


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12        528          4          0           0
                      Producer               1 P004               2786931   39161903          9          1           0
                                             1 P005               2422798   34045157         11          1           0
                                             1 P006               2359251   33152158         12          1           0
                                             1 P007               2431032   34160854         14          2           0
                      Consumer               1 P000               3153167   44520722          3          0           0
                                             1 P001               1364146   19126604          4          1           0
                                             1 P002               2000281   28045742          3          0           0
                                             1 P003               3482406   48826476          3          0           0

                    1 Producer               1 P000                     1        298          0          0           0
                                             1 P001                     1        298          0          0           0
                                             1 P002                     1        298          0          0           0
                                             1 P003                     1         48          0          0           0
                      Consumer               1 QC                       4       1192          2          0           0

Check the num_rows column – the first set of slaves distributed 10M rows and roughly 140MB of data to the second set of slaves – and we know that most of those rows will hold (null, rowid) which are not going to go into the index. 97% of the data that went through the message queues would have been thrown away by the second set of slaves, and “should” have been discarded by the first set of slaves.

As for the original question about the index not being built in parallel – maybe it was, but not very parallel. You’ll notice that the parallel distribution at operation 6 in the plan is “RANGE”. If 97% of your data is null and only 3% of it is going to end up in the index then you’d need to run at a degree higher than 33 before the real data would be spread across more than one slave – because at parallel 33 just one slave in the second set will get all the real data and do all the work of sorting and building the index, while the other slaves will (or ought to) be simply throwing their data away as it arrives. When you’ve got 500M rows with only 17M non-null entries to deal with (as the OP had), the only thing still happening by the time you get round to looking may be the one slave that’s building a 17M entry index.

Of course, one of the reasons I wanted to look at the row distribution in v$pq_tqstat was that I wanted to check whether I was going to see all the data going to one slave, or a spread across 2 slaves (Noes to the left, Ayes to the right – as they used to say in the UK House of Commons), or whether Oracle had been very clever and decided to distribute the rows by key value combined with rowid to get a nearly even spread. I’ll have to set up a different test case to check whether that last option is possible.

Footnote

There was another little oddity that might be a simpler explanation of why the OP’s index creation might actually have run serially. I dropped and recreated the index in my test case several times and at one point I noticed (from view v$pq_slave) that I had 16 slave processes live (though, at that point, IDLE). Since I was the only user of the instance my session should probably have been re-using the same set of slaves each time I ran the test; instead, at some point, one of my test runs had started up a new set of slaves. Possibly something similar had happened to the OP, and over the course of building several indexes one after the other his session had reached the stage where it tried to start “yet another” set of slaves, failed, and decided to run serially rather than reuse any of the slaves that were nominally available and IDLE.
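A query as simple as the following sketch is enough to keep an eye on how many slave processes are live and whether they are busy or idle:

select
        slave_name, status, sessions
from
        v$pq_slave
order by
        slave_name
;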

Update

It gets worse. I decided to query v$px_sesstat (joined to v$statname) while the index build was running, and caught some statistics just before the build completed.
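Here’s a sketch of the sort of join I used – the list of statistics and the formatting are simplified compared to my actual script:

select
        st.qcsid, st.server_group, st.degree, st.req_degree,
        st.server_set, st.server#, st.sid,
        sn.name, st.value
from
        v$px_sesstat    st,
        v$statname      sn
where
        sn.statistic# = st.statistic#
and     st.value != 0
and     sn.name in (
                'physical writes direct',
                'physical writes direct temporary tablespace',
                'sorts (memory)', 'sorts (disk)', 'sorts (rows)'
        )
order by
        st.qcsid, st.server_set, st.server#, sn.name
;

Here are a few critical numbers taken from the 4 sessions that received the 10M rows and built the final index: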

Coord   Grp Deg    Set  Sno   SID
264/1     1 4/4      1    1   265
---------------------------------
            physical writes direct                            558
            sorts (memory)                                      1
            sorts (rows)                                2,541,146

264/1     1 4/4      1    2    30
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,218,809

264/1     1 4/4      1    3    35
---------------------------------
            physical writes direct                          7,110
            physical writes direct temporary tablespace     7,110
            sorts (disk)                                        1
            sorts (rows)                                2,886,184

264/1     1 4/4      1    4   270
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,353,861

Not only did Oracle pass 10M rows from one slave set to the other, the receiving slave set sorted those rows before discarding them. One of the slaves even ran short of memory and had to spill its sort to disc. And we can see (physical writes direct = 558) that just one slave was responsible for handling all the “real” data for that index.

 

Update 2

A couple of follow-ups on the thread have introduced some other material that’s worth reading: an item from Mohamed Houri about what happens when a parallel slave is still assigned to an executing statement but isn’t given any work to do for a long time, and an item from Stefan Koehler about _px_trace and tracking down why the degree of parallelism of a statement was downgraded.

May 12, 2015

Parallel Query

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 7:22 pm BST May 12,2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run parallel. Note, however, where the PX SEND QC and PX COORDINATOR operations appear – lines 7 and 8, and above those lines we see the two scalar subqueries.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list. Is the manual “right but deceptive” or “wrong but giving the right impression”? Serialising on (just) the scalar subqueries can have a huge impact on performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1 and VW_SSQ_2 (SSQ presumably standing for “scalar subquery”). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL, so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit, but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour, in a way that we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but this isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature, so we could, in principle, disable just this one feature by setting that parameter to false; a more appropriate strategy would simply be to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both subqueries, or use the qb_name() hint to give the two subquery blocks names and then use the /*+ no_unnest(@my_qb_name) */ form of the hint at the top of the main query.
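Here, as a sketch, is what the hinted version might look like – the query block names ssq1 and ssq2 are labels I’ve invented for this illustration, and the commented-out alter session shows the hidden-parameter alternative:

-- alter session set "_optimizer_unnest_scalar_sq" = false;

select
        /*+
                parallel(t1 2)
                no_unnest(@ssq1)
                no_unnest(@ssq2)
        */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism: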

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operator, EXPRESSION EVALUATION, at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at lines 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave is running its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel, the producer is QC in all three of the first table queues (the full scan and broadcast of tables t1, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries – you have to look at the output from SQL trace (or similar) and check the individual Rowsource Execution Statistics for the slaves and the coordinator to see which process actually ran the subqueries.
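If you do want to make that check, here’s a sketch of one possible approach: enable extended SQL trace in the session before running the query (the event setting is inherited by the PX slaves), then examine the STAT lines in each trace file. Note that the tracefile_identifier only marks the co-ordinator’s file; the slave trace files have to be found by timestamp or content:

alter session set tracefile_identifier = 'px_ssq';
alter session set events '10046 trace name context forever, level 8';

-- run the test query here, then:

alter session set events '10046 trace name context off';

-- the query co-ordinator and each PX slave writes its own trace file
-- in the diagnostic trace directory; the STAT lines show which process
-- executed the scalar subqueries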

 

May 11, 2015

Parallel Execution

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:16 am BST May 11,2015

This is another little reference list I should have created some time ago. It covers a series of posts on interpreting parallel execution plans and understanding where the work happens.

I may add further links to this page in the future relating to other aspects of parallel execution.

 

February 12, 2015

Parallel rownum

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:27 am BST Feb 12,2015

It’s easy to make mistakes, or overlook defects, when constructing parallel queries – especially if you’re a developer who hasn’t been given the right tools to make it easy to test your code. Here’s a little trap I came across recently that’s probably documented somewhere, which could be spotted easily if you had access to the OEM SQL Monitoring screen, but would be very easy to miss if you didn’t check the execution plan very carefully. I’ll start with a little script to generate some data:


create table t1 nologging
as
select * from all_objects where rownum <= 50000
;

insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create table t2 as select * from t1;
alter table t2 add id number(10,0);

All I’ve done is create some data – 800,000 rows – and then create a table to copy it to; and while I copy it I’m going to add a temporary id to the rows, which I’ll do with a call to rownum; and since there’s a lot of data I’ll use parallel execution:


alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.* , rownum from t1;

For the purposes of experiment and entertainment I’ve done something a little odd by supplying two possible degrees of parallelism, but this lets me ask the question: will this statement run parallel 3, parallel 4, both of the above, or neither? (You may assume that I have parallel execution slaves available when the statement runs.)

The answer is both – because that rownum does something nasty to the execution plan (I didn’t include the 50,000 limit in my first test, which is why the plan reports 993K rows instead of 800,000):


--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT           |          |   993K|    92M|  1076   (1)| 00:00:13 |        |      |            |
|   1 |  PX COORDINATOR            |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ20001 |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT          | T2       |       |       |            |          |  Q2,01 | PCWP |            |
|   4 |     PX RECEIVE             |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | PCWP |            |
|   5 |      PX SEND ROUND-ROBIN   | :TQ20000 |   993K|    92M|  1076   (1)| 00:00:13 |        | S->P | RND-ROBIN  |
|   6 |       COUNT                |          |       |       |            |          |        |      |            |
|   7 |        PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |         PX SEND QC (RANDOM)| :TQ10000 |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | P->S | QC (RAND)  |
|   9 |          PX BLOCK ITERATOR |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------

See that “P->S” (parallel to serial) at operation 8. The select statement runs in parallel (degree 4) to scan the data, and then sends it all to the query co-ordinator to supply the rownum; then the query co-ordinator re-distributes the data (including the rownum) to another set of slaves (S->P) to do the parallel (degree 3) insert. The P->S at line 2 shows the parallel execution slaves passing details to the query co-ordinator of the private segments that they have created, so that the query co-ordinator can stitch the segments together into a single data segment for the table. (If you watch closely you’ll see the query co-ordinator doing a few local writes as it tidies up the header blocks of those segments.)

There are two threats in this rownum detail. The first, of course, is that the operation essentially serialises through the query co-ordinator, so it’s going to take longer than you might expect; secondly, an accident of this type is typically going to allocate twice as many parallel execution slaves as you might have expected – the select and the insert are two separate data flow operations (note how the Name column shows :TQ1xxxx and :TQ2xxxx), each gets its own slave set, and both sets of slaves are held for the duration of the statement. If this statement is demanding twice the slaves it should be using, then you may find that other statements that start running at the same time get their degree of parallelism downgraded because you’ve run out of PX slaves. Although the rownum solution is nice and clean – it requires no further infrastructure – you probably need to introduce a sequence (with a large cache) to get the same effect without losing parallelism; see the sketch below.
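For example (a sketch – t2_seq is a name invented for this illustration, and the cache size would need to be chosen to match the load; the values supplied by nextval will be unique but won’t arrive in any tidy order across the slaves):

create sequence t2_seq cache 10000;

alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.*, t2_seq.nextval from t1;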

If you look at v$pq_tqstat after running this statement the results are a little disappointing – there are a few problems connecting lines from the plan with rows in the view – here’s my original output (and you’ll now see why I chose to have two different degrees of parallelism):


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0

         2          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                                             1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 QC                  993314  113479500        125         65           0

                    1 Consumer               1 QC                       3        393          2          1           0

The first problem is that the DFO_number reported in the view doesn’t match with the :TQ1xxxx and :TQ2xxxx reported in the plan – the parallel 4 bit is the select, which is covered by :TQ10000, but it’s listed under DFO_Number = 2 in the view, and the insert is the parallel 3 bit, which is covered by :TQ20000 and :TQ20001 but listed under DFO_Number = 1.

More confusing, potentially, is that all the appearances of the query coordinator have been assigned to DFO_Number = 2. Ignoring the fact that the DFO_Number column switches the 1 and 2 from the plan, what we should see is as follows:

  • The consumer at line 16 is consuming from the 4 producers at lines 11 – 14.
  • The producer at line 15 is producing FOR the 3 consumers at lines 3 – 5
  • The consumer at line 18 is consuming from the producers at lines 7 – 9

Ideally (including the correction for the DFO_Number) I think the view content should be as follows:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                      Consumer               1 QC                  993314  113479500        125         65           0

         2          0 Producer               1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0
                      Consumer               1 QC                       3        393          2          1           0

Just don’t ask me why the waits and timeouts for the QC as producer seem to be counting backwards from 2^32.

November 21, 2014

Parallel Costs

Filed under: CBO,Oracle,Parallel Execution — Jonathan Lewis @ 11:36 am BST Nov 21,2014

While creating a POC of a SQL rewrite recently I received a little surprise as I switched my query from serial execution to parallel execution and saw the optimizer’s estimated cost increase dramatically. I’ll explain why in a moment, but it made me think it might be worth setting up a very simple demonstration of the anomaly. I created a table t1 by copying view all_source – which happened to give me a table with about 100,000 rows and 1117 blocks – and then ran the query ‘select max(line) from t1;’, repeating the query with a /*+ parallel(t1 2) */ hint.
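Reconstructed from that description, the test was essentially this (a sketch – I haven’t shown the stats collection, and the original table creation may have differed in detail):

create table t1 as select * from all_source;

select max(line) from t1;

select /*+ parallel(t1 2) */ max(line) from t1;

From 11.2.0.4 here are the two execution plans I got: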


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   153   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 99173 |   387K|   153   (4)| 00:00:01 |
---------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     4 |  1010   (1)| 00:00:05 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     4 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     4 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     4 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

It seems that the optimizer thinks that running the query parallel 2 will take five times as long as running it serially! Is it, perhaps, some cunning fix to the optimizer that is trying to allow for process startup time for the parallel execution slaves? Is it a bug? No – it’s just that I hadn’t got around to setting my system stats and they were complete garbage thanks to various other tests I had been running over the previous couple of weeks. Critically, I had not adjusted the “parallel-oriented” statistics to be consistent with the “serial-oriented” stats.

Here, from the 10053 trace file for the parallel run, is the section on the Single Table Access costing, together with the system stats and base statistical information:


SYSTEM STATISTICS INFORMATION
-----------------------------
  Using WORKLOAD Stats
  CPUSPEED: 1110 millions instructions/sec
  SREADTIM: 4.540000 milliseconds
  MREADTIM: 12.440000 millisecons
  MBRC:     21 blocks
  MAXTHR:   3000000 bytes/sec
  SLAVETHR: 1000000 bytes/sec

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: T1  Alias: T1
    #Rows: 99173  #Blks:  1117  AvgRowLen:  76.00  ChainCnt:  0.00
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  Table: T1  Alias: T1
    Card: Original: 99173.000000  Rounded: 99173  Computed: 99173.00  Non Adjusted: 99173.00
  Access Path: TableScan
    Cost:  151.13  Resp: 1010.06  Degree: 0
      Cost_io: 147.00  Cost_cpu: 20826330
      Resp_io: 1007.76  Resp_cpu: 11570183

I’m going to walk through the optimizer’s calculations that got the serial I/O cost (cost_io: 147.00) and the parallel I/O cost (Resp_io: 1007.76), but before I do that I’ll point out how inconsistent some of the system stats are. The multiblock read time (mreadtim) is 12.44 milliseconds, to read an average of 21 blocks (mbrc) which, converting to bytes per second, means (21 * 8192) * 1000/12.44 = 13,828,938 bytes per second; but the I/O rate for a single parallel execution slave (slavethr) is only 1,000,000 bytes per second – which means a single serial session can (apparently) tablescan nearly 14 times as fast as an individual parallel execution slave. It’s not surprising that somehow the optimizer thinks a serial tablescan will be faster than parallel 2 – but let’s check exactly how the arithmetic falls out.

Serial:

  • #Blks: 1117, MBRC = 21 => read requests = 1117/21 = 53.19
  • sreadtim = 4.54 milliseconds, mreadtim = 12.44 milliseconds = 2.74 * sreadtim
  • Cost = 53.19 * 2.74 + 1 (_tablescan_cost_plus_one = true) = 146.74 — which looks close enough.

Parallel:

  • #Blks: 1117, block size = 8KB => total I/O requirement = 9,150,464 bytes
  • slavethr: 1,000,000 bytes/sec, degree 2 => total throughput 2,000,000 bytes/sec => elapsed I/O time = 4.575232 seconds
  • sreadtim = 4.54 milliseconds => cost = 4.575232 / 0.00454 = 1007.76 QED.

Two final thoughts:

First, if I increase the degree of parallelism to 3 the cost drops to 673 (671.84 plus a bit of CPU); if I increase the degree any further the cost doesn’t drop any further – that’s because the maxthr (maximum throughput) is 3,000,000. The optimizer uses the smaller of maxthr and (degree * slavethr) in the parallel arithmetic.
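As a cross-check on that claim, repeat the parallel arithmetic at degree 3:

  • throughput = least(maxthr, 3 * slavethr) = least(3,000,000, 3,000,000) = 3,000,000 bytes/sec
  • elapsed I/O time = 9,150,464 / 3,000,000 = 3.050155 seconds
  • cost = 3.050155 / 0.00454 = 671.84 – matching the I/O component reported above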

Secondly, and most importantly, the way the optimizer behaves will be affected by the various parameters relating to parallelism, so here’s a list of the relevant settings on the instance I was using when creating this example:


SQL> show parameter parallel

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      LOW
parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_execution_message_size      integer     16384
parallel_force_local                 boolean     FALSE
parallel_instance_group              string
parallel_io_cap_enabled              boolean     FALSE
parallel_max_servers                 integer     80
parallel_min_percent                 integer     0
parallel_min_servers                 integer     0
parallel_min_time_threshold          string      AUTO
parallel_server                      boolean     FALSE
parallel_server_instances            integer     1
parallel_servers_target              integer     32
parallel_threads_per_cpu             integer     2
recovery_parallelism                 integer     0

Note, particularly, that I have not enabled parallel_automatic_tuning.

For further details on parallel costs and the cost/time equivalence that the optimizer uses in recent versions, here’s a link to an article by Randolf Geist.

November 12, 2014

Parallel Fun

Filed under: Execution plans,Oracle,Parallel Execution,subqueries — Jonathan Lewis @ 4:42 pm BST Nov 12,2014

As I write, there’s an ongoing thread on Oracle-L that started with the (paraphrased) question: “I’ve got this query that returns 7 million rows; when I change it to ‘select count(*)’ it returns in 4 seconds but when we display the full result set on screen it takes hours, and every second or two the screen pauses; how do I make it go faster.”

The general rapid response was: “You shouldn’t be running 7M rows to a screen – the time is the time for the network traffic and display.”

The first part of the statement is right – the second part is quite likely to be wrong and there’s a very strong hint in the question that makes me say that, it’s the “pauses every second or two”. Of course we don’t know what the OP isn’t telling us, and we don’t know how accurate he is in what he is telling us, so any ideas we have may be completely wrong. For example, we haven’t been given any idea of how long a “pause” is, we don’t really know how accurate that “second or two” might be and whether “every” is an exaggeration, and maybe the query is returning CLOB columns (and that could make a big difference to what you can do to improve performance).

If we take the statement at face value, though, there is one very obvious inference: although some of the time will be due to network traffic time, most of the time is probably due to Oracle doing something expensive for a significant fraction of the rows returned. The pattern of activity probably looks like this:

  • client: call server to fetch next array of rows
  • server: spend some time populating array  — this is where the client sees a pause
  • client: display result array
  • client: call server to fetch next array of rows
  •  etc…

Here’s a trivial example:

connect / as sysdba
set arraysize 500
set pagesize 40

select
        o1.spare1 ,
        (
        select  max((ctime))
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;

On my laptop, running an instance of 11.2.0.4 with about 80,000 rows in obj$ (and a lot of them owned by SYS), I can count seconds and find that (approximately) I alternate between one second watching results scrolling up the screen and one second waiting as the server generates the next 500 rows.

Of course it’s possible to argue that the problem really is the network and nothing but the network struggling to cope with the never-ending stream of little packets produced by 7M rows. Could there be a choke point that causes the data to stop and start with great regularity? Maybe – but previous experience says probably not. I have experienced bad network problems in the past, but when they’ve occurred I’ve always observed extremely random stop/go behaviour. The regularity implied in the question makes an Oracle-based problem seem far more likely.

Conveniently a couple of people asked for more clues – like the query text and the execution plan; even more conveniently the OP supplied the answers in this response. Since the email format makes them a little hard to read I’ve copied them here:


SELECT  bunch of stuff.....,

        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr1.RELATED_SID
                        ||
                        ',')
                ORDER BY sr1.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr1
                WHERE   sr1.SID                    = slv.SID
                        AND sr1.RELATIONSHIP_LEVEL = '1'
                GROUP BY sr1.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL1,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr2.RELATED_SID
                        ||
                        ',')
                ORDER BY sr2.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr2
                WHERE   sr2.SID                    = slv.SID
                        AND sr2.RELATIONSHIP_LEVEL = '2'
                GROUP BY sr2.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL2,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr3.RELATED_SID
                        ||
                        ',')
                ORDER BY sr3.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr3
                WHERE   sr3.SID                    = slv.SID
                        AND sr3.RELATIONSHIP_LEVEL = '3'
                GROUP BY sr3.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL3,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr4.RELATED_SID
                        ||
                        ',')
                ORDER BY sr4.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr4
                WHERE   sr4.SID                    = slv.SID
                        AND sr4.RELATIONSHIP_LEVEL = '4'
                GROUP BY sr4.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL4,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr5.RELATED_SID
                        ||
                        ',')
                ORDER BY sr5.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr5
                WHERE   sr5.SID                    = slv.SID
                        AND sr5.RELATIONSHIP_LEVEL = '5'
                GROUP BY sr5.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL5
FROM    service_lookup slv
        LEFT JOIN service_location sl
        ON      sl.service_location_id = slv.service_location_id;

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1570133209

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |  7331K|  5593M|  1877   (5)| 00:00:01 |        |      |            |
|   1 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   2 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10000             |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWC |            |
|*  5 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWP |            |
|   6 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   7 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   8 |    PX SEND QC (RANDOM)           | :TQ20000             |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   9 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWC |            |
|* 10 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWP |            |
|  11 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  12 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  13 |    PX SEND QC (RANDOM)           | :TQ30000             |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | P->S | QC (RAND)  |
|  14 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWC |            |
|* 15 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWP |            |
|  16 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  17 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  18 |    PX SEND QC (RANDOM)           | :TQ40000             |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | P->S | QC (RAND)  |
|  19 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWC |            |
|* 20 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWP |            |
|  21 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  22 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  23 |    PX SEND QC (RANDOM)           | :TQ50000             |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | P->S | QC (RAND)  |
|  24 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWC |            |
|* 25 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWP |            |
|  26 |  PX COORDINATOR                  |                      |       |       |            |          |        |      |            |
|  27 |   PX SEND QC (RANDOM)            | :TQ60002             |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | P->S | QC (RAND)  |
|* 28 |    HASH JOIN RIGHT OUTER BUFFERED|                      |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  29 |     PX RECEIVE                   |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,02 | PCWP |            |
|  30 |      PX SEND HASH                | :TQ60000             |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | P->P | HASH       |
|  31 |       PX BLOCK ITERATOR          |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWC |            |
|  32 |        TABLE ACCESS STORAGE FULL | SERVICE_LOCATION     |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWP |            |
|  33 |     PX RECEIVE                   |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  34 |      PX SEND HASH                | :TQ60001             |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | P->P | HASH       |
|  35 |       PX BLOCK ITERATOR          |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWC |            |
|  36 |        TABLE ACCESS STORAGE FULL | SERVICE_LOOKUP       |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------

We have a simple two-table outer join, and five scalar subqueries in the select list. (Not being very familiar with the various XML calls I had no idea what the scalar subqueries were doing, or how they produced a result, beyond the fact that they were querying and aggregating multiple rows. In fact the combination of calls does much the same as listagg(), though it allows for a CLOB result (which could be part of the performance problem, of course) rather than being limited to a varchar2() result).

Would you like to guess at this point why I constructed my demonstration query against obj$ the way I did when presenting the idea of high cost-per-row queries as a reason for regular pauses in the output? The execution plan matched one of my two initial guesses about what the query was going to look like. When you “select count(*) from {this query}”, the optimizer will factor out the scalar subqueries and only have to count the result set from the hash join – and it might even manage to use a couple of parallel index fast full scans to get that result rather than doing the tablescans. When you run the query you have to run the scalar subqueries.
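The OP’s count(*) version isn’t shown, but here’s a minimal sketch of the principle, with a simple max() (my stand-in) replacing one of the XML aggregations: the scalar subquery in the inline view isn’t referenced by the outer count(*), so the optimizer can discard it and just count the rows from the join:

select count(*)
from    (
        select
                slv.sid,
                (
                select  max(sr1.related_sid)    -- stand-in for the XML aggregation
                from    service_relationship sr1
                where   sr1.sid = slv.sid
                and     sr1.relationship_level = '1'
                )       lvl1
        from    service_lookup slv
        left join service_location sl
        on      sl.service_location_id = slv.service_location_id
        )
;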

If we trust the statistics, we have 5 subqueries to run for each row of the hash join – and the hash join is predicted to return 7.3 million rows. Given that the subqueries are all going to run parallel tablescans against a fairly large table (note – the cost of the tablescans on SERVICE_RELATIONSHIP is 368, compared to a cost of 366 for the tablescan of SERVICE_LOCATION, which returns 3.1M rows) that’s an awful lot of work for each row returned – unless we benefit from an enormous amount of scalar subquery caching.

Here’s another performance threat that the plan shows, though: notice where the PX SEND QC operation appears – that means the PX slaves send their (7M) rows to the Query Co-ordinator and the QC is responsible for doing all the work of running the scalar subqueries. Another interesting little threat visible in the plan shows up in the TQ column – the plan uses six “data flow operations” (the original naming convention; the name changed some time ago but survives in the column names of v$pq_tqstat). In principle each DFO could allocate two sets of PX slaves (and every DFO could have a different degree of parallelism); in this example DFO number 6 (the driving hash join) uses two sets of slaves, and the other five DFOs (the scalar subqueries) use a single set each. The upshot of this is that if the default degree of parallelism in play is N this query will allocate 7N parallel query slaves (five single sets plus one pair of sets, each set of N slaves). It gets a little nastier than that, though (based on checking the output from v$sql_plan_monitor), because each time one of the scalar subqueries runs Oracle seems to allocate and deallocate the slaves that are supposed to run it – which is probably going to cause some contention if there are other parallel queries trying to run at the same time.
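If you want to watch that allocation and deallocation happening, one easy option (my suggestion, not something from the thread) is to poll v$px_process from a second session while the query is running and watch the status of the slave processes change:

select  server_name, status, sid, spid
from    v$px_process
order by
        server_name
;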

Optimisation

So what could you do with this query ? It depends on how much change you want to make to the code.

It’s possible that an index on service_relationship(relationship_level, sid) – with compress 1 – might help if it’s very precise, and if the target table stays in the buffer cache for the duration of the query – but, in the absence of scalar subquery caching, that could still leave the query co-ordinator executing 35 million (5 queries x 7 million rows) subqueries in a serialised process.
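For reference, the index in question would look something like this (the index name is mine):

create index sr_lvl_sid
        on service_relationship(relationship_level, sid)
        compress 1
;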

A better bet may be to convert from subqueries to joins – remembering that the listagg() / xmlserialize() calls will require you to aggregate (which means sorting in this case) an estimated 25 rows per driving row per relationship_level; in other words you may need to sort 7M * 125 = 875M rows – but at least you could do that in parallel, and there’s always the possibility that the estimated 25 drops off as you work through the different levels. You could choose to do 5 outer hash joins or (as Iggy Fernandez outlined in the thread) you could do a single outer join with a decode on the relationship_level. Another variation on this theme (which would probably have a plan showing ‘join then aggregate’) would be to ‘aggregate then join’. It’s possible that creating a non-mergeable inline view for the 5 values of relationship_level from a single table access, aggregating it to produce the five required columns, then using the result in an outer join, would be the most efficient option. In the absence of a detailed understanding of the data volume and patterns it’s hard to make any prediction of which strategy would work best.
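To give a flavour of the ‘aggregate then join’ option, here’s a minimal sketch (my code, not the OP’s) showing just the first two levels, using listagg() in place of the XML calls – so it’s limited to a varchar2() result – with a no_merge hint to keep the inline view non-mergeable:

select
        slv.sid,
        agg.level1_list,
        agg.level2_list
from    service_lookup slv
left join
        service_location sl
on      sl.service_location_id = slv.service_location_id
left join
        (
        select  /*+ no_merge */
                sid,
                listagg(case when relationship_level = '1' then related_sid end, ',')
                        within group (order by related_sid)     level1_list,
                listagg(case when relationship_level = '2' then related_sid end, ',')
                        within group (order by related_sid)     level2_list
        from    service_relationship
        where   relationship_level in ('1','2')
        group by
                sid
        )       agg
on      agg.sid = slv.sid
;

The case expressions return null for the non-matching levels and listagg() ignores nulls – which is what lets a single pass of service_relationship produce all the required columns before the join.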

Footnote:

I may be wrong in my analysis of this problem. When I first saw the question the reason for the performance pattern suggested an “obvious” design error in either the SQL or the infrastructure, and when I saw that the query and execution plan matched my prediction it became very hard for me to think that there might be some other significant cause.

There were a couple of interesting details in the execution plan that made me pursue the problem a little more. In the first case I built a very simple model to get an estimate of the time needed to display 7M rows of a reasonable width in SQL*Plus running across a typical LAN (my estimate was in the order of 45 minutes – not hours); then I spent a little more time (about 10 minutes) to build a model that reproduced the key features of the execution plan shown.

I then spent two or three hours playing with the model, and I’ll be writing a further blog with some of the results later on. One detail to carry away today, though, is that in 12c Oracle can do a new form of subquery unnesting which transformed the query from its 5 scalar subquery form into the seven table join form that was one of the suggestions made on the thread; even more interestingly, if I blocked the unnesting (to force the subquery execution) Oracle 12.1.0.2 came up with a new operator (EXPRESSION EVALUATION) that allowed it to run the subqueries from the PX slaves before passing the results to the query co-ordinator – in other words eliminating the serialisation point.
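(For anyone who wants to experiment: the post doesn’t say how the unnesting was blocked, but the obvious mechanism is a no_unnest hint inside each scalar subquery – here’s the idea applied to my earlier obj$ example:)

select
        o1.spare1 ,
        (
        select  /*+ no_unnest */
                max(ctime)
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;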

To be continued …

July 27, 2014

Parallel Plans

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 8:39 pm BST Jul 27,2014

This is the directory for a short series I wrote discussing how to interpret parallel execution plans in newer versions of Oracle.

For other aspects of parallel execution the best, and most wide-ranging, source is Randolf Geist’s blog, which contains so much material that I’ve had to create a catalogue for my own convenience. All the following links point to work by Randolf.

Parallel Data Skew

One of the awkward problems you can encounter with parallel execution is data skew – which has the potential to make just one slave in a set do (almost) all the work, hence reducing the performance to something close to serial execution times. Randolf has written a series of articles on Parallel Skew that has been published by AllthingsOracle over the last few months.

XPLAN_ASH

Parallel Execution New Features (12c)

March 10, 2014

Parallel Execution – 5

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 1:30 pm BST Mar 10,2014

In the last article (I hope) of this series I want to look at what happens when I change the parallel distribution method on the query that I’ve been using in my previous demonstrations.  This was a query first introduced in a note on Bloom Filters (opens in a separate window) where I show two versions of a four-table parallel hash join, one using the broadcast distribution mechanism throughout, the other using the hash distribution method. For reference you can review the table definitions and plan (with execution stats) for the serial join in this posting (also opens in a separate window).

To change distribution methods from the broadcast example to the hash example I’ve simply changed a few hints in my code. Here are two sets of hints showing what I’ve done; the first is a repeat from the third article showing the broadcast example, the second shows the small change needed to get the hash example:


/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
    full(t3) parallel(t3, 2)
    monitor
*/

/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 hash hash)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 hash hash)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 hash hash)
    full(t3) parallel(t3, 2)
    monitor
*/

Because of the combination of leading() hint with the use_hash() and swap_join_inputs() hints the plan WILL still build in-memory hash tables from t1, t2, and t3 and it WILL still probe each hash table in turn with the rows (that survive) from t4; but the order of activity in the hash distribution plan will be dramatically different from the order in the serial and parallel broadcast plans where the order in which Oracle actually built the in-memory hash tables was t3, t2, t1.

Here – with a little cosmetic adjustment – is the parallel execution plan using hash distribution on 11.2.0.4, captured from memory with rowsource execution stats enabled (the 12c plan would report “PX SEND HYBRID HASH” operators with an associated “STATISTICS COLLECTOR” operator, showing that adaptive execution was a possibility – with three points at which the plan might switch from hash distribution to broadcast):


--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |          |      1 |   437 (100)|          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   1 |  SORT AGGREGATE                    |          |      1 |            |          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   2 |   PX COORDINATOR                   |          |      1 |            |          |        |      |            |      2 |00:00:00.08 |      16 |      5 |
|   3 |    PX SEND QC (RANDOM)             | :TQ10006 |      0 |            |          |  Q1,06 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE                 |          |      2 |            |          |  Q1,06 | PCWP |            |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN                     |          |      2 |   437   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE           | :BF0000  |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE                  |          |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH               | :TQ10004 |      0 |     2   (0)| 00:00:01 |  Q1,04 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR         |          |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 10 |           TABLE ACCESS FULL        | T3       |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  11 |       PX RECEIVE                   |          |      2 |   435   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH                | :TQ10005 |      0 |   435   (3)| 00:00:03 |  Q1,05 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE            | :BF0000  |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|* 14 |          HASH JOIN BUFFERED        |          |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |    630 |00:00:00.01 |       0 |      0 |
|  15 |           JOIN FILTER CREATE       | :BF0001  |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  16 |            PX RECEIVE              |          |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  17 |             PX SEND HASH           | :TQ10002 |      0 |     2   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  18 |              PX BLOCK ITERATOR     |          |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 19 |               TABLE ACCESS FULL    | T2       |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  20 |           PX RECEIVE               |          |      2 |   432   (3)| 00:00:03 |  Q1,05 | PCWP |            |    632 |00:00:00.01 |       0 |      0 |
|  21 |            PX SEND HASH            | :TQ10003 |      0 |   432   (3)| 00:00:03 |  Q1,03 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  22 |             JOIN FILTER USE        | :BF0001  |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |    632 |00:00:00.09 |       0 |      0 |
|* 23 |              HASH JOIN BUFFERED    |          |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.09 |       0 |      0 |
|  24 |               JOIN FILTER CREATE   | :BF0002  |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  25 |                PX RECEIVE          |          |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  26 |                 PX SEND HASH       | :TQ10000 |      0 |     2   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  27 |                  PX BLOCK ITERATOR |          |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 28 |                   TABLE ACCESS FULL| T1       |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  29 |               PX RECEIVE           |          |      2 |   427   (2)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.08 |       0 |      0 |
|  30 |                PX SEND HASH        | :TQ10001 |      0 |   427   (2)| 00:00:03 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  31 |                 JOIN FILTER USE    | :BF0002  |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.05 |    6044 |   6018 |
|  32 |                  PX BLOCK ITERATOR |          |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWC |            |  14700 |00:00:00.04 |    6044 |   6018 |
|* 33 |                   TABLE ACCESS FULL| T4       |     26 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.04 |    6044 |   6018 |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T3"."ID"="T4"."ID3")
  10 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T3"."SMALL_VC")=1 OR TO_NUMBER("T3"."SMALL_VC")=2 OR TO_NUMBER("T3"."SMALL_VC")=3))
  14 - access("T2"."ID"="T4"."ID2")
  19 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T2"."SMALL_VC")=1 OR TO_NUMBER("T2"."SMALL_VC")=2 OR TO_NUMBER("T2"."SMALL_VC")=3))
  23 - access("T1"."ID"="T4"."ID1")   28 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T1"."SMALL_VC")=1 OR TO_NUMBER("T1"."SMALL_VC")=2 OR TO_NUMBER("T1"."SMALL_VC")=3))
  33 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T4"."ID1"))
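(For anyone reproducing this: the capture mechanism isn’t shown above, but a typical approach is to enable rowsource execution statistics for the session and then pull the plan from memory – avoiding the “LAST” format modifier for parallel queries so that the statistics are summed across all the slave processes instead of reporting only the query co-ordinator:)

alter session set statistics_level = all;

-- run the query, then:
select * from table(dbms_xplan.display_cursor(null, null, 'allstats'));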

There are a couple of significant points that are very easy to point out in this plan. First, we have a number of lines which are “BLOOM FILTER CREATE/USE” lines that did not appear in the broadcast plan; second, we can only see one sys_op_bloom_filter() in the predicate section rather than three (don’t worry, it’s – partly – a reporting defect); finally, we have seven virtual tables (table queues :TQnnnnn) in this plan rather than four, and those virtual tables seem to be scattered rather more randomly around the plan.

To make it easier to understand what’s happened with a parallel execution plan, I usually also dump out the contents of v$pq_tqstat after running the query – so here’s the result after running the above:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         30         16           0
                                             1 P001                     1         55         26         14           0

                    1 Producer               1 P002                  1476      35520          2          1           0
                                             1 P003                 13224     317880          1          0           0
                      Consumer               1 P000                  9800     235584         20         14           0
                                             1 P001                  4900     117816         20         14           0

                    2 Producer               1 P000                     3         69          0          0           0
                                             1 P001                     0         48          0          0           0
                      Consumer               1 P002                     2         62         33         19           0
                                             1 P003                     1         55         32         19           0

                    3 Producer               1 P000                   422       9754          0          0           0
                                             1 P001                   210       4878          0          0           0
                      Consumer               1 P002                   420       9708         33         19           0
                                             1 P003                   212       4924         32         18           0

                    4 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         42         20           0
                                             1 P001                     1         55         39         15           0

                    5 Producer               1 P002                    18        444          0          0           0
                                             1 P003                     9        246          0          0           0
                      Consumer               1 P000                    18        444         41         20           0
                                             1 P001                     9        246         39         16           0

                    6 Producer               1 P000                     1         60          0          0           0
                                             1 P001                     1         60          0          0           0
                      Consumer               1 QC                       2        120          1          0           0
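(The report comes from a query along the following lines – the column list and ordering are my reconstruction, chosen to match the layout above:)

select  dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;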

So let’s work our way through the execution plan – if you want to put the plan and my comments side by side, this link will re-open this article in a second window.

Given the set of hints, and the intent I expressed at the start of the series, we hope to see Oracle building an in-memory hash table from each of t3, t2 and t1 in that order, following which it will scan t4, probe t1, t2, and t3 in that order, and then aggregate the result.  Let’s check that using the parallel plan rule of “follow the table queues”.

Table queue 0 covers lines 26 – 28, we scan t1 and distribute it by hash.  We can see from the A-Rows column we found 3 rows and distributed them, and if we look at the output from v$pq_tqstat we find it matches – slaves 2 and 3 produced 3 rows, slaves 0 and 1 consumed 3 rows. Table queue 1 covers lines 30 – 33, we scan t4 and distribute it by hash. We can see from the A-Rows column we found 14,700 rows and distributed them, and again we can see the match in v$pq_tqstat – slaves 2 and 3 produced 14,700 rows and distributed them to slaves 0 and 1. But there’s an oddity here, and things start to get messy: from the predicate section we can see that we applied a Bloom filter on the ID1 column on the data we got from the tablescan, and the plan itself shows a Bloom filter (:BF0002) being used at line 31, but that Bloom filter is created at line 24 of the plan and line 24 has been associated with table queue 3. Now I know (because I constructed the data) that a perfect filter has been created and used at that point because 14,700 rows is exactly the volume of data that should eventually join between tables t1 and t4.  It’s reasonable, I think, to say that the boundary between table queues 0 and 3 is a little blurred at lines 24/25 – the slaves that are going to populate table queue 3 are the ones that created the Bloom filter, but they’re not going to populate table queue 3 just yet.

So let’s move on to table queue 2. This covers lines 17-19 (looking at the TQ column) except I’m going to assume the same blurring of boundaries I claimed for table queue 0 – I’m going to say that table queue 2 expands into lines 15-19 (bringing in the PX RECEIVE and JOIN FILTER CREATE (:BF0001)). So our next step is to scan and distribute table t2, and build a Bloom filter from it. Again we look at v$pq_tqstat and see that in this case it’s slaves 0 and 1 which scan the table and distribute 3 rows to slaves 2 and 3, and we assume that slaves 2 and 3 will send a Bloom filter back to slaves 0 and 1.

Now we can move on to table queue 3: line 21 writes to table queue 3 by using lines 22, 23, 24, 25, and 29 according to the TQ column (but thanks to the blurring of the boundaries lines 24 and 25 were used “prematurely” to create the Bloom filter :BF0002 describing the results from table t1). So lines 24/25 read table queue 0 and built an in-memory hash table, simultaneously creating a Bloom filter and sending it back to slaves 2 and 3; then line 23 did a HASH JOIN BUFFERED, which means it copied the incoming data from table queue 1 (slaves 2 and 3, table t4) into a buffer and then used that buffer to probe its in-memory hash table and do the join; then line 22 applied a Bloom filter (:BF0001) to the result of the hash join although the filter won’t appear in the predicate section until version 12.1.0.1. Notice that line 23 (the join) produced 14,700 rows, demonstrating that our previous filter was a perfect filter, and then line 22 filtered out all but 632 rows. (Again, because I constructed the data I can tell you that the second Bloom filter has also worked with 100% accuracy – although v$pq_tqstat seems to show an extra 2 rows which I can’t account for and which don’t appear in the trace file).

So here’s another problem – we’re using another Bloom filter that we haven’t yet (apparently) created unless we accept my assumption of the blurring of the boundary at lines 15 and 16, where the plan shows two lines associated with table queue 5 even though I need them to be associated with table queue 2 so that they can produce the Bloom filter needed by table queue 3. Again, by the way, we can do the cross-check with TQ_ID 3 of v$pq_tqstat and see that slaves 0 and 1 produced 632 rows and sent them to slaves 2 and 3.

Before continuing, let’s rewrite the action so far as a series of bullet points:

  • Slaves 2,3 scan t1 and distribute to slaves 0,1
  • Slaves 0,1 build an in-memory hash table and a Bloom filter (:BF0002) for t1, and send the filter to slaves 2,3
  • Slaves 2,3 scan t4, use the Bloom filter (:BF0002) to eliminate data (luckily 100% perfectly) and distribute the remaining rows to slaves 0,1
  • Slaves 0,1 buffer the incoming data
  • Slaves 0,1 scan t2 and distribute to slaves 2,3
  • Slaves 2,3 build an in-memory hash table for the results from t2 and a Bloom filter (:BF0001) for t2, and send the filter to slaves 0,1
  • Slaves 0,1 use the buffered t4 to probe the in-memory hash of t1 to do the join, testing join results against the Bloom filter (:BF0001) for t2, and distributing the surviving rows to slaves 2,3

The pattern of the last four steps will then repeat for the next hash join – and for longer joins the pattern will repeat up to, but excluding, the last join.

  • Slaves 2,3 buffer the incoming data (the result of joining t4 and t1) – the buffering is implied by line 14 (which is labelled as an input for table queue 5)
  • Slaves 2,3 scan t3 and distribute to slaves 0,1 (reading lines 8,9,10 of the plan), cross-checking with TQ_ID 4 of v$pq_tqstat
  • Slaves 0,1 build an in-memory hash table for the results from t3 and a Bloom filter (:BF0000) for t3, and send the filter to slaves 2,3 (“sharing” lines 6 and 7 from table queue 6)
  • Slaves 2,3 use the buffered results from (t4/t1) to probe the in-memory hash of t2 to do the join, testing join results against the Bloom filter (:BF0000) for t3, and distributing the surviving rows to slaves 0,1.

Again, we can check row counts – the hash join buffered at line 14 shows 630 rows coming from the hash join (i.e. the previous Bloom filter was perfect), and line 13 shows 27 rows surviving the final Bloom filter. Again my knowledge of the data tells me that the Bloom filter was a perfect filter. Cross-checking to TQ_ID 5 of v$pq_tqstat we see slaves 2 and 3 producing 27 rows and slaves 0 and 1 consuming them.

So at this point slaves 0,1 have an in-memory hash table for t3, and are receiving the filtered results of the join between t4, t1, and t2; the slaves have to join and aggregate the two data sets before forwarding a result to the query co-ordinator. Since the aggregation is a blocking operation (i.e. slaves 0,1 can’t start to send data to the co-ordinator until they’ve finished emptying virtual table 5 and have aggregated all the incoming data) they don’t have to use the “hash join buffered” mechanism, so the pattern for the final part of the plan changes.

Lines 5, 6, 7, 11 show us the hash join (not buffered) with its two inputs (although lines 6 and 7 have, of course, been mentioned once already as the source of the Bloom filter used at line 13). Then line 4 shows slaves 0 and 1 aggregating their results; line 3 shows them forwarding the results to the query co-ordinator, line 2 shows the query co-ordinator receiving the results and line 1 shows it aggregating across the slave results ready to send to the end-user.

It’s a bit complicated, and the constant jumping back and forth through the execution plan lines (especially for the “shared” usage of the Bloom filter creation lines) makes it quite hard to follow, so I’ve drawn up a Powerpoint slide to capture the overall picture:

[Figure: px_plan]

I’ve put the slaves 0 and 1 at the top of the picture, slaves 2 and 3 at the bottom, with the query co-ordinator in the middle at the right hand side. Time reads across the page from left to right, and that gives you the order in which data moves through table queues (and back, for Bloom filters). The annotations give you some idea of what data is moving. Note that I’ve used B1 to refer to the Bloom filter on table T1 (and ignored the numbering on Oracle’s :BFnnn entries). I’ve used red to highlight the data sets that are buffered, and put in curved arrows to show where the buffered data is subsequently brought back into play. I did try to add the various plan line numbers to the picture, but the volume of text made the whole thing incomprehensible – so I’ve left it with what I think is the best compromise of textual information and graphical flow.

I’ll just leave one final warning – if you want to reproduce my results, you’ll have to be careful about versions. I stuck with 11.2.0.4 as that’s the latest version of the most popular general release. There are differences in 12.1.0.1, and there are differences again if you try to emulate 11.2.0.4 by setting the optimizer_features_enable parameter in 12.1.0.1 back to the earlier version.

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm BST Mar 5,2014

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000 ;

-- collect stats, no histograms.
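-- a sketch of the sort of call implied by the comment above
-- (method_opt chosen to avoid histograms):
begin
	dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/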

select
  	/*+
  		leading(t1 t2)
 		parallel(t1 2)
 		parallel(t2 2)
 		use_hash(t2)
 	*/
 	t1.padding,
 	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of rows filtered from t1” (in this example, 2 x 1,000 = 2,000 rows).

After a slave in slave set 1 has received the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3,4, 8,9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately to the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1,2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things? With the same starting data and query, here’s the execution plan:

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changes to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication – there’s a pair of hints: pq_replicate and no_pq_replicate to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:


  /*+
      BEGIN_OUTLINE_DATA
      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "T2"@"SEL$1" BROADCAST NONE)
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      OPT_PARAM('_optimizer_cost_model' 'io')
      DB_VERSION('12.1.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice how the hint specifies table t2, not table t1!
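So if you wanted to block the replication for this query you would (counter-intuitively) have to aim the hint at t2. Here’s my guess at the simplest usage – an untested sketch rather than a verified example:

select
	/*+
		leading(t1 t2)
		use_hash(t2)
		no_pq_replicate(t2)
		parallel(t1 2)
		parallel(t2 2)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;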

Footnote

Here’s a little anomaly, and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable('11.2.0.4') */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.

February 28, 2014

Empty Hash

Filed under: Bugs,CBO,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:45 pm BST Feb 28,2014

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would take place even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider view point – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago (I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere – I can’t find it on my blog or old website). The problem can be demonstrated by this example:

(more…)

February 26, 2014

Parallel Execution – 4

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 1:58 pm BST Feb 26,2014

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

(more…)

December 8, 2013

Parallel Execution – 3

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:09 pm BST Dec 8,2013

It’s finally time to take a close look at the parallel versions of the execution plan I produced a little while ago for a four-table hash join. In this note I’ll examine the broadcast parallel distribution. First, here’s a list of the hints I’m going to use to get the effect I want:

	/*+
		leading(t4 t1 t2 t3)
		full(t4) parallel(t4, 2)
		use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
		full(t1) parallel(t1, 2)
		use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
		full(t2) parallel(t2, 2)
		use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
		full(t3) parallel(t3, 2)
		monitor
	*/

(more…)

November 5, 2013

Deadlock

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 6:14 am BST Nov 5,2013

There’s an interesting example of a deadlock on the OTN database forum:

DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
PS-00000001-00000011        92     423     S             33     128     S     X
BF-2ed08c01-00000000        33     128     S             92     423     S     X

(more…)

October 14, 2013

Parallel Execution – 2

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:29 pm BST Oct 14,2013

Since I’m going to write a couple of articles dissecting parallel execution plans, I thought I’d put up a reference post describing the set of tables I used to generate the plan from the previous post, and the query (with serial execution plan) that produced it. The setup is a simple star schema arrangement – which I’ve generated by creating three identical tables and then doing a Cartesian join across the three of them.

(more…)

October 13, 2013

Parallel Execution – 1

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:43 pm BST Oct 13,2013

When you read an execution plan you’re probably trying to identify the steps that Oracle went through to acquire the final result set so that you can decide whether or not there is a more efficient way of getting the same result.

For a serial execution plan this typically means you have to identify the join order, join methods and access methods together with the point at which each predicate was applied. Generally this isn’t too difficult, although subquery pushing (and a couple of presentation bugs relating to other uses of scalar subqueries) can cause a little confusion; and the difference between join order and the order of operation can be slightly obscured when considering hash joins.

Parallel execution plans are harder, though, because you really need to understand the impact of the order of operation, distribution mechanisms chosen, and (in recent versions of Oracle) the timing of the generation and use of Bloom filters. The topic is stunningly large and easy to describe badly; it’s also going to be easy to make generalisations that turn out to be untrue or (at least) sufficiently incomplete as to be misleading. Rather than attempting to cover the topic in one note, I think I’m going to end up writing two or three.

(more…)
