Oracle Scratchpad

May 12, 2015

Parallel Query

Filed under: Oracle, Parallel Execution — Jonathan Lewis @ 7:22 pm BST, May 12, 2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

rem
rem     Script:         parallel_scalar_subq.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jun 2013
rem
rem     Last tested
rem             12.1.0.2        New options
rem             11.2.0.4        Scalar Subqueries run by QC
rem

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run in parallel. Note, however, where the PX COORDINATOR and PX SEND QC operations appear – operations 7 and 8 – and that the two scalar subqueries sit above those lines in the plan.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list. Is the manual “right but deceptive” or “wrong but giving the right impression”? Serialising on (just) the scalar subqueries can have a huge impact on the performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1 and VW_SSQ_2 (would SSQ be short for “scalar subquery”, perhaps?). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL, so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit, but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour, as we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set optimizer_features_enable back to 11.2.0.4 (perhaps through the optimizer_features_enable() hint) and this would take us back to the original plan, but that isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature, so we could, in principle, disable just that one feature by setting the parameter to false; a more appropriate strategy, though, is simply to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both subqueries, or use the qb_name() hint to give the two subquery blocks names and then use the /*+ no_unnest(@my_qb_name) */ form of the hint at the top of the main query. Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism:
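As a sketch of the hinting options (the query block names ssq1 and ssq2 are names I’ve invented for illustration, not part of the original script):

```sql
rem  Option 1: hint each scalar subquery in place
rem  (select list fragment only - the rest of the query is unchanged)

        (select /*+ no_unnest */ max(v1) from ref1 where n1 = t2.n21)   ref_t2,
        (select /*+ no_unnest */ max(v1) from ref2 where n1 = t1.r1)    ref_t1,

rem  Option 2: name the subquery blocks with qb_name(), then block
rem  unnesting from the top of the main query

select
        /*+ parallel(t1 2) no_unnest(@ssq1) no_unnest(@ssq2) */
        ...
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21) ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1)  ref_t1,
        ...

rem  Option 3: disable the feature for the session via the hidden parameter
rem  (undocumented - check with Oracle support before using it in production)

alter session set "_optimizer_unnest_scalar_sq" = false;
```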

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operation, EXPRESSION EVALUATION, at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at operations 21 and 24. We are still running the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave runs its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change: the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.
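My script to report v$pq_tqstat is, in essence, just an ordered select against the view, run in the same session immediately after the parallel query completes. A minimal sketch (the column list and break formatting here are illustrative, not my exact script):

```sql
rem  Minimal sketch of a v$pq_tqstat report - the view is populated
rem  per-session, so run this immediately after the parallel query.

break on dfo_number skip 1 on tq_id skip 1

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;
```

The descending sort on server_type is just a convenience to list Producer rows ahead of Consumer rows for each table queue, matching the layout of the output below.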


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel distribution: the producer is QC in each of the first three table queues (the full scan and broadcast of tables driver, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g: you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries. You have to look at the output from SQL trace (or something similar) and examine the individual rowsource execution statistics for the slaves and co-ordinator to see which process actually ran the subqueries.

 

11 Comments

  1. Tanel Poder described this new 12c unnesting CBO enhancement in a little bit more detail: http://blog.tanelpoder.com/2013/08/13/oracle-12c-scalar-subquery-unnesting-transformation/

    Comment by Stefan Koehler — May 14, 2015 @ 12:33 pm BST

  2. […] Jonathan Lewis “Parallel Query” […]

    Pingback by Oracle SQL | 12c: New SQL PLAN OPERATIONS and HINTS — July 7, 2016 @ 7:53 pm BST

  3. Hi Jonathan,

    Did you have a chance to check whether or not every parallel slave process works on evaluating the scalar subqueries?
    I’m trying to make all the slaves work on that calculation but, judging from the statistics and the execution time of my query, only a single slave actually does the calculation.
    What can prevent the other slaves from working on the calculation of scalar subqueries in a select list?

    I put here just part of the SQL Monitor output.
    The entire report is available here https://docs.google.com/document/d/1Uq-cmIkRHlCzFQTuTekrvGAoK39s78hCqJe8Wdz1vQc/edit?usp=sharing

    Parallel Execution Details (DOP=16 , Servers Allocated=32)  
    ===============================================================================================================================================================  
    |      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    | Application | Concurrency | Buffer | Read  | Read  | Write | Write |     Wait Events      |  
    |                |       |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   |  Gets  | Reqs  | Bytes | Reqs  | Bytes |      (sample #)      |  
    ===============================================================================================================================================================  
    | PX Coordinator | QC    |         |      41 |      41 |          |        0.00 |             |    12M |       |     . |       |     . |                      |  
    | p000           | Set 1 |       1 |    0.40 |    0.40 |          |             |             |        |       |     . |       |     . |                      |  
    | p001           | Set 1 |       2 |    2.66 |    2.66 |          |             |             |        |       |     . |       |     . |                      |  
    | p002           | Set 1 |       3 |    2.08 |    2.08 |          |             |             |        |       |     . |       |     . |                      |  
    | p003           | Set 1 |       4 |    2.40 |    2.40 |          |             |             |        |       |     . |       |     . |                      |  
    | p004           | Set 1 |       5 |    0.39 |    0.39 |          |             |             |        |       |     . |       |     . |                      |  
    | p005           | Set 1 |       6 |    1.61 |    1.61 |          |             |             |        |       |     . |       |     . |                      |  
    | p006           | Set 1 |       7 |      55 |      55 |     0.15 |             |             |    12M |  4376 |   1GB |  4353 |   1GB |                      |  
    | p007           | Set 1 |       8 |    0.75 |    0.75 |          |             |             |        |       |     . |       |     . |                      |  
    | p008           | Set 1 |       9 |    1.30 |    1.30 |          |             |             |        |       |     . |       |     . |                      |  
    | p009           | Set 1 |      10 |    0.35 |    0.35 |          |             |             |        |       |     . |       |     . |                      |  
    | p00a           | Set 1 |      11 |    0.29 |    0.29 |          |             |             |        |       |     . |       |     . |                      |  
    | p00b           | Set 1 |      12 |    0.33 |    0.33 |          |             |             |        |       |     . |       |     . |                      |  
    | p00c           | Set 1 |      13 |    1.79 |    1.79 |          |             |             |        |       |     . |       |     . |                      |  
    | p00d           | Set 1 |      14 |    2.01 |    2.01 |          |             |             |        |       |     . |       |     . |                      |  
    | p00e           | Set 1 |      15 |    0.40 |    0.40 |          |             |             |        |       |     . |       |     . |                      |  
    | p00f           | Set 1 |      16 |    1.08 |    1.08 |          |             |             |        |       |     . |       |     . |                      |  
    | p00g           | Set 2 |       1 |    7.42 |    6.93 |     0.49 |             |             |   152K |  9093 |   1GB |       |     . |                      |  
    | p00h           | Set 2 |       2 |    8.27 |    7.85 |     0.42 |             |             |   194K | 11377 |   1GB |       |     . | direct path read (1) |  
    | p00i           | Set 2 |       3 |    7.20 |    6.70 |     0.50 |             |             |   169K | 10081 |   1GB |       |     . | direct path read (1) |  
    | p00j           | Set 2 |       4 |    6.93 |    6.68 |     0.25 |             |             |   148K |  8798 |   1GB |       |     . |                      |  
    | p00k           | Set 2 |       5 |    7.55 |    7.26 |     0.29 |             |             |   178K | 10424 |   1GB |       |     . |                      |  
    | p00l           | Set 2 |       6 |    7.36 |    7.01 |     0.35 |             |             |   194K | 10789 |   1GB |       |     . | direct path read (1) |  
    | p00m           | Set 2 |       7 |    8.28 |    7.74 |     0.54 |             |        0.00 |   190K | 10764 |   1GB |       |     . |                      |  
    | p00n           | Set 2 |       8 |      10 |    8.66 |     0.92 |             |             |   199K | 11897 |   1GB |       |     . | direct path read (1) |  
    | p00o           | Set 2 |       9 |    7.66 |    7.49 |     0.18 |             |             |   167K |  9627 |   1GB |       |     . |                      |  
    | p00p           | Set 2 |      10 |    7.94 |    7.63 |     0.31 |             |             |   176K | 10795 |   1GB |       |     . | direct path read (2) |  
    | p00q           | Set 2 |      11 |    6.14 |    5.82 |     0.32 |             |             |   131K |  7795 |   1GB |       |     . |                      |  
    | p00r           | Set 2 |      12 |    7.29 |    6.96 |     0.33 |             |             |   171K |  9370 |   1GB |       |     . | direct path read (1) |  
    | p00s           | Set 2 |      13 |    8.06 |    7.61 |     0.45 |             |             |   168K |  9848 |   1GB |       |     . |                      |  
    | p00t           | Set 2 |      14 |    8.01 |    7.62 |     0.40 |             |             |   196K | 11424 |   1GB |       |     . |                      |  
    | p00u           | Set 2 |      15 |    7.41 |    7.11 |     0.30 |             |             |   157K |  9356 |   1GB |       |     . |                      |  
    | p00v           | Set 2 |      16 |    8.26 |    7.93 |     0.33 |             |             |   185K | 10916 |   1GB |       |     . | direct path read (2) |  
    ===============================================================================================================================================================
    

    Comment by Yuri — October 3, 2019 @ 2:44 pm BST Oct 3,2019 | Reply

    • Yuri,

      One thing I notice about your execution plan is that operation 10 predicts and gets just one row – which it then distributes by HASH. This means that only one of the 16 PX servers gets that row to do a hash join – and that may be why the Expression Evaluation appears to be happening only at one server. It’s odd that the optimizer has done this, but the first thing I’d try to bypass the anomaly is to use the pq_distribute() hint to BROADCAST that one row to all 16 PX servers and see what happens then.
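      By way of illustration only – the original SQL wasn't posted, so the table names, aliases, and join columns below are pure invention – the shape of hint set I have in mind is something like:

```sql
select  /*+
                parallel(t 16)
                leading(v t)
                use_hash(t)
                pq_distribute(t broadcast none)   -- broadcast the one-row build input
        */
        v.small_vc, t.big_col
from
        one_row_view    v,          -- the one-row view at operation 10 of the plan
        big_table       t           -- the table behind the 41M-row distribution
where
        t.join_col = v.join_col
;
```

      With (broadcast, none) every PX server gets its own copy of the single build row, so each one can take part in the hash join and, presumably, in the expression evaluation above it.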

      You didn’t post the SQL, but it looks as if you may have three scalar subqueries, with two being subject to “expression evaluation” and one being operated by the query coordinator – which is interesting if true.

      Comment by Jonathan Lewis — October 3, 2019 @ 4:41 pm BST Oct 3,2019 | Reply

      • Actually, the main query produced 12M rows, and the unique PK column was passed to the scalar subquery as a correlation predicate, so the hash distribution must have spread all the rows evenly.
        I'll create a simple test case to demonstrate the issue.

        Comment by Yuri — October 4, 2019 @ 7:06 am BST Oct 4,2019 | Reply

        • Yuri,

          
          SQL Plan Monitoring Details (Plan Hash Value=2564914776)
          =======================================================================================================================================================================================================================
          | Id |                    Operation                    |        Name        |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity |   Activity Detail    |
          |    |                                                 |                    | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |     (# samples)      |
          =======================================================================================================================================================================================================================
          |  0 | SELECT STATEMENT                                |                    |         |      |        42 |    +19 |       |          |      |       |       |       |       |       |     8.41 | Cpu (19)             |
          |  1 |   TABLE ACCESS BY USER ROWID                    | OBJ                |       1 |    1 |           |        |       |          |      |       |       |       |       |       |          |                      |
          |  2 |   PX COORDINATOR                                |                    |         |      |        43 |    +19 |    33 |      12M |      |       |       |       |       |       |     0.88 | Cpu (2)              |
          |  3 |    PX SEND QC (RANDOM)                          | :TQ10002           |    400K | 913K |        42 |    +20 |    16 |      12M |      |       |       |       |       |       |     1.77 | Cpu (4)              |
          |  4 |     EXPRESSION EVALUATION                       |                    |         |      |        42 |    +20 |    16 |      12M |      |       |       |       |       |       |     2.65 | Cpu (6)              |
          |  5 |      VIEW                                       |                    |    400K | 913K |        42 |    +20 |    16 |      12M |      |       |       |       |       |       |     1.77 | Cpu (4)              |
          |  6 |       HASH JOIN BUFFERED                        |                    |    400K | 119K |        61 |     +1 |    16 |      12M | 4353 |   1GB |  4353 |   1GB |   11M |    1G |     7.08 | Cpu (16)             |
          |  7 |        BUFFER SORT                              |                    |         |      |         1 |     +2 |    16 |        1 |      |       |       |       |  2048 |       |          |                      |
          |  8 |         PX RECEIVE                              |                    |       1 |    5 |         1 |     +2 |    16 |        1 |      |       |       |       |       |       |          |                      |
          |  9 |          PX SEND HASH                           | :TQ10000           |       1 |    5 |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 10 |           VIEW                                  |                    |       1 |    5 |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 11 |            FILTER                               |                    |         |      |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 12 |             SORT GROUP BY                       |                    |       1 |    5 |         1 |     +2 |     1 |        1 |      |       |       |       |  2048 |       |          |                      |
          | 13 |              NESTED LOOPS OUTER                 |                    |       3 |    4 |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 14 |               INDEX RANGE SCAN                  | TAB_T_PK           |       2 |    1 |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 15 |               COLLECTION ITERATOR PICKLER FETCH |                    |       2 |    2 |         1 |     +2 |     1 |        1 |      |       |       |       |       |       |          |                      |
          | 16 |        PX RECEIVE                               |                    |    650M | 118K |        19 |     +2 |    16 |      41M |      |       |       |       |       |       |     3.10 | Cpu (7)              |
          | 17 |         PX SEND HASH                            | :TQ10001           |    650M | 118K |        19 |     +1 |    16 |     154M |      |       |       |       |       |       |    31.42 | Cpu (71)             |
          | 18 |          PX BLOCK ITERATOR                      |                    |    650M | 118K |        19 |     +1 |    16 |     154M |      |       |       |       |       |       |     8.85 | Cpu (20)             |
          | 19 |           INDEX FAST FULL SCAN                  | XIF1_OBJ           |    650M | 118K |        19 |     +1 |   223 |     154M | 162K |  20GB |       |       |       |       |    15.49 | Cpu (26)             |
          |    |                                                 |                    |         |      |           |        |       |          |      |       |       |       |       |       |          | direct path read (9) |
          | 20 |      COUNT STOPKEY                              |                    |         |      |        38 |    +20 |   24M |     1500 |      |       |       |       |       |       |     4.42 | Cpu (10)             |
          | 21 |       INDEX RANGE SCAN                          | XIF1_PARS          |       1 |    1 |        41 |    +20 |   24M |     1500 |   23 | 184KB |       |       |       |       |    14.16 | Cpu (32)             |
          | 22 |      TABLE ACCESS BY USER ROWID                 | PARS               |       1 |    1 |           |        |       |          |      |       |       |       |       |       |          |                      |
          =======================================================================================================================================================================================================================
          
          
          

          The way I read it, the activity starts with operations 10 to 15 running serially and producing one row. This one row is then “hash” distributed to the first set of PX servers, and the “Rows (Actual)” column shows that there’s only one row, which means only one PX server can do any further work in the hash join.

          The next stage is the fast full scan of XIF1_OBJ which does a hash distribution of 41M rows. The single PX server that has a row operates the hash join at operation 6 – the other PX servers receive and immediately discard any incoming data because their in-memory build tables are empty.

          The “expression evaluation” executes at operation 4 and appears to have operations 20 and 22 as its children – which suggests that IF there had been more than one PX server with data those subqueries would have been running concurrently across all 16 servers.

          Operation 3 – which executes as the parent to the expression evaluation – passes the data to the query coordinator, which runs the scalar subquery in the select list that is operation 1 (I realised a little late why there was one scalar subquery that was behaving differently from the other two).

          I would still suggest your first experiment would be to hint the plan to broadcast the results at operation 9.

          Comment by Jonathan Lewis — October 4, 2019 @ 10:04 am BST Oct 4,2019

  4. Jonathan,

    First of all, I really appreciate the time you have spent on my question.

    “to use the pq_distribute() hint” – somehow I have not managed to make the CBO change the distribution method at step 9 of the plan.

    What I did manage to do was change the distribution method for XIF1_OBJ, and it seems the subquery processing also got distributed: https://docs.google.com/document/d/11ukbkeTOEpdREZ6GtPVNYqa9x1b1BFhSicIUvaqvrlQ/edit?usp=sharing
    But in this case the shape of the execution plan also changed.

    I also tried to create a synthetic case which more or less closely reproduces the shape of the original execution plan.
    Here is a script https://drive.google.com/open?id=1IOezyVmCzX4S_MTbWAyrbozfeMUww7l9
    And two results:
    – partially reproduced the uneven distribution of scalar subquery processing https://drive.google.com/open?id=1iA1ke2YDy5HA60JHGsTXD23-jvMcI9ko
    – even distribution https://drive.google.com/open?id=1YFp0qCSVXr9e3yRDtkXgSn5-vxCz6jn5

    The only difference between those two tests is that the first one was run while the optimizer statistics on T_REF and T_MAIN were quite imprecise. After the statistics were regathered, the even distribution appeared.
    But in the latter case Oracle employed HYBRID HASH distribution even though all adaptive features are disabled on my database instance.

    Comment by Yuri — October 4, 2019 @ 4:05 pm BST Oct 4,2019 | Reply

    • Yuri,

      Looking at the two links I’ve got, step 9 changes from a SEND HASH to a SEND BROADCAST, which means the one row turns into 16 copies; each slave gets a copy and then fast full scans 1/16th of the XIF1_OBJ index without having to do any inter-slave messaging – and that’s a good thing. It looks to me as if the plan has now done exactly the right thing (after a little bad luck with the original hash distribution), allowing the expression evaluation to run in concurrent processes.

      Comment by Jonathan Lewis — October 4, 2019 @ 5:19 pm BST Oct 4,2019 | Reply

  5. Jonathan,

    It does not make any sense to me so far; it looks like a kind of “double counting”, but the numbers say it is rather “double execution”.
    A very simple query with a scalar subquery in the SELECT list: https://drive.google.com/open?id=1L-srQNcI53vlScZPaSPNMajonalL2RtY
    Parallel execution (degree 16) takes 124 secs, not much faster than the serial run below
    – AWR https://drive.google.com/open?id=1UXsrbhTlIAHb8_Dkd2-RdOQoDEcI299z
    – SQL run-time stats https://drive.google.com/open?id=1ri0ncRMxCEERscIcbZCLOEmZrGSbscWw
    – V$PQ_TQSTAT https://drive.google.com/open?id=19M5S7gv34wn6OyDoulfQFyR8u1rkSo6T
    Serial execution takes 133 secs
    – AWR https://drive.google.com/open?id=1V6Pso9c5SI7SvuINWQlqJ_qlPGG0H6_q
    – SQL run-time stats https://drive.google.com/open?id=1XODIJwjGXrP99E6QtPQckv30Sa1DN_WC

    Comment by Yuri — October 8, 2019 @ 12:35 pm BST Oct 8,2019 | Reply

    • Sorry, I meant the LIO numbers for the scalar subquery execution steps. The parallel execution shows doubled LIO relative to the serial execution. I attached AWR reports for both tests so the LIO numbers can be cross-checked against the “Instance Activity” sections.

      Comment by Yuri — October 8, 2019 @ 2:20 pm BST Oct 8,2019 | Reply

    • Yuri,

      I’ve downloaded your test script and run the data creation and query parts on 12.2.0.1 and 19.3.0.0 and there’s something very buggy going on.
      As it says at the bottom of the article: “you have to look at the output from SQL trace (or similar) to see the individual Rowsource Executions Statistics for the slaves and coordinator to see which process actually ran the subqueries.” so I did.
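      (For anyone repeating the exercise, here is a minimal sketch of enabling that trace – the tracefile identifier is arbitrary, and in recent versions the 10046 event set in the QC session propagates to its PX slaves:)

```sql
alter session set tracefile_identifier = 'px_scalar';
alter session set events '10046 trace name context forever, level 8';

-- run the parallel query here, then:

alter session set events '10046 trace name context off';
```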

      The QC and the parallel execution slaves both ran the scalar subquery.

      I ran parallel 4, then used tkprof on the QC and P00n trace files, then grep’ed out the line from the Rowsource Execution Statistics in the output files that was the “SORT AGGREGATE” line for the scalar subquery. Your script as supplied created only 50,000 rows in the main table – this is the result I got from 12.2

      p0.prf:      1744       6842      11940     SORT AGGREGATE (cr=138645 pr=118 pw=0 time=3442179 us starts=1744)
      p1.prf:      2044       7256      12469     SORT AGGREGATE (cr=147822 pr=99 pw=0 time=3797850 us starts=2044)
      p2.prf:      1762       7358      12954     SORT AGGREGATE (cr=150058 pr=96 pw=0 time=4083545 us starts=1762)
      p3.prf:      1892       7264      12637     SORT AGGREGATE (cr=146514 pr=104 pw=0 time=3837480 us starts=1892)
      qc.prf:      50000      50000      50000     SORT AGGREGATE (cr=1025585 pr=0 pw=0 time=26585851 us starts=50000)
      
      

      And this is from 19.3 – just a slight difference in the way Oracle has done the first/avg/max row reporting (which looks highly suspect in the 12.2 report):

      
      p0.prf:     12605      12605      12605     SORT AGGREGATE (cr=259225 pr=177 pw=0 time=8497533 us starts=12605)
      p1.prf:     12572      12572      12572     SORT AGGREGATE (cr=259558 pr=128 pw=0 time=8046543 us starts=12572)
      p2.prf:     12592      12592      12592     SORT AGGREGATE (cr=262584 pr=153 pw=0 time=8178479 us starts=12592)
      p3.prf:     12231      12231      12231     SORT AGGREGATE (cr=242208 pr=149 pw=0 time=8093948 us starts=12231)
      qc.prf:     50000      50000      50000     SORT AGGREGATE (cr=1024368 pr=0 pw=0 time=41495621 us starts=50000)
      
      

      Until further notice (and some PX tracing) I’m inclined to think there’s an odd PL/SQL effect in the 12.2 results, because every slave reported 2 executions of the query, leading to the split figures. The 12.2 “max” figures are the totals.

      For a couple of extra checks I created a clone of t_main and used that in the subquery – and the slaves reported a small number of physical reads of the subquery table and index while the query coordinator reported only buffer gets – suggesting that the slaves had read the blocks before the QC repeated the subquery evaluation.

      I also queried v$px_sesstat while the query was running, and this was consistent with slaves and QC running the subquery, and then the slave LIO being summed up into the QC stats as the query ended.
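      A sketch of the sort of query I mean against v$px_sesstat (run from a second session while the test query is active, watching the ‘session logical reads’ statistic for the QC and each slave):

```sql
select
        st.qcsid, st.server_group, st.server_set, st.server#, st.sid,
        sn.name, st.value
from
        v$px_sesstat    st,
        v$statname      sn
where
        sn.statistic# = st.statistic#
and     sn.name = 'session logical reads'
order by
        st.qcsid, st.server_group, st.server_set, st.server#
;
```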

      SO: looks like a bug.

      Comment by Jonathan Lewis — October 8, 2019 @ 3:43 pm BST Oct 8,2019 | Reply

