Oracle Scratchpad

February 12, 2015

Parallel rownum

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:27 am GMT Feb 12,2015

It’s easy to make mistakes, or overlook defects, when constructing parallel queries – especially if you’re a developer who hasn’t been given the right tools to make it easy to test your code. Here’s a little trap I came across recently that’s probably documented somewhere, which could be spotted easily if you had access to the OEM SQL Monitoring screen, but would be very easy to miss if you didn’t check the execution plan very carefully. I’ll start with a little script to generate some data:


create table t1 nologging
as
select * from all_objects where rownum <= 50000
;

insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create table t2 as select * from t1;
alter table t2 add id number(10,0);

All I’ve done is create some data – 800,000 rows – and then create a table to copy it to; and while I copy it I’m going to add a temporary id to the rows, which I’ll do with a call to rownum; and since there’s a lot of data I’ll use parallel execution:


alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.* , rownum from t1;

For the purposes of experiment and entertainment I've done something a little odd by supplying two possible degrees of parallelism, but this lets me ask the question: will this statement run parallel 3, parallel 4, both of the above, or neither? (You may assume that I have enough parallel execution slaves available when the statement runs.)
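
If you want to watch the answer appear for yourself, a quick check you can run from another session while the statement is executing (my sketch, not part of the original article) is to look at the PX slave allocation:

-- v$px_process lists the parallel execution servers currently
-- allocated to the instance; the rows with status 'IN USE'
-- belong to statements that are running right now.
select server_name, status, sid, serial#
from   v$px_process
order by server_name;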

The answer is both – because that rownum does something nasty to the execution plan (I didn’t include the 50,000 limit when creating t1 in my original test, which is why the plan reports 993K rows instead of 800,000):


--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT           |          |   993K|    92M|  1076   (1)| 00:00:13 |        |      |            |
|   1 |  PX COORDINATOR            |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ20001 |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT          | T2       |       |       |            |          |  Q2,01 | PCWP |            |
|   4 |     PX RECEIVE             |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | PCWP |            |
|   5 |      PX SEND ROUND-ROBIN   | :TQ20000 |   993K|    92M|  1076   (1)| 00:00:13 |        | S->P | RND-ROBIN  |
|   6 |       COUNT                |          |       |       |            |          |        |      |            |
|   7 |        PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |         PX SEND QC (RANDOM)| :TQ10000 |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | P->S | QC (RAND)  |
|   9 |          PX BLOCK ITERATOR |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------

See the "P->S" (parallel to serial) at operation 8. The select statement runs in parallel (degree 4) to scan the data, then sends it all to the query co-ordinator to supply the rownum; the query co-ordinator then re-distributes the data (including the rownum) to another set of slaves (the S->P at operation 5) to do the parallel (degree 3) insert. The P->S at operation 2 shows the parallel execution slaves passing the query co-ordinator the details of the private segments that they have created, so that it can stitch those segments together into a single data segment for the table. (If you watch the run-time activity closely you'll see the query co-ordinator doing a few local writes as it tidies up the header blocks of those segments.)
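
As an aside: if you don't have access to the SQL Monitor screen, the easiest way to check the shape of the plan for your own run (a sketch, not part of the original test) is to pull it from memory immediately after executing the statement in the same session:

-- with serveroutput off, dbms_xplan.display_cursor() with no
-- arguments reports the plan of the last statement executed by this
-- session; scan the IN-OUT column for the P->S operations above.
set serveroutput off
select * from table(dbms_xplan.display_cursor);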

There are two threats hiding in this rownum detail. The first, of course, is that the operation essentially serialises through the query co-ordinator, so it's going to take longer than you might expect; secondly, an accident of this type is typically going to allocate twice as many parallel execution slaves as you might have expected – the select and the insert are two separate data flow operations (note how the Name column shows :TQ1xxxx and :TQ2xxxx), each gets its own slave set, and both sets of slaves are held for the duration of the statement. If this statement is demanding twice the slaves it should be using, you may find that other statements starting at the same time get their degree of parallelism downgraded because you've run out of PX slaves. Although the rownum solution is nice and clean – it requires no further infrastructure – you probably need to introduce a sequence (with a large cache) to get the same effect without losing parallelism.
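
The sequence-based version might look something like the following sketch – the sequence name and cache size here are my own hypothetical choices, not from the original test:

-- each PX slave picks up its own cache of values from the sequence,
-- so the ids arrive with gaps and out of order - fine for a
-- temporary id, not for a gap-free counter; the large cache stops
-- the sequence itself from becoming a point of contention.
create sequence t2_seq cache 10000;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.*, t2_seq.nextval from t1;
commit;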

If you look at v$pq_tqstat after running this statement the results are a little disappointing – there are a few problems connecting lines from the plan with rows in the view – here’s my original output (and you’ll now see why I chose to have two different degrees of parallelism):


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0

         2          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                                             1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 QC                  993314  113479500        125         65           0

                    1 Consumer               1 QC                       3        393          2          1           0
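
(For reference, output like the above would come from a query along the following lines – my reconstruction from the column headings, not the original script. Note that v$pq_tqstat has to be queried from the same session, after the parallel statement has completed.)

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type, process
;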

The first problem is that the DFO_NUMBER reported in the view doesn't match the :TQ1xxxx and :TQ2xxxx reported in the plan – the parallel 4 bit is the select, which is covered by :TQ10000 but listed under DFO_NUMBER = 2 in the view, while the insert is the parallel 3 bit, which is covered by :TQ20000 and :TQ20001 but listed under DFO_NUMBER = 1.

Potentially more confusing is that all the appearances of the query co-ordinator have been assigned to DFO_NUMBER = 2. Ignoring the fact that the DFO_NUMBER column switches the 1 and 2 from the plan, what we should see is as follows:

  • The QC consumer (DFO_NUMBER = 2, TQ_ID = 0) is consuming from the 4 producers P003 – P006.
  • The QC producer (DFO_NUMBER = 2, TQ_ID = 0) is producing FOR the 3 consumers P000 – P002 (reported under DFO_NUMBER = 1, TQ_ID = 0).
  • The QC consumer at TQ_ID = 1 is consuming from the producers P000 – P002 (reported under DFO_NUMBER = 1, TQ_ID = 1).

Ideally (including the correction for the DFO_Number) I think the view content should be as follows:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                      Consumer               1 QC                  993314  113479500        125         65           0

         2          0 Producer               1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0
                      Consumer               1 QC                       3        393          2          1           0

Just don’t ask me why the waits and timeouts for the QC as producer seem to be counting backwards from 2^32 (4,294,967,296 − 4,294,967,269 = 27 waits, and 4,294,967,296 − 4,294,967,286 = 10 timeouts: it looks suspiciously like a small negative number being reported through an unsigned 32-bit counter).

10 Comments

  1. Thanks. I have to look at this.

    Comment by Hemant K Chitale — February 12, 2015 @ 8:40 am GMT Feb 12,2015 | Reply

  2. Jonathan,
    some questions and comments without a particular order:
    When I repeat your example in 12.1.0.2 I get the following plan:

    explain plan for
    insert /*+ parallel(t2 3) */ into t2
    select /*+ parallel(t1 4) */ t1.*, rownum from t1;
    
    -- skipping some columns
    --------------------------------------------------------------------------------------------
    | Id  | Operation                          | Name     | Rows  |    TQ  |IN-OUT| PQ Distrib |
    --------------------------------------------------------------------------------------------
    |   0 | INSERT STATEMENT                   |          |   800K|        |      |            |
    |   1 |  PX COORDINATOR                    |          |       |        |      |            |
    |   2 |   PX SEND QC (RANDOM)              | :TQ10002 |   800K|  Q1,02 | P->S | QC (RAND)  |
    |   3 |    LOAD AS SELECT (HYBRID TSM/HWMB)| T2       |       |  Q1,02 | PCWP |            |
    |   4 |     OPTIMIZER STATISTICS GATHERING |          |   800K|  Q1,02 | PCWP |            |
    |   5 |      PX RECEIVE                    |          |   800K|  Q1,02 | PCWP |            |
    |   6 |       PX SEND ROUND-ROBIN          | :TQ10001 |   800K|  Q1,01 | S->P | RND-ROBIN  |
    |   7 |        BUFFER SORT                 |          |   800K|  Q1,01 | SCWP |            |
    |   8 |         COUNT                      |          |       |  Q1,01 | SCWP |            |
    |   9 |          PX RECEIVE                |          |   800K|  Q1,01 | SCWP |            |
    |  10 |           PX SEND 1 SLAVE          | :TQ10000 |   800K|  Q1,00 | P->S | 1 SLAVE    |
    |  11 |            PX BLOCK ITERATOR       |          |   800K|  Q1,00 | PCWC |            |
    |  12 |             TABLE ACCESS FULL      | T1       |   800K|  Q1,00 | PCWP |            |
    --------------------------------------------------------------------------------------------
    
    Note
    -----
       - Degree of Parallelism is 2 because of table property
    
    -- v$pq_tqstat
    DFO_NUMBER      TQ_ID SERVER_TYP PROCES   NUM_ROWS      BYTES      WAITS
    ---------- ---------- ---------- ------ ---------- ---------- ----------
             1          0 Producer   P000       268655   35691068         14
             1          0 Producer   P001       263095   34444432         14
             1          0 Producer   P002       268250   35671700         14
             1          0 Consumer   P003       800000  105807056          9
             1          0 Consumer   P004            0         72          9
             1          0 Consumer   P005            0         72          9
             1          1 Producer   P003       800000  110596006          0
             1          1 Producer   P004            0         72          0
             1          1 Producer   P005            0         72          0
             1          1 Consumer   P000       267000   36889893         25
             1          1 Consumer   P001       267000   36950475         25
             1          1 Consumer   P002       266000   36755782         25
             1          2 Producer   P000            2        270       2289
             1          2 Producer   P001            2        270       2291
             1          2 Producer   P002            2        270       2282
             1          2 Consumer   QC              6        810         81
    

    Though the plan says that the Degree of Parallelism is 2, the content of v$pq_tqstat seems to show a parallelism of 3 for all operations (and only one DFO). Are the differences from your plan (including the LOAD AS SELECT (HYBRID TSM/HWMB)) a result of different releases, or can they be explained by different parameter settings? (I think my test system uses only default settings, but maybe I am missing something obvious; the degree for both tables is 1 according to USER_TABLES.)

    Another (quite basic) question: does the plan itself contain information on the degree of parallelism? I see that v$pq_tqstat contains it, but does the plan also include the parallel degree (except in the Note section, which seems to be misleading in my test)?

    And a small detail: the insert statement is incomplete – though it’s not too difficult to guess the missing part.

    Regards

    Martin

    Comment by Martin Preiss — February 12, 2015 @ 8:57 am GMT Feb 12,2015 | Reply

    • Hi Martin, among other things 12c introduces the “1 SLAVE” distribution, which obviously got added to avoid the decomposition into multiple DFO trees in such cases. I cover this in the videos mentioned in the comment to Jonathan, and I’ve also mentioned it in my “Parallel Execution 12c New Features Overview” post. I’ll also cover this new distribution method in an upcoming, separate post.

      Since this new distribution method results in a single DFO tree the rule applies that the degree specified for the Parallel DML part (the INSERT part) determines the overall degree used for the DFO tree – which should be then 3 in your example.

      It’s actually a 12c enhancement that you are supposed to see the degree in the “Note” section even in cases where AUTO DOP doesn’t get used (AUTO DOP already reports the degree in the Note section in 11.2). Obviously this new feature doesn’t work correctly in all cases :-).

      It’s important to point out that the “1 SLAVE” distribution doesn’t change the fact that some part of the execution plan is still executed only by one session, so effectively serializes that part. It just changes the internal mechanisms and avoids the overhead and side effects of the “multiple DFO tree” scenario.

      Randolf

      Comment by Randolf Geist — February 12, 2015 @ 10:00 am GMT Feb 12,2015 | Reply

      • Randolf,

        It’s interesting that Oracle has decided to use this “1 Slave” option – given that it uses a buffered operation it looks even worse than the two-DFO solution from 11g which allowed the QC to read from one set of slaves and write immediately to the second set.

        Comment by Jonathan Lewis — February 12, 2015 @ 10:23 am GMT Feb 12,2015 | Reply

        • Jonathan,

          yes, that’s true and an important point: by merging everything into a single DFO tree, the still-valid limitation applies that only a single redistribution is allowed to be active at the same time, hence additional BUFFER SORT or HASH JOIN BUFFERED operations might become necessary. The 1 SLAVE distribution avoids the issues and side effects of having potentially different degrees, and more PX slaves allocated than expected, when multiple DFO trees are active concurrently – but it might then lead to issues like this one as a side effect of having everything in a single DFO tree.

          Randolf

          Comment by Randolf Geist — February 12, 2015 @ 1:20 pm GMT Feb 12,2015

    • Martin,
      I think Randolf has answered your questions – but the “hybrid” should give you the clue that it’s a 12c “runtime optimisation” choice. Check the parameter list:

      _px_hybrid_TSM_HWMB_load    Enable Hybrid Temp Segment Merge/High Water Mark Brokered load method 
      
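
      (If you want to list hidden parameters like this one, the classic x$ query will do – a sketch only, and it has to be run as SYS:)

      select  i.ksppinm  name, v.ksppstvl  value, i.ksppdesc  description
      from    x$ksppi i, x$ksppcv v
      where   i.indx = v.indx          -- join parameter names to their current values
      and     i.ksppinm like '%px_hybrid%';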

      Thanks for the comment about the broken text – now fixed,

      Comment by Jonathan Lewis — February 12, 2015 @ 10:13 am GMT Feb 12,2015 | Reply

  3. Jonathan,

    regarding where this is documented: nowhere officially, I believe, but I talk about and demonstrate these threats in my videos on the topic. The decomposition into multiple DFO trees is covered in the “Analysing Parallel Execution Skew – Data Flow Operations (DFOs) And DFO Trees” part, and the defects in the population of V$PQ_TQSTAT for such complex plans are covered in “Analysing Parallel Execution Skew – Without Diagnostics / Tuning Pack License”. Both can be found on my blog and YouTube channel.

    I also cover in those videos what changes in that regard in 12c (and there are fortunately a lot of changes).

    Randolf

    Comment by Randolf Geist — February 12, 2015 @ 9:48 am GMT Feb 12,2015 | Reply

  4. Thanks, this was very useful indeed, hard to find this information elsewhere.

    Comment by Alexandre — November 6, 2015 @ 10:40 am GMT Nov 6,2015 | Reply

