Oracle Scratchpad

July 27, 2014

Parallel Plans

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 8:39 pm BST Jul 27,2014

I’ve popped this note to the top of the stack because I’ve added an index to Randolf Geist’s series on parallel execution skew, and included a reference to his recent update to the XPLAN_ASH utility.

This is the directory for a short series I wrote discussing how to interpret parallel execution plans in newer versions of Oracle.

For other aspects of parallel execution, here are links to several articles by Randolf Geist, published on his own blog or on Oracle’s Technet:

One of the awkward problems you can encounter with parallel execution is data skew – which has the potential to make just one slave in a set do (almost) all the work, hence reducing the performance to something close to serial execution times. Randolf has written a series of articles on Parallel Skew that has been published by AllthingsOracle over the last few months.

And a summary posting from Randolf listing the 5 articles above, but also including a set of short videos on the topic.

 

March 10, 2014

Parallel Execution – 5

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 1:30 pm BST Mar 10,2014

In the last article (I hope) of this series I want to look at what happens when I change the parallel distribution method on the query that I’ve been using in my previous demonstrations.  This was a query first introduced in a note on Bloom Filters (opens in a separate window) where I show two versions of a four-table parallel hash join, one using the broadcast distribution mechanism throughout, the other using the hash distribution method. For reference you can review the table definitions and plan (with execution stats) for the serial join in this posting (also opens in a separate window).

To change distribution methods from the broadcast example to the hash example I’ve simply changed a few hints in my code. Here are two sets of hints showing what I’ve done; the first is a repeat from the third article showing the broadcast example, the second shows the small change needed to get the hash example:


/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
    full(t3) parallel(t3, 2)
    monitor
*/

/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 hash hash)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 hash hash)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 hash hash)
    full(t3) parallel(t3, 2)
    monitor
*/

Because of the combination of the leading() hint with the use_hash() and swap_join_inputs() hints the plan WILL still build in-memory hash tables from t1, t2, and t3 and it WILL still probe each hash table in turn with the rows (that survive) from t4; but the order of activity in the hash distribution plan will be dramatically different from the order in the serial and parallel broadcast plans, where the order in which Oracle actually built the in-memory hash tables was t3, t2, t1.

Here – with a little cosmetic adjustment – is the parallel execution plan using hash distribution on 11.2.0.4, captured from memory with rowsource execution stats enabled (the 12c plan would report "PX SEND HYBRID HASH" operators with an associated "STATISTICS COLLECTOR" operator, showing that adaptive execution was a possibility – with three points at which the plan might switch from hash distribution to broadcast):


--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |          |      1 |   437 (100)|          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   1 |  SORT AGGREGATE                    |          |      1 |            |          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   2 |   PX COORDINATOR                   |          |      1 |            |          |        |      |            |      2 |00:00:00.08 |      16 |      5 |
|   3 |    PX SEND QC (RANDOM)             | :TQ10006 |      0 |            |          |  Q1,06 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE                 |          |      2 |            |          |  Q1,06 | PCWP |            |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN                     |          |      2 |   437   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE           | :BF0000  |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE                  |          |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH               | :TQ10004 |      0 |     2   (0)| 00:00:01 |  Q1,04 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR         |          |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 10 |           TABLE ACCESS FULL        | T3       |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  11 |       PX RECEIVE                   |          |      2 |   435   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH                | :TQ10005 |      0 |   435   (3)| 00:00:03 |  Q1,05 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE            | :BF0000  |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|* 14 |          HASH JOIN BUFFERED        |          |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |    630 |00:00:00.01 |       0 |      0 |
|  15 |           JOIN FILTER CREATE       | :BF0001  |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  16 |            PX RECEIVE              |          |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  17 |             PX SEND HASH           | :TQ10002 |      0 |     2   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  18 |              PX BLOCK ITERATOR     |          |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 19 |               TABLE ACCESS FULL    | T2       |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  20 |           PX RECEIVE               |          |      2 |   432   (3)| 00:00:03 |  Q1,05 | PCWP |            |    632 |00:00:00.01 |       0 |      0 |
|  21 |            PX SEND HASH            | :TQ10003 |      0 |   432   (3)| 00:00:03 |  Q1,03 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  22 |             JOIN FILTER USE        | :BF0001  |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |    632 |00:00:00.09 |       0 |      0 |
|* 23 |              HASH JOIN BUFFERED    |          |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.09 |       0 |      0 |
|  24 |               JOIN FILTER CREATE   | :BF0002  |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  25 |                PX RECEIVE          |          |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  26 |                 PX SEND HASH       | :TQ10000 |      0 |     2   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  27 |                  PX BLOCK ITERATOR |          |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 28 |                   TABLE ACCESS FULL| T1       |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  29 |               PX RECEIVE           |          |      2 |   427   (2)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.08 |       0 |      0 |
|  30 |                PX SEND HASH        | :TQ10001 |      0 |   427   (2)| 00:00:03 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  31 |                 JOIN FILTER USE    | :BF0002  |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.05 |    6044 |   6018 |
|  32 |                  PX BLOCK ITERATOR |          |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWC |            |  14700 |00:00:00.04 |    6044 |   6018 |
|* 33 |                   TABLE ACCESS FULL| T4       |     26 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.04 |    6044 |   6018 |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T3"."ID"="T4"."ID3")
  10 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T3"."SMALL_VC")=1 OR TO_NUMBER("T3"."SMALL_VC")=2 OR TO_NUMBER("T3"."SMALL_VC")=3))
  14 - access("T2"."ID"="T4"."ID2")
  19 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T2"."SMALL_VC")=1 OR TO_NUMBER("T2"."SMALL_VC")=2 OR TO_NUMBER("T2"."SMALL_VC")=3))
  23 - access("T1"."ID"="T4"."ID1")
  28 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T1"."SMALL_VC")=1 OR TO_NUMBER("T1"."SMALL_VC")=2 OR TO_NUMBER("T1"."SMALL_VC")=3))
  33 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T4"."ID1"))
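If you want to pull a plan in this format from memory yourself, a call along the following lines does the job – this is a sketch rather than the exact script I used, and it assumes rowsource execution statistics were collected (statistics_level = all, or a gather_plan_statistics hint in the query):

select	*
from	table(dbms_xplan.display_cursor(
		sql_id          => null,	-- null means "the last statement run by this session"
		cursor_child_no => null,
		format          => 'allstats cost parallel'	-- format options to taste
	));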

There are a couple of significant points that are very easy to point out in this plan. First, we have a number of lines which are “JOIN FILTER CREATE/USE” lines that did not appear in the broadcast plan; second, we can only see one sys_op_bloom_filter() in the predicate section rather than three (don’t worry, it’s – partly – a reporting defect); finally we have seven virtual tables (table queues :TQnnnnn) in this plan rather than four, and those virtual tables seem to be scattered rather more randomly around the plan.

To make it easier to understand what’s happened with a parallel execution plan, I usually also dump out the contents of v$pq_tqstat after running the query – so here’s the result after running the above:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         30         16           0
                                             1 P001                     1         55         26         14           0

                    1 Producer               1 P002                  1476      35520          2          1           0
                                             1 P003                 13224     317880          1          0           0
                      Consumer               1 P000                  9800     235584         20         14           0
                                             1 P001                  4900     117816         20         14           0

                    2 Producer               1 P000                     3         69          0          0           0
                                             1 P001                     0         48          0          0           0
                      Consumer               1 P002                     2         62         33         19           0
                                             1 P003                     1         55         32         19           0

                    3 Producer               1 P000                   422       9754          0          0           0
                                             1 P001                   210       4878          0          0           0
                      Consumer               1 P002                   420       9708         33         19           0
                                             1 P003                   212       4924         32         18           0

                    4 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         42         20           0
                                             1 P001                     1         55         39         15           0

                    5 Producer               1 P002                    18        444          0          0           0
                                             1 P003                     9        246          0          0           0
                      Consumer               1 P000                    18        444         41         20           0
                                             1 P001                     9        246         39         16           0

                    6 Producer               1 P000                     1         60          0          0           0
                                             1 P001                     1         60          0          0           0
                      Consumer               1 QC                       2        120          1          0           0
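
(For reference, the output above comes from a query of the shape shown below – a sketch, with the ordering being my choice; remember that v$pq_tqstat only reports on the most recent parallel statement run by the querying session.)

select
	dfo_number, tq_id, server_type, instance, process,
	num_rows, bytes, waits, timeouts, avg_latency
from
	v$pq_tqstat
order by
	dfo_number, tq_id, server_type desc, instance, process
;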

So let’s work our way through the execution plan – if you want to put the plan and my comments side by side, this link will re-open this article in a second window.

Given the set of hints, and the intent I expressed at the start of the series, we hope to see Oracle building an in-memory hash table from each of t1, t2 and t3 in that order, following which it will scan t4, probe t1, t2, and t3 in that order, and then aggregate the result.  Let’s check that using the parallel plan rule of “follow the table queues”.

Table queue 0 covers lines 26 – 28: we scan t1 and distribute it by hash. We can see from the A-Rows column that we found 3 rows and distributed them, and if we look at the output from v$pq_tqstat we find it matches – slaves 2 and 3 produced 3 rows, slaves 0 and 1 consumed 3 rows. Table queue 1 covers lines 30 – 33: we scan t4 and distribute it by hash. We can see from the A-Rows column that we found 14,700 rows and distributed them, and again we can see the match in v$pq_tqstat – slaves 2 and 3 produced 14,700 rows and distributed them to slaves 0 and 1. But there’s an oddity here, and things start to get messy: from the predicate section we can see that we applied a Bloom filter on the ID1 column to the data we got from the tablescan, and the plan itself shows a Bloom filter (:BF0002) being used at line 31, but that Bloom filter is created at line 24 of the plan and line 24 has been associated with table queue 3. Now I know (because I constructed the data) that a perfect filter has been created and used at that point because 14,700 rows is exactly the volume of data that should eventually join between tables t1 and t4.  It’s reasonable, I think, to say that the boundary between table queues 0 and 3 is a little blurred at lines 24/25 – the slaves that are going to populate table queue 3 are the ones that created the Bloom filter, but they’re not going to populate table queue 3 just yet.

So let’s move on to table queue 2. This covers lines 17 – 19 (looking at the TQ column), except that I’m going to assume the same blurring of boundaries I claimed for table queue 0 – I’m going to say that table queue 2 expands into lines 15 – 19 (bringing in the PX RECEIVE and the JOIN FILTER CREATE for :BF0001). So our next step is to scan and distribute table t2, and build a Bloom filter from it. Again we look at v$pq_tqstat and see that in this case it’s slaves 0 and 1 which scan the table and distribute 3 rows to slaves 2 and 3, and we assume that slaves 2 and 3 will send a Bloom filter back to slaves 0 and 1.

Now we can move on to table queue 3: line 21 writes to table queue 3 by using lines 22, 23, 24, 25, and 29 according to the TQ column (but thanks to the blurring of the boundaries lines 24 and 25 were used “prematurely” to create the Bloom filter :BF0002 describing the results from table t1). So lines 24/25 read table queue 0 and built an in-memory hash table, simultaneously creating a Bloom filter and sending it back to slaves 2 and 3; then line 23 did a HASH JOIN BUFFERED, which means it copied the incoming data from table queue 1 (slaves 2 and 3, table t4) into a buffer and then used that buffer to probe its in-memory hash table and do the join; then line 22 applied a Bloom filter (:BF0001) to the result of the hash join, although the filter won’t appear in the predicate section until version 12.1.0.1. Notice that line 23 (the join) produced 14,700 rows, demonstrating that our previous filter was a perfect filter, and then line 22 filtered out all but 632 rows. (Again, because I constructed the data, I can tell you that the second Bloom filter has also worked with 100% accuracy – although v$pq_tqstat seems to show an extra 2 rows which I can’t account for and which don’t appear in the trace file.)

So here’s another problem – we’re using another Bloom filter that we haven’t yet (apparently) created, unless we accept my assumption of the blurring of the boundary at lines 15 and 16, where the plan shows two lines associated with table queue 5 even though I need them to be associated with table queue 2 so that they can produce the Bloom filter needed by table queue 3. Again, by the way, we can do the cross-check with TQ_ID 3 of v$pq_tqstat and see that slaves 0 and 1 produced 632 rows and sent them to slaves 2 and 3.

Before continuing, let’s rewrite the action so far as a series of bullet points:

  • Slaves 2,3 scan t1 and distribute to slaves 0,1
  • Slaves 0,1 build an in-memory hash table and a Bloom filter (:BF0002) for t1, and send the filter to slaves 2,3
  • Slaves 2,3 scan t4, use the Bloom filter (:BF0002) to eliminate data (luckily 100% perfectly) and distribute the remaining rows to slaves 0,1
  • Slaves 0,1 buffer the incoming data
  • Slaves 0,1 scan t2 and distribute to slaves 2,3
  • Slaves 2,3 build an in-memory hash table for the results from t2 and a Bloom filter (:BF0001) for t2, and send the filter to slaves 0,1
  • Slaves 0,1 use the buffered t4 to probe the in-memory hash table of t1 to do the join, testing join results against the Bloom filter (:BF0001) for t2, and distributing the surviving rows to slaves 2,3

The pattern of the last four steps will then repeat for the next hash join – and for longer joins the pattern will repeat up to, but excluding, the last join.

  • Slaves 2,3 buffer the incoming data (the result of joining t4 and t1) – the buffering is implied by line 14 (which is labelled as an input for table queue 5)
  • Slaves 2,3 scan t3 and distribute to slaves 0,1 (reading lines 8,9,10 of the plan), cross-checking with TQ_ID 4 of v$pq_tqstat
  • Slaves 0,1 build an in-memory hash table for the results from t3 and a Bloom filter (:BF0000) for t3, and send the filter to slaves 2,3 (“sharing” lines 6 and 7 from table queue 6)
  • Slaves 2,3 use the buffered results from (t4/t1) to probe the in-memory hash table of t2 to do the join, testing join results against the Bloom filter (:BF0000) for t3, and distributing the surviving rows to slaves 0,1.

Again, we can check row counts – the hash join buffered at line 14 shows 630 rows coming from the hash join (i.e. the previous Bloom filter was perfect), and line 13 shows 27 rows surviving the final Bloom filter. Again my knowledge of the data tells me that the Bloom filter was a perfect filter. Cross-checking to TQ_ID 5 of v$pq_tqstat we see slaves 2 and 3 producing 27 rows and slaves 0 and 1 consuming them.

So at this point slaves 0,1 have an in-memory hash table for t3, and are receiving the filtered results of the join between t4, t1, and t2; the slaves have to join and aggregate the two data sets before forwarding a result to the query co-ordinator. Since the aggregation is a blocking operation (i.e. slaves 0,1 can’t send data to the co-ordinator until they’ve emptied virtual table 5 and aggregated all the incoming data) they don’t have to use the “hash join buffered” mechanism, so the pattern for the final part of the plan changes.

Lines 5, 6, 7, 11 show us the hash join (not buffered) with its two inputs (although lines 6 and 7 have, of course, been mentioned once already as the source of the Bloom filter used at line 13). Then line 4 shows slaves 0 and 1 aggregating their results; line 3 shows them forwarding the results to the query co-ordinator, line 2 shows the query co-ordinator receiving the results and line 1 shows it aggregating across the slave results ready to send to the end-user.

It’s a bit complicated, and the constant jumping back and forth through the execution plan lines (especially for the “shared” usage of the Bloom filter creation lines) makes it quite hard to follow, so I’ve drawn up a PowerPoint slide to capture the overall picture:
[px_plan – slide showing the flow of data between the two slave sets and the query co-ordinator]

I’ve put slaves 0 and 1 at the top of the picture, slaves 2 and 3 at the bottom, with the query co-ordinator in the middle at the right hand side. Time reads across the page from left to right, and that gives you the order in which data moves through table queues (and back, for Bloom filters). The annotations give you some idea of what data is moving. Note that I’ve used B1 to refer to the Bloom filter on table T1 (and ignored the numbering on Oracle’s :BFnnnn entries). I’ve used red to highlight the data sets that are buffered, and put in curved arrows to show where the buffered data is subsequently brought back into play. I did try to add the various plan line numbers to the picture, but the volume of text made the whole thing incomprehensible – so I’ve left it with what I think is the best compromise of textual information and graphical flow.

I’ll just leave one final warning – if you want to reproduce my results, you’ll have to be careful about versions. I stuck with 11.2.0.4 as that’s the latest version of the most popular general release. There are differences in 12.1.0.1, and there are differences again if you try to emulate 11.2.0.4 by setting the optimizer_features_enable in 12.1.0.1 back to the earlier version.

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm BST Mar 5,2014

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000
;

-- collect stats, no histograms.
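
-- (A sketch of the sort of call I mean – "for all columns size 1" is the
--  standard method_opt for avoiding histograms.)

begin
	dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 't2', method_opt => 'for all columns size 1');
end;
/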

select
	/*+
		leading(t1 t2)
		parallel(t1 2)
		parallel(t2 2)
		use_hash(t2)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of rows filtered from t1” – in this example that’s 2 x 1,000 = 2,000 rows.

After a slave in slave set 1 has received the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3, 4, 8, 9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately to the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1, 2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things? With the same starting data and query, here’s the execution plan:

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata, where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changes to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication – there’s a pair of hints: pq_replicate and no_pq_replicate to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:


  /*+
      BEGIN_OUTLINE_DATA
      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "T2"@"SEL$1" BROADCAST NONE)
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      OPT_PARAM('_optimizer_cost_model' 'io')
      DB_VERSION('12.1.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice how the hint specifies table t2, not table t1!
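
If you want to experiment, the inline version of the hints seems to follow the same pattern as the outline – i.e. you name the table that appears in the pq_distribute() hint. Here’s a sketch (based purely on the outline above, so treat the hint placement as an assumption rather than documented syntax) applied to the original query:

select
	/*+
		leading(t1 t2)
		parallel(t1 2)
		parallel(t2 2)
		use_hash(t2)
		pq_distribute(t2 broadcast none)
		pq_replicate(t2)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

Swapping pq_replicate(t2) for no_pq_replicate(t2) should push the plan back to the 11g-style broadcast shape.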

Footnote

Here’s a little anomaly, and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable('11.2.0.4') */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.

February 28, 2014

Empty Hash

Filed under: Bugs,CBO,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:45 pm BST Feb 28,2014

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would still run even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider viewpoint – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago (I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere – I can’t find it on my blog or old website). The problem can be demonstrated by this example:

(more…)

February 26, 2014

Parallel Execution – 4

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 1:58 pm BST Feb 26,2014

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

(more…)

December 8, 2013

Parallel Execution – 3

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:09 pm BST Dec 8,2013

It’s finally time to take a close look at the parallel versions of the execution plan I produced a little while ago for a four-table hash join. In this note I’ll examine the broadcast parallel distribution. First, here’s a list of the hints I’m going to use to get the effect I want:

	/*+
		leading(t4 t1 t2 t3)
		full(t4) parallel(t4, 2)
		use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
		full(t1) parallel(t1, 2)
		use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
		full(t2) parallel(t2, 2)
		use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
		full(t3) parallel(t3, 2)
		monitor
	*/

(more…)

November 5, 2013

Deadlock

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 6:14 am BST Nov 5,2013

There’s an interesting example of a deadlock on the OTN database forum:

DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
PS-00000001-00000011        92     423     S             33     128     S     X
BF-2ed08c01-00000000        33     128     S             92     423     S     X

(more…)

October 14, 2013

Parallel Execution – 2

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:29 pm BST Oct 14,2013

Since I’m going to write a couple of articles dissecting parallel execution plans, I thought I’d put up a reference post describing the set of tables I used to generate the plan, and the query (with serial execution plan) that I’ll be looking at. The setup is a simple star schema arrangement – which I’ve generated by creating three identical tables and then doing a Cartesian join across the three of them.

(more…)

October 13, 2013

Parallel Execution – 1

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:43 pm BST Oct 13,2013

When you read an execution plan you’re probably trying to identify the steps that Oracle went through to acquire the final result set so that you can decide whether or not there is a more efficient way of getting the same result.

For a serial execution plan this typically means you have to identify the join order, join methods and access methods together with the point at which each predicate was applied. Generally this isn’t too difficult, although subquery pushing (and a couple of presentation bugs relating to other uses of scalar subqueries) can cause a little confusion; and the difference between join order and the order of operation can be slightly obscured when considering hash joins.

Parallel execution plans are harder, though, because you really need to understand the impact of the order of operation, distribution mechanisms chosen, and (in recent versions of Oracle) the timing of the generation and use of Bloom filters. The topic is stunningly large and easy to describe badly; it’s also going to be easy to make generalisations that turn out to be untrue or (at least) sufficiently incomplete as to be misleading. Rather than attempting to cover the topic in one note, I think I’m going to end up writing two or three.

(more…)

September 13, 2013

Quiz Night

Filed under: Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:32 pm BST Sep 13,2013

Here’s a little quiz about Bloom filtering. There seem to be at least three different classes of query where Bloom filters can come into play – all involving hash joins: partition elimination, aggregate reduction on non-mergeable aggregate views, and parallelism.

This quiz is about parallel queries – and all you have to do is work out how many Bloom filters were used in the following two execution plans (produced by 11.2.0.2), and where they were used.

(more…)

September 9, 2013

Parallel Execution

Filed under: Oracle,Parallel Execution,Performance — Jonathan Lewis @ 6:51 am BST Sep 9,2013

While checking out potential scalability threats recently on a client system, I was directed to a time-critical task that was currently executing the same PL/SQL procedure 16 times (with different parameters) between 6:00 and 7:00 am; as the system went through its next phase of expansion the number of executions of this procedure was likely to grow. An interesting detail, though, was that nothing else was going on while the task was running so the machine (which had 6 cores) was running at 16% CPU.

An obvious strategy for handling the required growth target was to make sure that four (possibly 5) copies of the procedure were allowed to run concurrently. Fortunately the different executions were completely independent of each other and didn’t interfere with each other’s data, so the solution simply required a mechanism to control the parallelism. Conveniently 11gR2 gave us one.
(more…)

July 25, 2013

Parallel to Serial

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:21 pm BST Jul 25,2013

Here’s a little problem that came up on the Oracle-L listserver today:

I’m trying to write a query which reads the corresponding partition of the fact, extracts the list of join keys, materialises this result set, and finally joins the necessary dimensions. The key thing I’m trying to do is to run the initial query on the fact in parallel and then the rest of the query serially.

The full requirement, if you follow the link, may seem a little puzzling but there’s no point in second-guessing every question that people ask – there’s usually a reason for doing something in a particular way – so I just rattled off the first thing that came to mind, which was this: when you include rownum in a parallel query Oracle has to serialise to generate the rownum – so create an inline view which does the parallel part of the work but adds a rownum to the select list, then join to the inline view. The plan should include a VIEW operator holding the parallel bit, and then you can hint as necessary to make the subsequent activity serial. A sketch of the idea follows.
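
To make the suggestion concrete, here’s the shape I had in mind – the table names, column names, bind variable and degree of parallelism are all hypothetical, and the hints are just one way of keeping the dimension access serial:

select
	/*+ no_parallel(d) */
	v.*,
	d.padding
from	(
	select	/*+ parallel(f 8) */
		f.join_key,
		rownum	rn	-- rownum has to be generated serially, so this view stays a self-contained (parallel) unit in the plan
	from	fact_table f
	where	f.part_key = :b1
	)	v,
	dim_table d
where	d.join_key = v.join_key
;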

(more…)

July 11, 2013

12c Top N (px)

Filed under: 12c,Oracle,Parallel Execution — Jonathan Lewis @ 9:04 am BST Jul 11,2013

A comment from Greg Rahn in response to my posting yesterday prompted me to do a quick follow-up (test time ca. 3 minutes, write-up time, ca. 50 minutes – thanks for the temptation, Greg ;). Greg asked if the “Top N” would push down for a parallel query, so all I had to do was re-run my script with a parallel hint in place.  (Such is the joy of constructing test cases – when you get a new idea you may not need to do much work to test it.)

(more…)

July 3, 2013

maxthr – 3

Filed under: CBO,Oracle,Parallel Execution,Statistics,System Stats — Jonathan Lewis @ 6:29 pm BST Jul 3,2013

In part 1 of this mini-series we looked at the effects of costing a tablescan serially and then parallel when the maxthr and slavethr statistics had not been set.

In part 2 we looked at the effect of setting just the maxthr - and this can happen if you don’t happen to do any parallel execution while the stats collection is going on.

In part 3 we’re going to look at the two variations the optimizer displays when both statistics have been set. So here are the starting system stats:

begin
	dbms_stats.delete_system_stats;
	dbms_stats.set_system_stats('MBRC',        64);
	dbms_stats.set_system_stats('MREADTIM',    10);
	dbms_stats.set_system_stats('SREADTIM',     5);
	dbms_stats.set_system_stats('CPUSPEED',  2000);
	dbms_stats.set_system_stats('MAXTHR',  262144);
	dbms_stats.set_system_stats('SLAVETHR', 65536);
	dbms_stats.set_system_stats('SLAVETHR', 47000);
	dbms_stats.set_system_stats('SLAVETHR', 16384);
end;
/

You’ll notice that I’ve shown three options for slavethr so, when running the tests, I will be commenting out two of them. The middle value is the important one as I’ve set it just below a critical breakpoint. You’ll recall that the optimizer is programmed to behave as if a parallel slave will operate at 90% of the speed of a serial process. If we take the 64 block read, at 8KB per block, completed in 10 ms, this represents 52428.8 bytes per ms. 90% of that is 47,186 bytes per ms – hence the choice for slavethr in the second of the tests.

Remember that a serial tablescan of my data had an I/O cost of 1,251 (or 1,250 if you ignore the “tablescan cost plus 1” effect) and that we could investigate the parallel costs by reference to the original serial cost compared to the degree of parallelism. We’re going to do that again, but in this case I’m going to run my tablescan just once (at parallel degree 5) for each of the three values of slavethr (lowest to highest) in turn.

Here are the resulting execution plans:

slavethr=16384
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   800   (0)| 00:00:05 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 40000 |   195K|   800   (0)| 00:00:05 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 40000 |   195K|   800   (0)| 00:00:05 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

   IO_COST   CPU_COST       COST
---------- ---------- ----------
       800    1333333        800

slavethr=47000
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   279   (0)| 00:00:02 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 40000 |   195K|   279   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 40000 |   195K|   279   (0)| 00:00:02 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

   IO_COST   CPU_COST       COST
---------- ---------- ----------
       279    1333333        279

slavethr=65536
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   278   (0)| 00:00:02 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 40000 |   195K|   278   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 40000 |   195K|   278   (0)| 00:00:02 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

13 rows selected.

   IO_COST   CPU_COST       COST
---------- ---------- ----------
       278    1333333        278

As a starting point, we can say that the modified cost is always going to be 1250 * (serial throughput rate / parallel throughput rate), where, in this test suite, the serial throughput rate in bytes per ms is 64 * 8K / 10 = 52,428.8.

Working from the top down:
When slavethr = 16384 the aggregate throughput rate is 5 * 16384 = 81920, so the I/O cost should be 1250 * 52428.8/81920 = 800 (Q.E.D)

When slavethr = 47000 the aggregate throughput rate is 5 * 47000 = 235,000 so the I/O cost should be 1250 * 52428.8/235,000 = 279 (Q.E.D) You’ll notice that this is very close to the figure I had from the first test when I didn’t have maxthr or slavethr set and the optimizer used its “90% of serial” trick.

When slavethr = 65536, something odd has happened – instead of a significant change in I/O cost, the result actually matches the figure we got when slavethr wasn’t set. The rule is simple – if slavethr is larger than the throughput implied by mbrc (etc.) the optimizer ignores it and falls back to the “90% of serial” model.
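
Pulling the rule and the arithmetic together, here’s a little SQL sketch that reproduces the three I/O costs above. Note that modelling the fallback as a cap at 90% of the serial rate is my interpretation of the breakpoint, not something taken from Oracle’s code:

-- 52428.8 = MBRC * block size / MREADTIM = 64 * 8192 / 10 (bytes per ms)
-- 1250    = serial tablescan I/O cost; 5 = degree of parallelism
select
	slavethr,
	round(1250 * 52428.8 / (5 * least(slavethr, 0.9 * 52428.8)))	io_cost
from	(
	select 16384 slavethr from dual union all
	select 47000          from dual union all
	select 65536          from dual
	)
;

That gives 800, 279 and 278 respectively, matching the three plans above.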

Reminder.

I’ve been showing you how Oracle does the arithmetic with the statistics it has. It’s very important to remember that this is just arithmetic – it’s Oracle trying to work out the best (likely) execution plan given some assumptions about what ought to be the limiting factors when the query runs. In effect the arithmetic says: “if we assume (based on the statistics) that we can’t do better than parallel 6 then the best plan is P” – but if the hint actually says /*+ parallel(t1 42) */ then at run time Oracle will take the plan that’s appropriate for running parallel 6 and try to run it at parallel 42 – and that may be a big mistake.

Warning: The manuals say that maxthr and slavethr are stored as bytes per second; it seems that they’re really bytes per millisecond in (at least) 10g and 11g, but change to bytes per second in 12c. If you upgrade to 12c, make sure you check your system statistics before and after the upgrade to make sure that you have allowed for this change otherwise you may find that Oracle becomes very unenthusiastic about running parallel queries.
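
A quick way to check what’s currently recorded (the values are stored in sys.aux_stats$) is a query like this sketch:

select	pname, pval1
from	sys.aux_stats$
where	sname = 'SYSSTATS_MAIN'
and	pname in ('MAXTHR', 'SLAVETHR')
;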

June 27, 2013

maxthr – 2

Filed under: CBO,Oracle,Parallel Execution,Statistics,System Stats — Jonathan Lewis @ 5:08 pm BST Jun 27,2013

Actually, there hasn’t been a “maxthr – 1”; I called the first part of this series “System Stats”. If you look back at it you’ll see that I set up some system statistics, excluding the maxthr and slavethr values, and described how the optimizer would calculate the cost of a serial tablescan, then I followed this up with a brief description of how the calculations changed if I hinted the optimizer into a parallel tablescan.

(more…)
