According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):
A SELECT statement can be executed in parallel only if the following conditions are satisfied:

- The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
- At least one table specified in the query requires one of the following:
  - A full table scan
  - An index range scan spanning multiple partitions
- No scalar subqueries are in the SELECT list.
Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:
rem
rem     Script:         parallel_scalar_subq.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jun 2013
rem
rem     Last tested
rem             12.1.0.2        New options
rem             11.2.0.4        Scalar Subqueries run by QC
rem

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21) ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1) ref_t1,
        t1.padding
from
        driver  d1,
        t1,
        t2,
        t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID | REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN          | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID | REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN          | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)         | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                  |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                 |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN                |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE             |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST     | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL    | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR       |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL      | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE              |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST      | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL     | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE               |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST       | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL      | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)
Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run parallel. Note, however, where the PX SEND QC and PX COORDINATOR operations appear – lines 7 and 8, and above those lines we see the two scalar subqueries.
This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list. Is the manual “right but deceptive” or “wrong but giving the right impression” ? Serialising on (just) the scalar subqueries can have a huge impact on the performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.
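For reference, the check I mean is just a query against v$pq_tqstat, run from the same session immediately after the parallel statement completes (the column list and ordering here are my choice, not the exact script I used):

```sql
-- v$pq_tqstat is populated, per session, only after a parallel statement
-- has completed - so query it immediately, from the same session.
select
        dfo_number, tq_id, server_type,
        instance, process,
        num_rows, bytes, waits, timeouts
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;
```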
You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:
----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED          |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER            |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN                 |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER          |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN               |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE             |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST     | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR          |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR      |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL     | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST      | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR           |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW                 | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL  | REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST       | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR            |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL     | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE                |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST        | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR             |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                   | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY         |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL    | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE                 |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST         | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR              |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)
The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1 and VW_SSQ_2 (SSQ for “scalar subquery”, perhaps). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL, so the joins (operations 4 and 6) have to be outer joins.
You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit, but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour, in a way that we shall see shortly.
This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but that isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature, so we could, in principle, disable just that one feature by setting the parameter to false; a more appropriate strategy, though, would simply be to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both subqueries, or use the qb_name() hint to give the two subquery blocks names and then use the /*+ no_unnest(@my_qb_name) */ format at the top of the main query. Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism:
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                  |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION         |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                    |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                   |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                  |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE               |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST       | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL      | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR         |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL        | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE                |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST        | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL       | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                 |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                 |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST         | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL        | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID | REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN          | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID | REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN          | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)
Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operation, EXPRESSION EVALUATION, at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at operations 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave runs its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.
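For reference, the qb_name()/no_unnest() hinting strategy mentioned earlier might look like the following – the query block names main, ssq1 and ssq2 are my own choice, not anything special to Oracle:

```sql
select  /*+
                qb_name(main)
                parallel(t1 2)
                no_unnest(@ssq1)
                no_unnest(@ssq2)
        */
        d1.small_vc, t1.r1, t2.n21, t2.v21, t3.v31,
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21) ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1) ref_t1,
        t1.padding
from
        driver d1, t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;
```

The alternative of putting /*+ no_unnest */ directly inside each subquery has the same effect without needing the query block names.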
And finally
Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.
DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0
                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0
                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0
                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0
                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0
                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0
As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.
Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel, the producer is QC in all three of the first table queues (the full scan and broadcast of tables t1, t2, and t3).
DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0
                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0
                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0
                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0
It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries – you have to look at the output from SQL trace (or similar) to see the individual Rowsource Execution Statistics for the slaves and coordinator to see which process actually ran the subqueries.
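If you want to repeat that check, one approach (a sketch, not necessarily the exact method used here) is to enable extended SQL trace before running the query – the PX slaves inherit the event from the QC session – and then run tkprof on each trace file:

```sql
-- Sketch: enable extended SQL trace for the QC session; the PX slaves
-- inherit the setting, so each slave writes its own trace file.
alter session set tracefile_identifier = 'px_ssq';
alter session set events '10046 trace name context forever, level 8';

-- ... run the parallel query here ...

alter session set events '10046 trace name context off';

-- Then run tkprof on the QC and P00n trace files and look for the
-- SORT AGGREGATE line of the scalar subquery in the Row Source
-- Operation sections to see which processes actually executed it.
```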
Tanel Poder described this new 12c unnesting CBO enhancement in a little bit more detail: http://blog.tanelpoder.com/2013/08/13/oracle-12c-scalar-subquery-unnesting-transformation/
Comment by Stefan Koehler — May 14, 2015 @ 12:33 pm BST May 14,2015 |
[…] Jonathan Lewis “Parallel Query” […]
Pingback by Oracle SQL | 12c: New SQL PLAN OPERATIONS and HINTS — July 7, 2016 @ 7:53 pm BST Jul 7,2016 |
Hi Jonathan,
Did you have a chance to check whether or not every parallel slave process works on evaluating scalar subqueries?
I’m trying to make all the slaves work on that calculation but, judging from the statistics and the execution time of my query, only a single slave actually does the calculation.
What can prevent the other slaves from working on the calculation of scalar subqueries in a select list?
I put here just part of the SQL Monitor output.
The entire report is available here https://docs.google.com/document/d/1Uq-cmIkRHlCzFQTuTekrvGAoK39s78hCqJe8Wdz1vQc/edit?usp=sharing
Comment by Yuri — October 3, 2019 @ 2:44 pm BST Oct 3,2019 |
Yuri,
One thing I notice about your execution plan is that operation 10 predicts and gets just one row – which it then distributes by HASH. This means that only one of the 16 PX servers gets that row to do a hash join – and that may be why the Expression Evaluation appears to be happening only at one server. It’s odd that the optimizer has done this, but the first thing I’d try to bypass the anomaly is to use the pq_distribute() hint to BROADCAST that one row to all 16 PX servers and see what happens then.
You didn’t post the SQL, but it looks as if you may have three scalar subqueries, with two being subject to “expression evaluation” and one being operated by the query coordinator – which is interesting if true.
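For what it’s worth, the shape of hint I have in mind is something like the following – every name here is a placeholder, since the original SQL wasn’t posted:

```sql
-- Hypothetical sketch: broadcast the single-row build rowsource to all
-- PX servers instead of hash-distributing it. "small_t", "big_t" and
-- "join_col" are placeholder names, not names from the real query.
select  /*+
                parallel(16)
                leading(small_t big_t)
                pq_distribute(big_t broadcast none)
        */
        big_t.v1, small_t.v1
from
        small_t, big_t
where
        big_t.join_col = small_t.join_col
;
```

The pq_distribute() hint names the inner (probe) table of the join; “broadcast none” means the build rowsource is broadcast to every PX server while the probe rows stay where they are scanned.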
Comment by Jonathan Lewis — October 3, 2019 @ 4:41 pm BST Oct 3,2019 |
Actually, the main query produced 12M rows and the unique PK column was passed to the scalar subquery as a correlation predicate, so hash distribution must have spread all rows evenly.
I’ll create some simple test-case to demonstrate the issue.
Comment by Yuri — October 4, 2019 @ 7:06 am BST Oct 4,2019 |
Yuri,
The way I read it, the activity starts with operations 10 to 15 running serially and producing one row. This one row is then “hash” distributed to the first set of PX servers, and the “actual” rows column shows that there’s only one row, which means only one PX server can do any further work in the hash join.
The next stage is the fast full scan of XIF1_OBJ which does a hash distribution of 41M rows. The single PX server that has a row operates the hash join at operation 6 – the other PX servers receive and immediately discard any incoming data because their in-memory build tables are empty.
The “expression evaluation” executes at operation 4 and appears to have operations 20 and 22 as its children – which suggests that IF there had been more than one PX server with data those subqueries would have been running concurrently across all 16 servers.
Operation 3 – which executes as the parent to the expression evaluation – passes the data to the query coordinator, which runs the scalar subquery in the select list that is operation 1 (I realised a little late why there was one scalar subquery that was behaving differently from the other two).
I would still suggest your first experiment would be to hint the plan to broadcast the results at operation 9.
Comment by Jonathan Lewis — October 4, 2019 @ 10:04 am BST Oct 4,2019
Jonathan,
First of all, I really appreciate the time you’ve spent on my question.
“to use the pq_distribute() hint” – somehow I have not managed to make the CBO change the distribution method at step 9 of the plan.
What I did manage to do was change the distribution method for XIF1_OBJ, and it seems the subquery processing also got distributed: https://docs.google.com/document/d/11ukbkeTOEpdREZ6GtPVNYqa9x1b1BFhSicIUvaqvrlQ/edit?usp=sharing
But in this case, the shape of the execution plan also changed.
I also tried to create a synthetic case which more or less closely reproduces the shape of the original execution plan.
Here is a script https://drive.google.com/open?id=1IOezyVmCzX4S_MTbWAyrbozfeMUww7l9
And two results:
– partially reproduced the uneven distribution of scalar subquery processing https://drive.google.com/open?id=1iA1ke2YDy5HA60JHGsTXD23-jvMcI9ko
– even distribution https://drive.google.com/open?id=1YFp0qCSVXr9e3yRDtkXgSn5-vxCz6jn5
The only difference between those two tests is that the first one was run when the optimizer statistics on T_REF and T_MAIN were quite inaccurate. After the statistics were regathered, the even distribution appeared.
But in the latter case, Oracle employed HYBRID HASH even though all adaptive features are disabled on my database instance.
Comment by Yuri — October 4, 2019 @ 4:05 pm BST Oct 4,2019 |
Yuri,
Looking at the two links I’ve got, Step 9 changes from a SEND HASH to a SEND BROADCAST, which means the one row turns into 16 copies, and each slave gets a copy and then fast full scans 1/16th of the XIF1_OBJ index without having to do any inter-slave messaging: and that’s a good thing. It looks to me as if the plan has done exactly the right thing (with a little bad luck on the hash distribution) that allows the expression evaluation to be running in concurrent processes.
Comment by Jonathan Lewis — October 4, 2019 @ 5:19 pm BST Oct 4,2019 |
Jonathan,
It does not make any sense to me so far; it looks like a kind of “double counting”, but the numbers say it is rather “double execution”.
A very simple query with a scalar subquery in SELECT list: https://drive.google.com/open?id=1L-srQNcI53vlScZPaSPNMajonalL2RtY
Parallel execution (degree 16) takes 124 secs, not much faster than the serial run below
– AWR https://drive.google.com/open?id=1UXsrbhTlIAHb8_Dkd2-RdOQoDEcI299z
– SQL run-time stats https://drive.google.com/open?id=1ri0ncRMxCEERscIcbZCLOEmZrGSbscWw
– V$PQ_TQSTAT https://drive.google.com/open?id=19M5S7gv34wn6OyDoulfQFyR8u1rkSo6T
Serial execution takes 133 secs
– AWR https://drive.google.com/open?id=1V6Pso9c5SI7SvuINWQlqJ_qlPGG0H6_q
– SQL run-time stats https://drive.google.com/open?id=1XODIJwjGXrP99E6QtPQckv30Sa1DN_WC
Comment by Yuri — October 8, 2019 @ 12:35 pm BST Oct 8,2019 |
Sorry, I meant the LIO numbers for the scalar subquery execution steps. The parallel execution shows double the LIO relative to the serial execution. I attached AWR reports for both tests to crosscheck the LIO numbers via the “Instance activity” sections.
Comment by Yuri — October 8, 2019 @ 2:20 pm BST Oct 8,2019 |
Yuri,
I’ve downloaded your test script and run the data creation and query parts on 12.2.0.1 and 19.3.0.0 and there’s something very buggy going on.
As it says at the bottom of the article: “you have to look at the output from SQL trace (or similar) to see the individual Rowsource Executions Statistics for the slaves and coordinator to see which process actually ran the subqueries.” so I did.
The QC and the parallel execution slaves both ran the scalar subquery.
I ran parallel 4, then used tkprof on the QC and P00n trace files, then grep’ed out the line from the Rowsource Execution Statistics in the output files that was the “SORT AGGREGATE” line for the scalar subquery. Your script as supplied created only 50,000 rows in the main table – this is the result I got from 12.2
And this is from 19.3 – just a slight difference in the way Oracle has done the first/avg/max row reporting (which looks highly suspect in the 12.2 report):
Until further notice and some PX tracing I’m inclined to think there’s an odd PL/SQL effect in the 12.2 results because every slave reported 2 executions of the query, leading to the split figures. The 12.2 “max” figures are the total number.
For a couple of extra checks I created a clone of t_main and used that in the subquery – and the slaves reported a small number of physical reads of the subquery table and index while the query coordinator reported only buffer gets – suggesting that the slaves had read the blocks before the QC repeated the subquery evaluation.
I also queried v$px_sesstat while the query was running, and this was consistent with slaves and QC running the subquery, and then the slave LIO being summed up into the QC stats as the query ended.
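The sort of query I mean against v$px_sesstat is along these lines – a sketch, run from a second session while the parallel query is executing:

```sql
-- From a second session, while the parallel query is running:
-- report logical I/O for the QC and each of its PX slaves.
select
        st.qcsid, st.sid, st.value   buffer_gets
from
        v$px_sesstat    st,
        v$statname      sn
where
        sn.name       = 'session logical reads'
and     st.statistic# = sn.statistic#
order by
        st.qcsid, st.sid
;
```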
SO: looks like a bug.
Comment by Jonathan Lewis — October 8, 2019 @ 3:43 pm BST Oct 8,2019 |
[…] Parallel query ? (May 2015): When is a parallel query not “parallel enough”. […]
Pingback by Parallel Execution Catalogue | Oracle Scratchpad — September 8, 2022 @ 9:34 am BST Sep 8,2022 |