I started writing this note in March 2015 with the following introductory comment:
A little while ago I wrote a few notes about a very resource-intensive parallel query. One of the points I made about it was that it was easy to model, and then interesting to run on later versions of Oracle. So today I’m going to treat you to a few of the observations and notes I made after modelling the problem; and here’s the SQL to create the underlying objects:
Unfortunately I failed to do anything more with the model I had created until a few days ago (June 2019 – in case I stall again) when a related question came up on the ODC database forum. This time I’m ready to go a little further – so I’ll start with a bait-and-switch approach. Here are the first few lines (omitting the SQL) of an SQL Monitor report from an instance of 18.3 – is this a power-crazed machine or what?
Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  TEST_USER (169:11324)
 SQL ID              :  73y5quma4jnw4
 SQL Execution ID    :  16777216
 Execution Started   :  06/13/2019 22:06:32
 First Refresh Time  :  06/13/2019 22:06:32
 Last Refresh Time   :  06/13/2019 22:07:03
 Duration            :  31s
 Module/Action       :  MyModule/MyAction
 Service             :  SYS$USERS
 Program             :  sqlplus@linux183.localdomain (TNS V1-V3)
 Fetch Calls         :  591

Global Stats
=========================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency |  Other   | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs | Bytes |
=========================================================================================
|      14 |    3.18 |     0.00 |        0.05 |       11 |   591 |  25978 |   62 | 13MB  |
=========================================================================================

Parallel Execution Details (DOP=3 , Servers Allocated=6730)
==========================================================================================
It didn’t take long to run the query – only about 31 seconds. But the thing to notice in the report is that while the DOP is reported as 3, the number of “Servers Allocated” is a massive 6,730. So the big question (before I show you more of the report, explain what’s happening, and supply the code to build the model) is this: how many PX processes did I actually start?
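As an aside, if you want to pull a text report like this for yourself, a minimal sketch of the call to dbms_sqltune looks like the following – the sql_id is the one shown in the report above, and the formatting options are a matter of taste:

```sql
set linesize 250
set long 1000000
set longchunksize 1000000
set pagesize 0

select  dbms_sqltune.report_sql_monitor(
                sql_id       => '73y5quma4jnw4',
                type         => 'TEXT',
                report_level => 'ALL'
        )
from    dual
;
```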
Here’s a little more of the output:
Parallel Execution Details (DOP=3 , Servers Allocated=6730)
==========================================================================================================================================================
|      Name      |  Type  | Group# | Server# | Elapsed |   Cpu   |    IO    | Concurrency |  Other   | Buffer | Read | Read  |         Wait Events        |
|                |        |        |         | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes |         (sample #)         |
==========================================================================================================================================================
| PX Coordinator | QC     |        |         |      14 |    3.13 |          |        0.05 |       11 |  23727 |      |     . | PX Deq: Join ACK (5)       |
|                |        |        |         |         |         |          |             |          |        |      |       | PX Deq: Signal ACK EXT (2) |
|                |        |        |         |         |         |          |             |          |        |      |       | sql_id: 6405a2hc50bt4 (1)  |
| p004           | Set 1  |      1 |       1 |    0.00 |    0.00 |          |             |          |    180 |      |     . | library cache: mutex X (1) |
| p005           | Set 1  |      1 |       2 |    0.00 |    0.00 |          |             |          |    100 |      |     . |                            |
| p006           | Set 1  |      1 |       3 |    0.00 |    0.00 |          |             |          |     90 |      |     . |                            |
| p000           | Set 1  |      2 |       1 |    0.01 |    0.01 |          |             |          |        |      |     . |                            |
| p001           | Set 1  |      2 |       2 |    0.02 |    0.02 |          |             |          |        |      |     . |                            |
| p002           | Set 2  |      2 |       1 |    0.01 |    0.01 |     0.00 |             |          |    944 |   32 |   7MB |                            |
| p003           | Set 2  |      2 |       2 |    0.01 |    0.01 |     0.00 |             |          |    937 |   30 |   7MB |                            |
==========================================================================================================================================================
Despite “allocating” 6,730 servers Oracle is only admitting to having used 7 of them, so let’s take a closer look at how they’re used. There are two groups, and we have one set of 3 slaves in group 1, and two sets of two slaves in group 2. (It looks to me as if the Group# and Type columns should be the other way around given the hierarchy of group / type / server#.) We can understand a little more of what these numbers mean if we look at the execution plan – particularly the special columns relating to Data Flow Operations (DFOs) and “DFO trees”.
SQL Plan Monitoring Details (Plan Hash Value=3398913290)
========================================================================================================================================================================
| Id |          Operation            |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity |      Activity Detail       |
|    |                               |          | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |        (# samples)         |
========================================================================================================================================================================
|  0 | SELECT STATEMENT              |          |         |      |        32 |     +0 |     1 |     8846 |      |       |     . |     2.70 | Cpu (1)                    |
|  1 |  FILTER                       |          |         |      |        32 |     +0 |     1 |     8846 |      |       |     . |     5.41 | PX Deq: Signal ACK EXT (2) |
|  2 |   PX COORDINATOR              |          |         |      |        32 |     +0 |     5 |     8846 |      |       |     . |          |                            |
|  3 |    PX SEND QC (RANDOM)        | :TQ20002 |    9146 |  128 |        29 |     +2 |     2 |     8846 |      |       |     . |          |                            |
|  4 |     HASH JOIN BUFFERED        |          |    9146 |  128 |        29 |     +2 |     2 |     8846 |      |       |   9MB |          |                            |
|  5 |      PX RECEIVE               |          |    8846 |   11 |        14 |     +2 |     2 |     8846 |      |       |     . |          |                            |
|  6 |       PX SEND HYBRID HASH     | :TQ20000 |    8846 |   11 |         1 |     +0 |     2 |     8846 |      |       |     . |          |                            |
|  7 |        STATISTICS COLLECTOR   |          |         |      |         1 |     +0 |     2 |     8846 |      |       |     . |          |                            |
|  8 |         PX BLOCK ITERATOR     |          |    8846 |   11 |         1 |     +0 |     2 |     8846 |      |       |     . |          |                            |
|  9 |          TABLE ACCESS FULL    | T2       |    8846 |   11 |         1 |     +0 |    23 |     8846 |   24 |   1MB |     . |          |                            |
| 10 |      PX RECEIVE               |          |   50000 |  116 |        14 |     +2 |     2 |     2509 |      |       |     . |          |                            |
| 11 |       PX SEND HYBRID HASH     | :TQ20001 |   50000 |  116 |         1 |     +0 |     2 |     2509 |      |       |     . |          |                            |
| 12 |        PX BLOCK ITERATOR      |          |   50000 |  116 |         1 |     +0 |     2 |     2509 |      |       |     . |          |                            |
| 13 |         TABLE ACCESS FULL     | T1       |   50000 |  116 |         1 |     +0 |    26 |     2509 |   38 |  12MB |     . |          |                            |
| 14 |   PX COORDINATOR              |          |         |      |        31 |     +1 |  8978 |     2252 |      |       |     . |    13.51 | PX Deq: Join ACK (5)       |
| 15 |    PX SEND QC (RANDOM)        | :TQ10000 |       1 |   77 |        32 |     +0 |  6667 |     3692 |      |       |     . |          |                            |
| 16 |     PX BLOCK ITERATOR         |          |       1 |   77 |        32 |     +0 |  6667 |    92478 |      |       |     . |     2.70 | Cpu (1)                    |
| 17 |      TABLE ACCESS FULL        | T3       |       1 |   77 |        32 |     +0 | 53118 |    92478 |   32 |   8MB |     . |    67.57 | Cpu (25)                   |
========================================================================================================================================================================
The “Name” column shows us that we have two DFO trees (:TQ2nnnn and :TQ1nnnn) – this is why we see two “groups” in the PX server detail, and why those groups can have different degrees of parallelism.
Looking at the general shape of the plan you can see that operation 1 is a FILTER operation with two child operations, one at operation 2, the other at operation 14. So we probably have a filter subquery in place, operated as DFO tree #1, while the main query is operated as DFO tree #2. This means the main query is running at DOP = 2 (it’s a hash join with hash distribution so it needs two sets of slave processes, and all the details agree with what we’ve learned about Group# 2 above); and the subquery is operating at DOP = 3 – and it’s using only one set of slave processes.
There is a little anomaly in the number of Execs of operation 14 – at some point I will examine this more closely, but it might simply be a reporting error that has added the number of Execs of its child operations to its own Execs, it might be something to do with counting in Exec calls by its parent, or it might be a side effect of scalar subquery caching. I’ll worry about it when I have a good reason to do so. What I want to look at is the Execs of operations 15/16, the PX Send QC / PX Block Iterator. There are 6,667 reports of a PX slave executing, and that matches up quite nicely with the 6,730 reported “Servers Allocated” – so it would appear that Oracle says it’s allocating a server whenever it uses a server. But does it really “allocate” (and, presumably, de-allocate)?
Here’s how you find out – you run the query again, taking various snapshots and looking for cross-references. I’ve got some results from v$pq_tqstat and v$pq_slave for the run that produced the SQL Monitor report above, and some of the QC session stats and enqueue stats for a subsequent run. This is what we see:
select  process, count(*)
from    v$pq_tqstat
group by process
order by process
;

PROCESS                    COUNT(*)
------------------------ ----------
P000                              3
P001                              3
P002                              2
P003                              2
P004                           2225
P005                           2214
P006                           2218
QC                             2243

SQL> select slave_name, sessions from v$pq_slave order by slave_name;

SLAV   SESSIONS
---- ----------
P000          1
P001          1
P002          1
P003          1
P004       2242
P005       2242
P006       2242

Key Session Stats
=================
Name                                      Value
----                                      -----
opened cursors cumulative                 6,955
user calls                               20,631
recursive calls                          20,895
enqueue requests                         22,699
enqueue conversions                      13,610
enqueue releases                         15,894
parse count (total)                       6,857
execute count                             6,966
DFO trees parallelized                        2
Parallel operations not downgraded        2,268

Key Enqueue Stats
=================
Type Short name             Requests   Waits  Success  Failed  Wait m/s
---- ----------             --------   -----  -------  ------  --------
DA   Slave Process Array       2,272      13    2,272       0        43
PS   contention               27,160   1,586   27,080       7       415
SE   contention                6,784       0    6,785       0         0

TYPE DESCRIPTION
---- ------------------------------------------------------------------------
PS   Parallel Execution Server Process reservation and synchronization
DA   Slave Process Spawn reservation and synchronization
SE   Lock used by transparent session migration
Oracle really did start and stop something like 6,700 PX sessions (constantly re-using the same small set of PX slave processes) for each execution of the filter subquery. This is definitely a performance threat – we keep acquiring and releasing PX slaves, we keep creating new sessions (yes, really), and we keep searching for cursors in the library cache. All these activities are highly contentious. If you start running multiple queries that do this sort of thing you find that you see increasing amounts of time being spent on latch contention, PX slave allocation, mutex waits, and all the other problems you get with sessions that log on, do virtually nothing, then log off in rapid succession.
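If you want to watch the churn as it happens you can snapshot v$pq_slave from a second session while the query is running – the sessions figure for the slaves servicing the subquery climbs continuously. A minimal sketch (note that dbms_session.sleep() appeared in 18c; on older versions use dbms_lock.sleep()):

```sql
-- Take two snapshots of the PX slave session counts a few seconds apart;
-- a rapidly climbing SESSIONS figure for a slave indicates session churn.
select  slave_name, sessions
from    v$pq_slave
order by slave_name
;

execute dbms_session.sleep(10)

select  slave_name, sessions
from    v$pq_slave
order by slave_name
;
```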
So how do you write SQL that does this type of thing? Here’s my data model (you may want to limit the number of rows in the tables):
rem
rem     Script:         parallel_filter_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2019
rem

create table t1 as select * from all_source;
create table t2 as select * from all_source where mod(line,20) = 1;
create table t3 as select * from all_source;
And here’s all you have to do to start creating problems. I’ve added explicit hints to force parallelism (particularly for the subquery); in a production system it’s more likely that parallelism has been introduced accidentally by table or index definitions, or by an “alter session” to “force parallel”:
set feedback only

select
        /*+
                parallel(t1 2)
                parallel(t2 2)
                leading(t1 t2)
                use_hash(t2)
                swap_join_inputs(t2)
                pq_distribute(t2 hash hash)
                cardinality(t1,50000)
        */
        t1.owner,
        t1.name,
        t1.type
from
        t1
join
        t2
on      t2.owner = t1.owner
and     t2.name  = t1.name
and     t2.type  = t1.type
where
        t1.line = 1
and     (
           mod(length(t1.text), 10) = 0
        or exists (
                select  --+ parallel(t3 3)
                        null
                from    t3
                where   t3.owner = t1.owner
                and     t3.name  = t1.name
                and     t3.type  = t1.type
                and     t3.line >= t1.line
           )
        )
;

set feedback on
I’ve written notes in the past about SQL that forces the optimizer to run subqueries as filter subqueries instead of unnesting them – this is just an example of that type of query, pushed into parallelism. It’s not the only way (see comment #1 from Dmitry Remizov below) to end up with scalar subqueries being executed many times as separate DFO trees even though Oracle has enhanced the optimizer several times over the years in ways that bypass the threat – but the problem can still appear, and it’s important to notice in development that you’ve got a query that Oracle can’t work around.
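If you do spot a query like this in development, one damage-limitation option (where the business logic allows it) is to do the optimizer’s OR-expansion by hand, splitting the disjunction into a UNION ALL so that the exists subquery stands alone and becomes a candidate for unnesting into a semi-join. Here’s a sketch of the idea against my demo query – I haven’t tested this variant, and note the negated predicate in the second branch to avoid double-counting rows that satisfy both conditions (if t1.text could be null you’d need lnnvl() rather than a simple inequality):

```sql
-- Hand-written OR-expansion of the demo query (untested sketch).
-- Branch 1: rows that pass the simple predicate.
-- Branch 2: rows that fail it but pass the exists() - which, standing
--           alone, may now be unnested into a semi-join.
select  t1.owner, t1.name, t1.type
from    t1
join    t2
on      t2.owner = t1.owner
and     t2.name  = t1.name
and     t2.type  = t1.type
where   t1.line = 1
and     mod(length(t1.text), 10) = 0
union all
select  t1.owner, t1.name, t1.type
from    t1
join    t2
on      t2.owner = t1.owner
and     t2.name  = t1.name
and     t2.type  = t1.type
where   t1.line = 1
and     mod(length(t1.text), 10) != 0
and     exists (
                select  null
                from    t3
                where   t3.owner = t1.owner
                and     t3.name  = t1.name
                and     t3.type  = t1.type
                and     t3.line >= t1.line
        )
;
```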
Update (July 2020)
In the comment below I’ve said that 19.3 seems to have some code to do better and my test run reported only 290 re-allocations per PX process for the subquery. However I’ve just been prompted to re-run the test on 19.3 and v$pq_tqstat reported 2,300 rows for each of the three PX processes and the SQL Monitor output reported “Servers Allocated=6979”. Perhaps the earlier test got lucky with data ordering and benefited from scalar subquery caching.
Hi Jonathan,
I’m not sure that I’ve observed exactly the same issue recently, but what I saw looks very similar (tested on 11.2.0.4, 12.1).
It was related to a CASE expression with subqueries inside it.
I’ve managed to create a simplified test case:
Comment by Dmitry Remizov — June 13, 2019 @ 6:51 pm BST Jun 13,2019 |
Dmitry,
That’s one of the ways that this type of report can appear
(I’ll have to use Plan B for my example now ;)
I edited your comment to put in the “sourcecode” (in square brackets) tags that make it fixed font. I haven’t done a test run, so I hope I didn’t make any mistakes while doing the edit.
Comment by Jonathan Lewis — June 13, 2019 @ 7:10 pm BST Jun 13,2019 |
[…] By a lucky coincidence this allowed me to rediscover and publish an old example of parallel execution gone wild before moving on to talk about the fundamental problem exhibited in the latest […]
Pingback by Can’t Unnest | Oracle Scratchpad — June 17, 2019 @ 3:36 pm BST Jun 17,2019 |
Hi Jonathan
Looks like the filter subquery problem of allocating huge number of PX servers has been fixed on 18c.
Now the FILTER is below the query coordinator and Oracle does not allocate a crazy number of PX slaves.
Haven’t tested scalar subqueries yet.
—
Best Regards
Vishal
Comment by Vishal — May 20, 2020 @ 6:06 pm BST May 20,2020 |
Vishal,
Thanks for the comment.
Which exact version of Oracle were you using? I ran the demo above on 18.3, and I’ve just repeated it on 19.3 with very similar (though not quite so extreme) results.
For example, from 19.3 the results from v$pq_tqstat were as follows:
(and the “Servers Allocated” figure from the SQL Monitor report was 907)
Regards
Jonathan Lewis
Comment by Jonathan Lewis — May 20, 2020 @ 7:47 pm BST May 20,2020 |
Thanks Jonathan. I picked up the test case from Randolf’s blog. The query is similar, with a subquery that is not unnested and instead operates as a FILTER.
https://oracle-randolf.blogspot.com/2015/07/12c-parallel-execution-new-features.html
This test case on 11.2.0.4 results in 100K PX servers allocated, but on 18.1 it requests only 4.
—
Best Regards
Vishal
Comment by Vishal — May 20, 2020 @ 8:03 pm BST May 20,2020 |
Vishal,
There are so many variants – some version dependent – about what could happen with filter subqueries that you do have to look carefully at the plans to determine what’s actually happening. This is the query in Randolf’s comparison of 11.2.0.4 and 12.1.0.2:
This operates as a filter subquery in both those versions, but by 12.2.0.1 the optimizer is smart enough to recognise a special case and rewrite this query to the logically equivalent:
Since t_1 is declared parallel 4 you then get a simple tablescan using 4 parallel query slaves.
Regards
Jonathan Lewis
Comment by Jonathan Lewis — May 20, 2020 @ 8:49 pm BST May 20,2020 |
A tiny test case to get this “Fun”:
Comment by Timur Akhmadeev — August 5, 2020 @ 2:35 pm BST Aug 5,2020 |