Oracle Scratchpad

July 15, 2015

PQ Index anomaly

Filed under: Indexing,Oracle,Parallel Execution — Jonathan Lewis @ 8:42 am BST Jul 15,2015

Here’s an oddity prompted by a question that appeared on Oracle-L last night. The question was basically – “Why can’t I build an index in parallel when it’s a single-column index with most of the rows set to null and only a couple of values for the non-null entries?”

That’s an interesting question, since nothing in the description of the index suggests a reason for anything to go wrong, so I spent a few minutes trying to emulate the problem. I created a table with 10M rows and a column that was 3% ‘Y’ and 0.1% ‘N’, then created and dropped an index on it in parallel a few times. The report I used to check that the index build really had run parallel showed an interesting waste of resources. Here’s the code to build the table and index:


create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        case
                when mod(rownum,100) < 3 then 'Y'
                when mod(rownum,1000) = 7 then 'N'
        end                     flag,
        rownum                  id,
        rpad('x',30)            padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e7
;

-- gather stats here
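-- e.g. something like this (a sketch of one possibility, not necessarily the exact call used):
-- execute dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size 1')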

explain plan for
create index t1_i1 on t1(flag) parallel 4 nologging
;

select * from table(dbms_xplan.display);

create index t1_i1 on t1(flag) parallel 4 nologging;

select index_name, degree, leaf_blocks, num_rows from user_indexes;
alter index t1_i1 noparallel;

As you can see, I’ve used explain plan to get Oracle’s prediction of the cost and size, then I’ve created the index, then checked its size (and set it back to serial from its parallel setting). Here are the results of the various queries (from 11.2.0.4) – it’s interesting to note that Oracle thinks there will be 10M index entries when we know that “completely null entries don’t go into the index”:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |    10M|    19M|  3073   (3)| 00:00:16 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |    10M|    19M|            |          |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD NON UNIQUE| T1_I1    |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |    10M|    19M|            |          |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |    10M|    19M|  2158   (4)| 00:00:11 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------

Note
-----
   - estimated index size: 243M bytes

INDEX_NAME           DEGREE                                   LEAF_BLOCKS   NUM_ROWS
-------------------- ---------------------------------------- ----------- ----------
T1_I1                4                                                562     310000

Although the plan says it’s going to run parallel, and even though the index says it’s a parallel index, we don’t have to believe that the creation ran as a parallel task – so let’s check v$pq_tqstat, the “parallel query table queue” statistics.
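Here’s a minimal sketch of the sort of query I run against the view immediately after the statement completes (my full script adds column and break formatting, which I’ve omitted here):

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
;

And this is the result I got: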


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12        528          4          0           0
                      Producer               1 P004               2786931   39161903          9          1           0
                                             1 P005               2422798   34045157         11          1           0
                                             1 P006               2359251   33152158         12          1           0
                                             1 P007               2431032   34160854         14          2           0
                      Consumer               1 P000               3153167   44520722          3          0           0
                                             1 P001               1364146   19126604          4          1           0
                                             1 P002               2000281   28045742          3          0           0
                                             1 P003               3482406   48826476          3          0           0

                    1 Producer               1 P000                     1        298          0          0           0
                                             1 P001                     1        298          0          0           0
                                             1 P002                     1        298          0          0           0
                                             1 P003                     1         48          0          0           0
                      Consumer               1 QC                       4       1192          2          0           0

Check the num_rows column – the first set of slaves distributed 10M rows and roughly 140MB of data to the second set of slaves – and we know that most of those rows will hold (null, rowid) which are not going to go into the index. 97% of the data that went through the message queues would have been thrown away by the second set of slaves, and “should” have been discarded by the first set of slaves.

As for the original question about the index not being built in parallel – maybe it was, but not very parallel. You’ll notice that the parallel distribution at operation 6 in the plan is “RANGE”. If 97% of your data is null and only 3% of it is going to end up in the index then you’d need to run at a degree higher than 33 before the real data would be spread across more than one slave – because at parallel 33 just one slave in the second set will get all the real data and do all the work of sorting and building the index, while the other slaves will (or ought to) be simply throwing their data away as it arrives. When you’ve got 500M rows with only 17M non-null entries to deal with (as the OP had), the only thing still happening by the time you get round to looking may be the one slave that’s building a 17M entry index.

Of course, one of the reasons I wanted to look at the row distribution in v$pq_tqstat was that I wanted to check whether I was going to see all the data going to one slave, or a spread across 2 slaves (Noes to the left, Ayes to the right – as they used to say in the UK House of Commons), or whether Oracle had been very clever and decided to distribute the rows by key value combined with rowid to get a nearly even spread. I’ll have to set up a different test case to check whether that last option is possible.

Footnote

There was another little oddity that might be a simpler explanation of why the OP’s index creation might actually have run serially. I dropped and recreated the index in my test case several times and at one point I noticed (from view v$pq_slave) that I had 16 slave processes live (though, at that point, IDLE). Since I was the only user of the instance my session should probably have been re-using the same set of slaves each time I ran the test; instead, at some point, one of my test runs had started up a new set of slaves. Possibly something similar had happened to the OP, and over the course of building several indexes one after the other his session had reached the stage where it tried to start “yet another” set of slaves, failed, and decided to run serially rather than reuse any of the slaves that were nominally available and IDLE.
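A query as simple as the following sketch is enough to keep an eye on how many slave processes are live and whether they are busy or idle:

select
        slave_name, status, sessions
from
        v$pq_slave
order by
        slave_name
;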

Update

It gets worse. I decided to query v$px_sesstat (joined to v$statname) while the index build was running, and caught some statistics just before the build completed.
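Here’s a sketch of the sort of join I used – the list of statistics and the formatting are simplified compared to my actual script:

select
        st.qcsid, st.server_group, st.degree, st.req_degree,
        st.server_set, st.server#, st.sid,
        sn.name, st.value
from
        v$px_sesstat    st,
        v$statname      sn
where
        sn.statistic# = st.statistic#
and     st.value != 0
and     sn.name in (
                'physical writes direct',
                'physical writes direct temporary tablespace',
                'sorts (memory)', 'sorts (disk)', 'sorts (rows)'
        )
order by
        st.qcsid, st.server_set, st.server#, sn.name
;

Here are a few critical numbers taken from the 4 sessions that received the 10M rows and built the final index: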

Coord   Grp Deg    Set  Sno   SID
264/1     1 4/4      1    1   265
---------------------------------
            physical writes direct                            558
            sorts (memory)                                      1
            sorts (rows)                                2,541,146

264/1     1 4/4      1    2    30
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,218,809

264/1     1 4/4      1    3    35
---------------------------------
            physical writes direct                          7,110
            physical writes direct temporary tablespace     7,110
            sorts (disk)                                        1
            sorts (rows)                                2,886,184

264/1     1 4/4      1    4   270
---------------------------------
            sorts (memory)                                      1
            sorts (rows)                                2,353,861

Not only did Oracle pass 10M rows from one slave set to the other, the receiving slave set sorted those rows before discarding them. One of the slaves even ran short of memory and had to spill its sort to disc. And we can see (physical writes direct = 558) that just one slave was responsible for handling all the “real” data for that index.

 

Update 2

A couple of follow-ups on the thread have introduced some other material that’s worth reading: an item from Mohamed Houri about what happens when a parallel slave is still assigned to an executing statement but isn’t given any work to do for a long time, and an item from Stefan Koehler about _px_trace and tracking down why the degree of parallelism of a statement was downgraded.

May 12, 2015

Parallel Query

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 7:22 pm BST May 12,2015

According to the Oracle Database VLDB and Partitioning Guide (10g version and 11g version):

A SELECT statement can be executed in parallel only if the following conditions are satisfied:

  • The query includes a parallel hint specification (PARALLEL or PARALLEL_INDEX) or the schema objects referred to in the query have a PARALLEL declaration associated with them.
  • At least one table specified in the query requires one of the following:
    • A full table scan
    • An index range scan spanning multiple partitions
  • No scalar subqueries are in the SELECT list.

Note, particularly, that last restriction. I was looking at a query recently that seemed to be breaking this rule so, after examining the 10053 trace file for a while, I decided that I would construct a simplified model of the client’s query to demonstrate how the manuals can tell you the truth while being completely deceptive or (conversely) be wrong while still giving a perfectly correct impression. So here’s a query, with execution plan, from 11.2.0.4:

select
        /*+ parallel(t1 2) */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 15700 |  1340   (3)| 00:00:07 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   2 |   TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  3 |    INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   4 |  SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|   5 |   TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|*  6 |    INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|   7 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   8 |   PX SEND QC (RANDOM)        | :TQ10003 |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | P->S | QC (RAND)  |
|*  9 |    HASH JOIN                 |          |   100 | 15700 |  1340   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 10 |     HASH JOIN                |          |   100 | 14700 |  1317   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|* 11 |      HASH JOIN               |          |   100 | 13300 |  1294   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  12 |       BUFFER SORT            |          |       |       |            |          |  Q1,03 | PCWC |            |
|  13 |        PX RECEIVE            |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |         PX SEND BROADCAST    | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 15 |          TABLE ACCESS FULL   | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  16 |       PX BLOCK ITERATOR      |          |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWC |            |
|* 17 |        TABLE ACCESS FULL     | T1       |   100 | 12000 |  1290   (3)| 00:00:07 |  Q1,03 | PCWP |            |
|  18 |      BUFFER SORT             |          |       |       |            |          |  Q1,03 | PCWC |            |
|  19 |       PX RECEIVE             |          | 10000 |   136K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  20 |        PX SEND BROADCAST     | :TQ10001 | 10000 |   136K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  21 |         TABLE ACCESS FULL    | T2       | 10000 |   136K|    23   (5)| 00:00:01 |        |      |            |
|  22 |     BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|  23 |      PX RECEIVE              |          | 10000 |    97K|    23   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  24 |       PX SEND BROADCAST      | :TQ10002 | 10000 |    97K|    23   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  25 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    23   (5)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("N1"=:B1)
   6 - access("N1"=:B1)
   9 - access("T3"."ID"="T1"."R3")
  10 - access("T2"."ID"="T1"."R2")
  11 - access("T1"."N1"="D1"."ID")
  15 - filter("D1"."N1"=1)
  17 - filter("T1"."N2"=10 AND "T1"."N3"=10)

Thanks to my hint the query has been given a parallel execution plan – and a check of v$pq_tqstat after running the query showed that it had run parallel. Note, however, where the PX SEND QC and PX COORDINATOR operations appear – lines 7 and 8, and above those lines we see the two scalar subqueries.

This means we’re running the basic select statement as a parallel query but the query co-ordinator has serialised on the scalar subqueries in the select list. Is the manual “right but deceptive” or “wrong but giving the right impression”? Serialising on (just) the scalar subqueries can have a huge impact on performance and effectively make the query behave like a serial query even though, technically, the statement has run as a parallel query.

You may recall that an example of this type of behaviour, and its side effects when the scalar subqueries executed independently as parallel queries, showed up some time ago. At the time I said I would follow up with a note about the change in behaviour in 12c; this seems to be an appropriate moment to show the 12c plan(s), first the default:


----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |   100 | 19100 |  1364   (3)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10005 |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED        |          |   100 | 19100 |  1364   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  4 |     HASH JOIN OUTER          |          |   100 | 18100 |  1340   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100 | 16400 |  1335   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  6 |       HASH JOIN OUTER        |          |   100 | 15000 |  1311   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|*  7 |        HASH JOIN             |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|   8 |         PX RECEIVE           |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,05 | PCWP |            |
|   9 |          PX SEND BROADCAST   | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | S->P | BROADCAST  |
|  10 |           PX SELECTOR        |          |       |       |            |          |  Q1,00 | SCWC |            |
|* 11 |            TABLE ACCESS FULL | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,00 | SCWP |            |
|  12 |         PX BLOCK ITERATOR    |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWC |            |
|* 13 |          TABLE ACCESS FULL   | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,05 | PCWP |            |
|  14 |        PX RECEIVE            |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  15 |         PX SEND BROADCAST    | :TQ10001 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | S->P | BROADCAST  |
|  16 |          PX SELECTOR         |          |       |       |            |          |  Q1,01 | SCWC |            |
|  17 |           VIEW               | VW_SSQ_1 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  18 |            HASH GROUP BY     |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,01 | SCWC |            |
|  19 |             TABLE ACCESS FULL| REF2     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,01 | SCWP |            |
|  20 |       PX RECEIVE             |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  21 |        PX SEND BROADCAST     | :TQ10002 | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | S->P | BROADCAST  |
|  22 |         PX SELECTOR          |          |       |       |            |          |  Q1,02 | SCWC |            |
|  23 |          TABLE ACCESS FULL   | T2       | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,02 | SCWP |            |
|  24 |      PX RECEIVE              |          |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,05 | PCWP |            |
|  25 |       PX SEND BROADCAST      | :TQ10003 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | S->P | BROADCAST  |
|  26 |        PX SELECTOR           |          |       |       |            |          |  Q1,03 | SCWC |            |
|  27 |         VIEW                 | VW_SSQ_2 |  1000 | 17000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  28 |          HASH GROUP BY       |          |  1000 | 10000 |     5  (20)| 00:00:01 |  Q1,03 | SCWC |            |
|  29 |           TABLE ACCESS FULL  | REF1     |  1000 | 10000 |     4   (0)| 00:00:01 |  Q1,03 | SCWP |            |
|  30 |     PX RECEIVE               |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,05 | PCWP |            |
|  31 |      PX SEND BROADCAST       | :TQ10004 | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | S->P | BROADCAST  |
|  32 |       PX SELECTOR            |          |       |       |            |          |  Q1,04 | SCWC |            |
|  33 |        TABLE ACCESS FULL     | T3       | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,04 | SCWP |            |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T3"."ID"="T1"."R3")
   4 - access("ITEM_2"(+)="T2"."N21")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("ITEM_1"(+)="T1"."R1")
   7 - access("T1"."N1"="D1"."ID")
  11 - filter("D1"."N1"=1)
  13 - filter("T1"."N2"=10 AND "T1"."N3"=10)

The first thing to note is the location of the PX SEND QC and PX COORDINATOR operations – right at the top of the plan: there’s no serialisation at the query coordinator. Then we spot the views at operations 17 and 27 – VW_SSQ_1 and VW_SSQ_2 (SSQ presumably standing for “scalar subquery”). The optimizer has unnested the scalar subqueries out of the select list into the join. When a scalar subquery in the select list returns no data its value is deemed to be NULL, so the joins (operations 4 and 6) have to be outer joins.

You’ll notice that there are a lot of PX SELECTOR operations – each feeding a PX SEND BROADCAST operation that reports its “IN-OUT” column as S->P (i.e. serial to parallel). Historically a serial to parallel operation started with the query coordinator doing the serial bit, but in 12c the optimizer can dictate that one of the PX slaves should take on that task (see Randolf Geist’s post here). Again my code to report v$pq_tqstat confirmed this behaviour, in a way that we shall see shortly.

This type of unnesting is a feature of 12c and in some cases will be very effective. It is a cost-based decision, though, and the optimizer can make mistakes; fortunately we can control the feature. We could simply set optimizer_features_enable back to 11.2.0.4 (perhaps through the hint) and this would take us back to the original plan, but this isn’t the best option in this case. There is a hidden parameter _optimizer_unnest_scalar_sq enabling the feature, so we could, in principle, disable just this one feature by setting that parameter to false; a more appropriate strategy would simply be to tell the optimizer that it should not unnest the subqueries. In my case I could put the /*+ no_unnest */ hint into both subqueries, or use the qb_name() hint to give the two subquery blocks names and then use the /*+ no_unnest(@my_qb_name) */ form of the hint at the top of the main query.
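Here, as a sketch, is what the hinted version might look like – the query block names ssq1 and ssq2 are labels I’ve invented for this illustration, and the commented-out alter session shows the hidden-parameter alternative:

-- alter session set "_optimizer_unnest_scalar_sq" = false;

select
        /*+
                parallel(t1 2)
                no_unnest(@ssq1)
                no_unnest(@ssq2)
        */
        d1.small_vc,
        t1.r1,
        t2.n21,
        t2.v21,
        t3.v31,
        (select /*+ qb_name(ssq1) */ max(v1) from ref1 where n1 = t2.n21)    ref_t2,
        (select /*+ qb_name(ssq2) */ max(v1) from ref2 where n1 = t1.r1)     ref_t1,
        t1.padding
from
        driver          d1,
        t1, t2, t3
where
        d1.n1 = 1
and     t1.n1 = d1.id
and     t1.n2 = 10
and     t1.n3 = 10
and     t2.id = t1.r2
and     t3.id = t1.r3
;

Here’s the execution plan I get whether I use the hidden parameter or the /*+ no_unnest */ mechanism: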

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |       |       |  1554 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                 |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10003 |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    EXPRESSION EVALUATION        |          |       |       |            |          |  Q1,03 | PCWC |            |
|*  4 |     HASH JOIN                   |          |   100 | 15700 |  1354   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  5 |      HASH JOIN                  |          |   100 | 14700 |  1330   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|*  6 |       HASH JOIN                 |          |   100 | 13300 |  1306   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        BUFFER SORT              |          |       |       |            |          |  Q1,03 | PCWC |            |
|   8 |         PX RECEIVE              |          |   100 |  1300 |     4   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   9 |          PX SEND BROADCAST      | :TQ10000 |   100 |  1300 |     4   (0)| 00:00:01 |        | S->P | BROADCAST  |
|* 10 |           TABLE ACCESS FULL     | DRIVER   |   100 |  1300 |     4   (0)| 00:00:01 |        |      |            |
|  11 |        PX BLOCK ITERATOR        |          |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWC |            |
|* 12 |         TABLE ACCESS FULL       | T1       |   100 | 12000 |  1302   (3)| 00:00:01 |  Q1,03 | PCWP |            |
|  13 |       BUFFER SORT               |          |       |       |            |          |  Q1,03 | PCWC |            |
|  14 |        PX RECEIVE               |          | 10000 |   136K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  15 |         PX SEND BROADCAST       | :TQ10001 | 10000 |   136K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  16 |          TABLE ACCESS FULL      | T2       | 10000 |   136K|    24   (5)| 00:00:01 |        |      |            |
|  17 |      BUFFER SORT                |          |       |       |            |          |  Q1,03 | PCWC |            |
|  18 |       PX RECEIVE                |          | 10000 |    97K|    24   (5)| 00:00:01 |  Q1,03 | PCWP |            |
|  19 |        PX SEND BROADCAST        | :TQ10002 | 10000 |    97K|    24   (5)| 00:00:01 |        | S->P | BROADCAST  |
|  20 |         TABLE ACCESS FULL       | T3       | 10000 |    97K|    24   (5)| 00:00:01 |        |      |            |
|  21 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  22 |      TABLE ACCESS BY INDEX ROWID| REF1     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 23 |       INDEX UNIQUE SCAN         | R1_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
|  24 |     SORT AGGREGATE              |          |     1 |    10 |            |          |        |      |            |
|  25 |      TABLE ACCESS BY INDEX ROWID| REF2     |     1 |    10 |     2   (0)| 00:00:01 |        |      |            |
|* 26 |       INDEX UNIQUE SCAN         | R2_PK    |     1 |       |     1   (0)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T3"."ID"="T1"."R3")
   5 - access("T2"."ID"="T1"."R2")
   6 - access("T1"."N1"="D1"."ID")
  10 - filter("D1"."N1"=1)
  12 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."N2"=10 AND "T1"."N3"=10))
  23 - access("N1"=:B1)
  26 - access("N1"=:B1)

Note particularly that the PX SEND QC and PX COORDINATOR operations are operations 2 and 1, and we have a new operator, EXPRESSION EVALUATION, at operation 3. This has three child operations – the basic select starting at operation 4, and the two scalar subqueries starting at lines 21 and 24. We are operating the scalar subqueries as correlated subqueries, but we don’t leave all the work to the query coordinator – each slave is running its own subqueries before forwarding the final result to the coordinator. There is a little side effect that goes with this change – the “serial to parallel” operations are now, as they always used to be, driven by the query co-ordinator, and the PX SELECTOR operations have disappeared.

And finally

Just to finish off, let’s take a look at the results from v$pq_tqstat in 12.1.0.2. First from the default plan with the PX SELECTOR operations. Remember that this turned into a five table join where two of the “tables” were non-correlated aggregate queries against the reference tables.


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                   200       2428          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                   100       1238         59         27           0
                                             1 P001                   100       1238         41         24           0

                    1 Producer               1 P002                  2000      23830          0          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                  1000      11939         57         26           0
                                             1 P001                  1000      11939         41         24           0

                    2 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     339732          0          0           0
                      Consumer               1 P000                 10000     169890         49         22           0
                                             1 P001                 10000     169890         31         21           0

                    3 Producer               1 P002                     0         48          0          0           0
                                             1 P003                  2000      23830          0          0           0
                      Consumer               1 P000                  1000      11939         58         26           0
                                             1 P001                  1000      11939         38         23           0

                    4 Producer               1 P002                     0         48          0          0           0
                                             1 P003                 20000     239986          0          0           0
                      Consumer               1 P000                 10000     120017         50         22           0
                                             1 P001                 10000     120017         34         21           0

                    5 Producer               1 P000                     1        169          0          0           0
                                             1 P001                     1        169          1          0           0
                      Consumer               1 QC                       2        338          3          0           0

As you read down the table queues you can see that in the first five table queues (0 – 4) we seem to operate parallel to parallel, but only one of the two producers (p002 and p003) produces any data at each stage. A more traditional plan would show QC as the single producer in each of these stages.

Now with scalar subquery unnesting blocked – the plan with the three table join and EXPRESSION EVALUATION – we see the more traditional serial to parallel, the producer is QC in all three of the first table queues (the full scan and broadcast of tables t1, t2, and t3).

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 QC                     200       1726          0          0           0
                      Consumer               1 P000                   100       1614         28         15           0
                                             1 P001                   100       1614         34         13           0

                    1 Producer               1 QC                   20000     339732          0          0           0
                      Consumer               1 P000                 10000     169866         19         10           0
                                             1 P001                 10000     169866         25          8           0

                    2 Producer               1 QC                   20000     239986          0          0           0
                      Consumer               1 P000                 10000     119993         23         11           0
                                             1 P001                 10000     119993         31         11           0

                    3 Producer               1 P000                     1        155          1          0           0
                                             1 P001                     1        155          0          0           0
                      Consumer               1 QC                       2        310          3          1           0

It’s an interesting point that this last set of results is identical to the set produced in 11g – you can’t tell from v$pq_tqstat whether the parallel slaves or the query co-ordinator executed the subqueries – you have to look at the output from SQL trace (or similar) and check the individual Rowsource Execution Statistics for the slaves and the coordinator to see which process actually ran the subqueries.
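If you do want to make that check, here’s a sketch of one possible approach: enable extended SQL trace in the session before running the query (the event setting is inherited by the PX slaves), then examine the STAT lines in each trace file. Note that the tracefile_identifier only marks the co-ordinator’s file; the slave trace files have to be found by timestamp or content:

alter session set tracefile_identifier = 'px_ssq';
alter session set events '10046 trace name context forever, level 8';

-- run the test query here, then:

alter session set events '10046 trace name context off';

-- the query co-ordinator and each PX slave writes its own trace file
-- in the diagnostic trace directory; the STAT lines show which process
-- executed the scalar subqueries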

 

May 11, 2015

Parallel Execution

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:16 am BST May 11,2015

This is another little reference list I should have created some time ago. It covers a series of posts on interpreting parallel execution plans and understanding where the work happens.

I may add further links to this page in the future relating to other aspects of parallel execution.

 

February 12, 2015

Parallel rownum

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:27 am BST Feb 12,2015

It’s easy to make mistakes, or overlook defects, when constructing parallel queries – especially if you’re a developer who hasn’t been given the right tools to make it easy to test your code. Here’s a little trap I came across recently that’s probably documented somewhere, which could be spotted easily if you had access to the OEM SQL Monitoring screen, but would be very easy to miss if you didn’t check the execution plan very carefully. I’ll start with a little script to generate some data:


create table t1 nologging
as
select * from all_objects where rownum <= 50000
;

insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;
insert /*+ append */ into t1 select * from t1;
commit;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

create table t2 as select * from t1;
alter table t2 add id number(10,0);

All I’ve done is create some data – 800,000 rows – and then create a table to copy it to; and while I copy it I’m going to add a temporary id to the rows, which I’ll do with a call to rownum; and since there’s a lot of data I’ll use parallel execution:


alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.* , rownum from t1;

For the purposes of experiment and entertainment I’ve done something a little odd by supplying two possible degrees of parallelism, but this lets me ask the question: will this statement run parallel 3, parallel 4, both of the above, or neither? (You may assume that I have parallel execution slaves available when the statement runs.)

The answer is both – because that rownum does something nasty to the execution plan (I didn’t include the 50,000 limit in my first test, which is why the plan reports 993K rows instead of 800,000):


--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT           |          |   993K|    92M|  1076   (1)| 00:00:13 |        |      |            |
|   1 |  PX COORDINATOR            |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ20001 |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT          | T2       |       |       |            |          |  Q2,01 | PCWP |            |
|   4 |     PX RECEIVE             |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q2,01 | PCWP |            |
|   5 |      PX SEND ROUND-ROBIN   | :TQ20000 |   993K|    92M|  1076   (1)| 00:00:13 |        | S->P | RND-ROBIN  |
|   6 |       COUNT                |          |       |       |            |          |        |      |            |
|   7 |        PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |         PX SEND QC (RANDOM)| :TQ10000 |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | P->S | QC (RAND)  |
|   9 |          PX BLOCK ITERATOR |          |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T1       |   993K|    92M|  1076   (1)| 00:00:13 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------

See that “P->S” (parallel to serial) at operation 8. The select statement runs in parallel (degree 4) to scan the data, and then sends it all to the query co-ordinator to supply the rownum; then the query co-ordinator re-distributes the data (including the rownum) to another set of slaves (S->P) to do the parallel (degree 3) insert. The P->S at line 2 shows the parallel execution slaves passing details to the query co-ordinator of the private segments that they have created, so that the query co-ordinator can stitch the segments together into a single data segment for the table. (If you watch closely you’ll see the query co-ordinator doing a few local writes as it tidies up the header blocks of those segments.)

There are two threats in this rownum detail. The first, of course, is that the operation essentially serialises through the query co-ordinator, so it’s going to take longer than you might expect; secondly, an accident of this type is typically going to allocate twice as many parallel execution slaves as you might have expected – the select and the insert are two separate data flow operations (note how the Name column shows :TQ1xxxx and :TQ2xxxx), each gets its own slave set, and both sets of slaves are held for the duration of the statement. If this statement is demanding twice the slaves it should be using, then you may find that other statements that start running at the same time get their degree of parallelism downgraded because you’ve run out of PX slaves. Although the rownum solution is nice and clean – it requires no further infrastructure – you probably need to introduce a sequence (with a large cache) to get the same effect without losing parallelism; see the sketch below.
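For example (a sketch – t2_seq is a name invented for this illustration, and the cache size would need to be chosen to match the load; the values supplied by nextval will be unique but won’t arrive in any tidy order across the slaves):

create sequence t2_seq cache 10000;

alter session enable parallel dml;

insert /*+ parallel(t2 3) */ into t2
select /*+ parallel(t1 4) */ t1.*, t2_seq.nextval from t1;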

If you look at v$pq_tqstat after running this statement the results are a little disappointing – there are a few problems connecting lines from the plan with rows in the view – here’s my original output (and you’ll now see why I chose to have two different degrees of parallelism):


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0

         2          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                                             1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 QC                  993314  113479500        125         65           0

                    1 Consumer               1 QC                       3        393          2          1           0

The first problem is that the DFO_number reported in the view doesn’t match with the :TQ1xxxx and :TQ2xxxx reported in the plan – the parallel 4 bit is the select, which is covered by :TQ10000, but it’s listed under DFO_Number = 2 in the view, and the insert is the parallel 3 bit, which is covered by :TQ20000 and :TQ20001 but listed under DFO_Number = 1.

More confusing, potentially, is that all the appearances of the query coordinator have been assigned to DFO_Number = 2. Ignoring the fact that the DFO_Number column switches the 1 and 2 from the plan, what we should see is as follows:

  • The consumer at line 16 is consuming from the 4 producers at lines 11 – 14.
  • The producer at line 15 is producing FOR the 3 consumers at lines 3 – 5
  • The consumer at line 18 is consuming from the producers at lines 7 – 9

Ideally (including the correction for the DFO_Number) I think the view content should be as follows:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P003                247652   28380762         13          0           0
                                             1 P004                228857   26200574         13          1           0
                                             1 P005                267348   30496182         14          0           0
                                             1 P006                249457   28401982         13          0           0
                      Consumer               1 QC                  993314  113479500        125         65           0

         2          0 Producer               1 QC                  993314  119428086 4294967269 4294967286           0
                      Consumer               1 P000                331330   39834186         74         71           0
                                             1 P001                331331   39844094         75         72           0
                                             1 P002                330653   39749806         74         71           0

                    1 Producer               1 P000                     1        131       2263        396           0
                                             1 P001                     1        131       2238        417           0
                                             1 P002                     1        131       2182        463           0
                      Consumer               1 QC                       3        393          2          1           0

Just don’t ask me why the waits and timeouts for the QC as producer seem to be counting backwards from 2^32.

November 21, 2014

Parallel Costs

Filed under: CBO,Oracle,Parallel Execution — Jonathan Lewis @ 11:36 am BST Nov 21,2014

While creating a POC of a SQL rewrite recently I received a little surprise as I switched my query from serial execution to parallel execution and saw the optimizer’s estimated cost increase dramatically. I’ll explain why in a moment, but it made me think it might be worth setting up a very simple demonstration of the anomaly. I created a table t1 by copying view all_source – which happened to give me a table with about 100,000 rows and 1117 blocks – and then ran the query ‘select max(line) from t1;’, repeating the query with a /*+ parallel(t1 2) */ hint.
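Reconstructed from that description, the test was essentially this (a sketch – I haven’t shown the stats collection, and the original table creation may have differed in detail):

create table t1 as select * from all_source;

select max(line) from t1;

select /*+ parallel(t1 2) */ max(line) from t1;

From 11.2.0.4 here are the two execution plans I got: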


---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   153   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 99173 |   387K|   153   (4)| 00:00:01 |
---------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     4 |  1010   (1)| 00:00:05 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     4 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     4 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     4 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 99173 |   387K|  1010   (1)| 00:00:05 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

It seems that the optimizer thinks that running the query parallel 2 will take five times as long as running it serially! Is it, perhaps, some cunning fix to the optimizer that is trying to allow for process startup time for the parallel execution slaves? Is it a bug? No – it’s just that I hadn’t got around to setting my system stats and they were complete garbage thanks to various other tests I had been running over the previous couple of weeks. Critically, I had not adjusted the “parallel-oriented” statistics to be consistent with the “serial-oriented” stats.

Here, from the 10053 trace file for the parallel run, is the section on the Single Table Access costing, together with the system stats and base statistical information:


SYSTEM STATISTICS INFORMATION
-----------------------------
  Using WORKLOAD Stats
  CPUSPEED: 1110 millions instructions/sec
  SREADTIM: 4.540000 milliseconds
  MREADTIM: 12.440000 millisecons
  MBRC:     21 blocks
  MAXTHR:   3000000 bytes/sec
  SLAVETHR: 1000000 bytes/sec

***************************************
BASE STATISTICAL INFORMATION
***********************
Table Stats::
  Table: T1  Alias: T1
    #Rows: 99173  #Blks:  1117  AvgRowLen:  76.00  ChainCnt:  0.00
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  Table: T1  Alias: T1
    Card: Original: 99173.000000  Rounded: 99173  Computed: 99173.00  Non Adjusted: 99173.00
  Access Path: TableScan
    Cost:  151.13  Resp: 1010.06  Degree: 0
      Cost_io: 147.00  Cost_cpu: 20826330
      Resp_io: 1007.76  Resp_cpu: 11570183

I’m going to walk through the optimizer’s calculations that got the serial I/O cost (cost_io: 147.00) and the parallel I/O cost (Resp_io: 1007.76), but before I do that I’ll point out how inconsistent some of the system stats are. The multiblock read time (mreadtim) is 12.44 milliseconds, to read an average of 21 blocks (mbrc) which, converting to bytes per second, means (21 * 8192) * 1000/12.44 = 13,828,938 bytes per second; but the I/O rate for a single parallel execution slave (slavethr) is only 1,000,000 bytes per second – which means a single serial session can (apparently) tablescan nearly 14 times as fast as an individual parallel execution slave. It’s not surprising that somehow the optimizer thinks a serial tablescan will be faster than parallel 2 – but let’s check exactly how the arithmetic falls out.

Serial:

  • #Blks: 1117, MBRC = 21 => read requests = 1117/21 = 53.19
  • sreadtim = 4.54 milliseconds, mreadtim = 12.44 milliseconds = 2.74 * sreadtim
  • Cost = 53.19 * 2.74 + 1 (_tablescan_cost_plus_one = true) = 146.74 — which looks close enough.

Parallel:

  • #Blks: 1117, block size = 8KB => total I/O requirement = 9,150,464 bytes
  • slavethr: 1,000,000 bytes/sec, degree 2 => total throughput 2,000,000 bytes/sec => elapsed I/O time = 4.575232 seconds
  • sreadtim = 4.54 milliseconds => cost = 4.575232 / 0.00454 = 1007.76 QED.

Two final thoughts:

First, if I increase the degree of parallelism to 3 the cost drops to 673 (671.84 plus a bit of CPU); if I increase the degree any further the cost doesn’t drop any further – that’s because the maxthr (maximum throughput) is 3,000,000. The optimizer uses the smaller of maxthr and (degree * slavethr) in the parallel arithmetic.
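As a cross-check on that claim, repeat the parallel arithmetic at degree 3:

  • throughput = least(maxthr, 3 * slavethr) = least(3,000,000, 3,000,000) = 3,000,000 bytes/sec
  • elapsed I/O time = 9,150,464 / 3,000,000 = 3.050155 seconds
  • cost = 3.050155 / 0.00454 = 671.84 – matching the I/O component reported above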

Secondly, and most importantly, the way the optimizer behaves will be affected by the various parameters relating to parallelism, so here’s a list of the relevant settings on the instance I was using when creating this example:


SQL> show parameter parallel

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      LOW
parallel_adaptive_multi_user         boolean     TRUE
parallel_automatic_tuning            boolean     FALSE
parallel_degree_limit                string      CPU
parallel_degree_policy               string      MANUAL
parallel_execution_message_size      integer     16384
parallel_force_local                 boolean     FALSE
parallel_instance_group              string
parallel_io_cap_enabled              boolean     FALSE
parallel_max_servers                 integer     80
parallel_min_percent                 integer     0
parallel_min_servers                 integer     0
parallel_min_time_threshold          string      AUTO
parallel_server                      boolean     FALSE
parallel_server_instances            integer     1
parallel_servers_target              integer     32
parallel_threads_per_cpu             integer     2
recovery_parallelism                 integer     0

Note, particularly, that I have not enabled parallel_automatic_tuning.

For further details on parallel costs and the cost/time equivalence that the optimizer uses in recent versions, here’s a link to an article by Randolf Geist.

November 12, 2014

Parallel Fun

Filed under: Execution plans,Oracle,Parallel Execution,subqueries — Jonathan Lewis @ 4:42 pm BST Nov 12,2014

As I write, there’s an ongoing thread on Oracle-L that started with the (paraphrased) question: “I’ve got this query that returns 7 million rows; when I change it to ‘select count(*)’ it returns in 4 seconds but when we display the full result set on screen it takes hours, and every second or two the screen pauses; how do I make it go faster.”

The general rapid response was: “You shouldn’t be running 7M rows to a screen – the time is the time for the network traffic and display.”

The first part of the statement is right – the second part is quite likely to be wrong and there’s a very strong hint in the question that makes me say that, it’s the “pauses every second or two”. Of course we don’t know what the OP isn’t telling us, and we don’t know how accurate he is in what he is telling us, so any ideas we have may be completely wrong. For example, we haven’t been given any idea of how long a “pause” is, we don’t really know how accurate that “second or two” might be and whether “every” is an exaggeration, and maybe the query is returning CLOB columns (and that could make a big difference to what you can do to improve performance).

If we take the statement at face value, though, there is one very obvious inference: although some of the time will be due to network traffic time, most of the time is probably due to Oracle doing something expensive for a significant fraction of the rows returned. The pattern of activity probably looks like this:

  • client: call server to fetch next array of rows
  • server: spend some time populating array  — this is where the client sees a pause
  • client: display result array
  • client: call server to fetch next array of rows
  •  etc…

Here’s a trivial example:

connect / as sysdba
set arraysize 500
set pagesize 40

select
        o1.spare1 ,
        (
        select  max((ctime))
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;

On my laptop, running an instance of 11.2.0.4 with about 80,000 rows in obj$ (and a lot of them owned by SYS), I can count seconds and find that (approximately) I alternate between one second watching results scrolling up the screen and one second waiting as the server generates the next 500 rows.

Of course it’s possible to argue that the problem really is the network and nothing but the network struggling to cope with the never-ending stream of little packets produced by 7M rows. Could there be a choke point that causes the data to stop and start with great regularity? Maybe – but previous experience says probably not. I have experienced bad network problems in the past, but when they’ve occurred I’ve always observed extremely random stop/go behaviour. The regularity implied in the question makes an Oracle-based problem seem far more likely.

Conveniently a couple of people asked for more clues – like the query text and the execution plan; even more conveniently the OP supplied the answers in this response. Since the email format makes them a little hard to read I’ve copied them here:


SELECT  bunch of stuff.....,

        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr1.RELATED_SID
                        ||
                        ',')
                ORDER BY sr1.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr1
                WHERE   sr1.SID                    = slv.SID
                        AND sr1.RELATIONSHIP_LEVEL = '1'
                GROUP BY sr1.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL1,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr2.RELATED_SID
                        ||
                        ',')
                ORDER BY sr2.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr2
                WHERE   sr2.SID                    = slv.SID
                        AND sr2.RELATIONSHIP_LEVEL = '2'
                GROUP BY sr2.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL2,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr3.RELATED_SID
                        ||
                        ',')
                ORDER BY sr3.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr3
                WHERE   sr3.SID                    = slv.SID
                        AND sr3.RELATIONSHIP_LEVEL = '3'
                GROUP BY sr3.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL3,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr4.RELATED_SID
                        ||
                        ',')
                ORDER BY sr4.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr4
                WHERE   sr4.SID                    = slv.SID
                        AND sr4.RELATIONSHIP_LEVEL = '4'
                GROUP BY sr4.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL4,
        (
                SELECT  RTRIM(XMLSERIALIZE(CONTENT EXTRACT( XMLAGG(XMLELEMENT("e", sr5.RELATED_SID
                        ||
                        ',')
                ORDER BY sr5.RELATED_SID), '//text()' ) ) , ',' )
                FROM    service_relationship sr5
                WHERE   sr5.SID                    = slv.SID
                        AND sr5.RELATIONSHIP_LEVEL = '5'
                GROUP BY sr5.SID
        ) AS RELATEDSERVICEINSTANCEIDLEVEL5
FROM    service_lookup slv
        LEFT JOIN service_location sl
        ON      sl.service_location_id = slv.service_location_id;

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1570133209

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |  7331K|  5593M|  1877   (5)| 00:00:01 |        |      |            |
|   1 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   2 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10000             |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWC |            |
|*  5 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q1,00 | PCWP |            |
|   6 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|   7 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|   8 |    PX SEND QC (RANDOM)           | :TQ20000             |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   9 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWC |            |
|* 10 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q2,00 | PCWP |            |
|  11 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  12 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  13 |    PX SEND QC (RANDOM)           | :TQ30000             |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | P->S | QC (RAND)  |
|  14 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWC |            |
|* 15 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q3,00 | PCWP |            |
|  16 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  17 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  18 |    PX SEND QC (RANDOM)           | :TQ40000             |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | P->S | QC (RAND)  |
|  19 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWC |            |
|* 20 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q4,00 | PCWP |            |
|  21 |  SORT GROUP BY                   |                      |     1 |    22 |   368   (6)| 00:00:01 |        |      |            |
|  22 |   PX COORDINATOR                 |                      |       |       |            |          |        |      |            |
|  23 |    PX SEND QC (RANDOM)           | :TQ50000             |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | P->S | QC (RAND)  |
|  24 |     PX BLOCK ITERATOR            |                      |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWC |            |
|* 25 |      TABLE ACCESS STORAGE FULL   | SERVICE_RELATIONSHIP |    25 |   550 |   368   (6)| 00:00:01 |  Q5,00 | PCWP |            |
|  26 |  PX COORDINATOR                  |                      |       |       |            |          |        |      |            |
|  27 |   PX SEND QC (RANDOM)            | :TQ60002             |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | P->S | QC (RAND)  |
|* 28 |    HASH JOIN RIGHT OUTER BUFFERED|                      |  7331K|  5593M|  1877   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  29 |     PX RECEIVE                   |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,02 | PCWP |            |
|  30 |      PX SEND HASH                | :TQ60000             |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | P->P | HASH       |
|  31 |       PX BLOCK ITERATOR          |                      |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWC |            |
|  32 |        TABLE ACCESS STORAGE FULL | SERVICE_LOCATION     |  3175K|   920M|   366   (3)| 00:00:01 |  Q6,00 | PCWP |            |
|  33 |     PX RECEIVE                   |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,02 | PCWP |            |
|  34 |      PX SEND HASH                | :TQ60001             |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | P->P | HASH       |
|  35 |       PX BLOCK ITERATOR          |                      |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWC |            |
|  36 |        TABLE ACCESS STORAGE FULL | SERVICE_LOOKUP       |  7331K|  3467M|  1507   (5)| 00:00:01 |  Q6,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------

We have a simple two-table outer join, and five scalar subqueries in the select list. (Not being very familiar with the various XML calls I had no idea what the scalar subqueries were doing, or how they produced a result, beyond the fact that they were querying and aggregating multiple rows. In fact the combination of calls does much the same as listagg(), though it allows for a CLOB result (which could be part of the performance problem, of course) rather than being limited to a varchar2() result).

Would you like to guess at this point why I constructed my demonstration query against obj$ the way I did when presenting the idea of high cost-per-row queries as a reason for regular pauses in the output? The execution plan matched one of my two initial guesses about what the query was going to look like. When you “select count(*) from {this query}”, the optimizer will factor out the scalar subqueries and only have to count the result set from the hash join – and it might even manage to use a couple of parallel index fast full scans to get that result rather than doing the tablescans. When you run the query you have to run the scalar subqueries.
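The OP’s count(*) version isn’t shown, but here’s a minimal sketch of the principle, with a simple max() (my stand-in) replacing one of the XML aggregations: the scalar subquery in the inline view isn’t referenced by the outer count(*), so the optimizer can discard it and just count the rows from the join:

select count(*)
from    (
        select
                slv.sid,
                (
                select  max(sr1.related_sid)    -- stand-in for the XML aggregation
                from    service_relationship sr1
                where   sr1.sid = slv.sid
                and     sr1.relationship_level = '1'
                )       lvl1
        from    service_lookup slv
        left join service_location sl
        on      sl.service_location_id = slv.service_location_id
        )
;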

If we trust the statistics, we have 5 subqueries to run for each row of the hash join – and the hash join is predicted to return 7.3 million rows. Given that the subqueries are all going to run parallel tablescans against a fairly large table (note – the cost of the tablescans on SERVICE_RELATIONSHIP is 368, compared to a cost of 366 for the tablescan of SERVICE_LOCATION, which returns 3.1M rows) that’s an awful lot of work for each row returned – unless we benefit from an enormous amount of scalar subquery caching.

Here’s another performance threat that the plan shows, though: notice where the PX SEND QC operation appears – that means the PX slaves send their (7M) rows to the Query Co-ordinator and the QC is responsible for doing all the work of running the scalar subqueries. Another interesting little threat visible in the plan shows up in the TQ column – the plan uses six “data flow operations” (the original naming convention; the name changed some time ago but survives in the column names of v$pq_tqstat). In principle each DFO could allocate two sets of PX slaves (and every DFO could have a different degree of parallelism); in this example DFO number 6 (the driving hash join) uses two sets of slaves, and the other five DFOs (the scalar subqueries) use a single set each. The upshot of this is that if the default degree of parallelism in play is N this query will allocate 7N parallel query slaves (five single sets plus one pair of sets, each set of N slaves). It gets a little nastier than that, though (based on checking the output from v$sql_plan_monitor), because each time one of the scalar subqueries runs Oracle seems to allocate and deallocate the slaves that are supposed to run it – which is probably going to cause some contention if there are other parallel queries trying to run at the same time.
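If you want to watch that allocation and deallocation happening, one easy option (my suggestion, not something from the thread) is to poll v$px_process from a second session while the query is running and watch the status of the slave processes change:

select  server_name, status, sid, spid
from    v$px_process
order by
        server_name
;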

Optimisation

So what could you do with this query ? It depends on how much change you want to make to the code.

It’s possible that an index on service_relationship(relationship_level, sid) – with compress 1 – might help if it’s very precise, and if the target table stays in the buffer cache for the duration of the query – but, in the absence of scalar subquery caching, that could still leave the query co-ordinator executing 35 million (5 queries x 7 million rows) subqueries in a serialised process.
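For reference, the index in question would look something like this (the index name is mine):

create index sr_lvl_sid
        on service_relationship(relationship_level, sid)
        compress 1
;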

A better bet may be to convert from subqueries to joins – remembering that the listagg() / xmlserialize() calls will require you to aggregate (which means sorting in this case) an estimated 25 rows per driving row per relationship_level; in other words you may need to sort 7M * 125 = 875M rows – but at least you could do that in parallel, and there’s always the possibility that the estimated 25 drops off as you work through the different levels. You could choose to do 5 outer hash joins or (as Iggy Fernandez outlined in the thread) you could do a single outer join with a decode on the relationship_level. Another variation on this theme (which would probably have a plan showing ‘join then aggregate’) would be to ‘aggregate then join’. It’s possible that creating a non-mergeable inline view for the 5 values of relationship_level from a single table access, aggregating it to produce the five required columns, then using the result in an outer join, would be the most efficient option. In the absence of a detailed understanding of the data volume and patterns it’s hard to make any prediction of which strategy would work best.
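To give a flavour of the ‘aggregate then join’ option, here’s a minimal sketch (my code, not the OP’s) showing just the first two levels, using listagg() in place of the XML calls – so it’s limited to a varchar2() result – with a no_merge hint to keep the inline view non-mergeable:

select
        slv.sid,
        agg.level1_list,
        agg.level2_list
from    service_lookup slv
left join
        service_location sl
on      sl.service_location_id = slv.service_location_id
left join
        (
        select  /*+ no_merge */
                sid,
                listagg(case when relationship_level = '1' then related_sid end, ',')
                        within group (order by related_sid)     level1_list,
                listagg(case when relationship_level = '2' then related_sid end, ',')
                        within group (order by related_sid)     level2_list
        from    service_relationship
        where   relationship_level in ('1','2')
        group by
                sid
        )       agg
on      agg.sid = slv.sid
;

The case expressions return null for the non-matching levels and listagg() ignores nulls – which is what lets a single pass of service_relationship produce all the required columns before the join.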

Footnote:

I may be wrong in my analysis of this problem. When I first saw the question the reason for the performance pattern suggested an “obvious” design error in either the SQL or the infrastructure, and when I saw that the query and execution plan matched my prediction it became very hard for me to think that there might be some other significant cause.

There were a couple of interesting details in the execution plan that made me pursue the problem a little more. In the first case I built a very simple model to get an estimate of the time needed to display 7M rows of a reasonable width in SQL*Plus running across a typical LAN (my estimate was in the order of 45 minutes – not hours); then I spent a little more time (about 10 minutes) to build a model that reproduced the key features of the execution plan shown.

I then spent two or three hours playing with the model, and I’ll be writing a further blog with some of the results later on. One detail to carry away today, though, is that in 12c Oracle can do a new form of subquery unnesting which transformed the query from its 5 scalar subquery form into the seven table join form that was one of the suggestions made on the thread; even more interestingly, if I blocked the unnesting (to force the subquery execution) Oracle 12.1.0.2 came up with a new operator (EXPRESSION EVALUATION) that allowed it to run the subqueries from the PX slaves before passing the results to the query co-ordinator – in other words eliminating the serialisation point.
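(For anyone who wants to experiment: the post doesn’t say how the unnesting was blocked, but the obvious mechanism is a no_unnest hint inside each scalar subquery – here’s the idea applied to my earlier obj$ example:)

select
        o1.spare1 ,
        (
        select  /*+ no_unnest */
                max(ctime)
        from    obj$    o2
        where   o2.owner# = o1.owner#
        and     o2.obj# < o1.obj#
        ) ct
from obj$ o1
;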

To be continued …

July 27, 2014

Parallel Plans

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 8:39 pm BST Jul 27,2014

This is the directory for a short series I wrote discussing how to interpret parallel execution plans in newer versions of Oracle.

For other aspects of parallel execution the best, and most wide-ranging, source is Randolf Geist’s blog, which contains so much material that I’ve had to create a catalogue for my own convenience. All the following links point to work by Randolf.

Parallel Data Skew

One of the awkward problems you can encounter with parallel execution is data skew – which has the potential to make just one slave in a set do (almost) all the work, hence reducing the performance to something close to serial execution times. Randolf has written a series of articles on Parallel Skew that has been published by AllthingsOracle over the last few months.

XPLAN_ASH

Parallel Execution New Features (12c)

March 10, 2014

Parallel Execution – 5

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 1:30 pm BST Mar 10,2014

In the last article (I hope) of this series I want to look at what happens when I change the parallel distribution method on the query that I’ve been using in my previous demonstrations.  This was a query first introduced in a note on Bloom Filters (opens in a separate window) where I show two versions of a four-table parallel hash join, one using the broadcast distribution mechanism throughout, the other using the hash distribution method. For reference you can review the table definitions and plan (with execution stats) for the serial join in this posting (also opens in a separate window).

To change distribution methods from the broadcast example to the hash example I’ve simply changed a few hints in my code. Here are two sets of hints showing what I’ve done; the first is a repeat from the third article showing the broadcast example, the second shows the small change needed to get the hash example:


/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
    full(t3) parallel(t3, 2)
    monitor
*/

/*+
    leading(t4 t1 t2 t3)
    full(t4) parallel(t4, 2)
    use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 hash hash)
    full(t1) parallel(t1, 2)
    use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 hash hash)
    full(t2) parallel(t2, 2)
    use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 hash hash)
    full(t3) parallel(t3, 2)
    monitor
*/

Because of the combination of leading() hint with the use_hash() and swap_join_inputs() hints the plan WILL still build in-memory hash tables from t1, t2, and t3 and it WILL still probe each hash table in turn with the rows (that survive) from t4; but the order of activity in the hash distribution plan will be dramatically different from the order in the serial and parallel broadcast plans where the order in which Oracle actually built the in-memory hash tables was t3, t2, t1.

Here – with a little cosmetic adjustment – is the parallel execution plan using hash distribution on 11.2.0.4, captured from memory with rowsource execution stats enabled (the 12c plan would report “PX SEND HYBRID HASH” operators with an associated “STATISTICS COLLECTOR” operator, showing that adaptive execution was a possibility – with three points at which the plan might switch from hash distribution to broadcast):


--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name     | Starts | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |          |      1 |   437 (100)|          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   1 |  SORT AGGREGATE                    |          |      1 |            |          |        |      |            |      1 |00:00:00.08 |      16 |      5 |
|   2 |   PX COORDINATOR                   |          |      1 |            |          |        |      |            |      2 |00:00:00.08 |      16 |      5 |
|   3 |    PX SEND QC (RANDOM)             | :TQ10006 |      0 |            |          |  Q1,06 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |
|   4 |     SORT AGGREGATE                 |          |      2 |            |          |  Q1,06 | PCWP |            |      2 |00:00:00.01 |       0 |      0 |
|*  5 |      HASH JOIN                     |          |      2 |   437   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|   6 |       JOIN FILTER CREATE           | :BF0000  |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   7 |        PX RECEIVE                  |          |      2 |     2   (0)| 00:00:01 |  Q1,06 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|   8 |         PX SEND HASH               | :TQ10004 |      0 |     2   (0)| 00:00:01 |  Q1,04 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|   9 |          PX BLOCK ITERATOR         |          |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 10 |           TABLE ACCESS FULL        | T3       |      2 |     2   (0)| 00:00:01 |  Q1,04 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  11 |       PX RECEIVE                   |          |      2 |   435   (3)| 00:00:03 |  Q1,06 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|  12 |        PX SEND HASH                | :TQ10005 |      0 |   435   (3)| 00:00:03 |  Q1,05 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  13 |         JOIN FILTER USE            | :BF0000  |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |     27 |00:00:00.01 |       0 |      0 |
|* 14 |          HASH JOIN BUFFERED        |          |      2 |   435   (3)| 00:00:03 |  Q1,05 | PCWP |            |    630 |00:00:00.01 |       0 |      0 |
|  15 |           JOIN FILTER CREATE       | :BF0001  |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  16 |            PX RECEIVE              |          |      2 |     2   (0)| 00:00:01 |  Q1,05 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  17 |             PX SEND HASH           | :TQ10002 |      0 |     2   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  18 |              PX BLOCK ITERATOR     |          |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 19 |               TABLE ACCESS FULL    | T2       |      2 |     2   (0)| 00:00:01 |  Q1,02 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  20 |           PX RECEIVE               |          |      2 |   432   (3)| 00:00:03 |  Q1,05 | PCWP |            |    632 |00:00:00.01 |       0 |      0 |
|  21 |            PX SEND HASH            | :TQ10003 |      0 |   432   (3)| 00:00:03 |  Q1,03 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  22 |             JOIN FILTER USE        | :BF0001  |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |    632 |00:00:00.09 |       0 |      0 |
|* 23 |              HASH JOIN BUFFERED    |          |      2 |   432   (3)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.09 |       0 |      0 |
|  24 |               JOIN FILTER CREATE   | :BF0002  |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  25 |                PX RECEIVE          |          |      2 |     2   (0)| 00:00:01 |  Q1,03 | PCWP |            |      3 |00:00:00.01 |       0 |      0 |
|  26 |                 PX SEND HASH       | :TQ10000 |      0 |     2   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  27 |                  PX BLOCK ITERATOR |          |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWC |            |      3 |00:00:00.01 |       4 |      2 |
|* 28 |                   TABLE ACCESS FULL| T1       |      2 |     2   (0)| 00:00:01 |  Q1,00 | PCWP |            |      3 |00:00:00.01 |       4 |      2 |
|  29 |               PX RECEIVE           |          |      2 |   427   (2)| 00:00:03 |  Q1,03 | PCWP |            |  14700 |00:00:00.08 |       0 |      0 |
|  30 |                PX SEND HASH        | :TQ10001 |      0 |   427   (2)| 00:00:03 |  Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |      0 |
|  31 |                 JOIN FILTER USE    | :BF0002  |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.05 |    6044 |   6018 |
|  32 |                  PX BLOCK ITERATOR |          |      2 |   427   (2)| 00:00:03 |  Q1,01 | PCWC |            |  14700 |00:00:00.04 |    6044 |   6018 |
|* 33 |                   TABLE ACCESS FULL| T4       |     26 |   427   (2)| 00:00:03 |  Q1,01 | PCWP |            |  14700 |00:00:00.04 |    6044 |   6018 |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T3"."ID"="T4"."ID3")
  10 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T3"."SMALL_VC")=1 OR TO_NUMBER("T3"."SMALL_VC")=2 OR TO_NUMBER("T3"."SMALL_VC")=3))
  14 - access("T2"."ID"="T4"."ID2")
  19 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T2"."SMALL_VC")=1 OR TO_NUMBER("T2"."SMALL_VC")=2 OR TO_NUMBER("T2"."SMALL_VC")=3))
  23 - access("T1"."ID"="T4"."ID1")   28 - access(:Z>=:Z AND :Z<=:Z)
       filter((TO_NUMBER("T1"."SMALL_VC")=1 OR TO_NUMBER("T1"."SMALL_VC")=2 OR TO_NUMBER("T1"."SMALL_VC")=3))
  33 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T4"."ID1"))
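(For anyone reproducing this: the capture mechanism isn’t shown above, but a typical approach is to enable rowsource execution statistics for the session and then pull the plan from memory – avoiding the “LAST” format modifier for parallel queries so that the statistics are summed across all the slave processes instead of reporting only the query co-ordinator:)

alter session set statistics_level = all;

-- run the query, then:
select * from table(dbms_xplan.display_cursor(null, null, 'allstats'));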

There are a couple of significant points that are very easy to point out in this plan. First, we have a number of lines which are “BLOOM FILTER CREATE/USE” lines that did not appear in the broadcast plan; second, we can only see one sys_op_bloom_filter() in the predicate section rather than three (don’t worry, it’s – partly – a reporting defect); finally, we have seven virtual tables (table queues :TQnnnnn) in this plan rather than four, and those virtual tables seem to be scattered rather more randomly around the plan.

To make it easier to understand what’s happened with a parallel execution plan, I usually also dump out the contents of v$pq_tqstat after running the query – so here’s the result after running the above:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         30         16           0
                                             1 P001                     1         55         26         14           0

                    1 Producer               1 P002                  1476      35520          2          1           0
                                             1 P003                 13224     317880          1          0           0
                      Consumer               1 P000                  9800     235584         20         14           0
                                             1 P001                  4900     117816         20         14           0

                    2 Producer               1 P000                     3         69          0          0           0
                                             1 P001                     0         48          0          0           0
                      Consumer               1 P002                     2         62         33         19           0
                                             1 P003                     1         55         32         19           0

                    3 Producer               1 P000                   422       9754          0          0           0
                                             1 P001                   210       4878          0          0           0
                      Consumer               1 P002                   420       9708         33         19           0
                                             1 P003                   212       4924         32         18           0

                    4 Producer               1 P002                     3         69          1          0           0
                                             1 P003                     0         48          0          0           0
                      Consumer               1 P000                     2         62         42         20           0
                                             1 P001                     1         55         39         15           0

                    5 Producer               1 P002                    18        444          0          0           0
                                             1 P003                     9        246          0          0           0
                      Consumer               1 P000                    18        444         41         20           0
                                             1 P001                     9        246         39         16           0

                    6 Producer               1 P000                     1         60          0          0           0
                                             1 P001                     1         60          0          0           0
                      Consumer               1 QC                       2        120          1          0           0
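(The report comes from a query along the following lines – the column list and ordering are my reconstruction, chosen to match the layout above:)

select  dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes, waits, timeouts, avg_latency
from    v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;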

So let’s work our way through the execution plan – if you want to put the plan and my comments side by side, this link will re-open this article in a second window.

Given the set of hints, and the intent I expressed at the start of the series, we hope to see Oracle building an in-memory hash table from each of t3, t2 and t1 in that order, following which it will scan t4, probe t1, t2, and t3 in that order, and then aggregate the result.  Let’s check that using the parallel plan rule of “follow the table queues”.

Table queue 0 covers lines 26 – 28, we scan t1 and distribute it by hash.  We can see from the A-Rows column we found 3 rows and distributed them, and if we look at the output from v$pq_tqstat we find it matches – slaves 2 and 3 produced 3 rows, slaves 0 and 1 consumed 3 rows. Table queue 1 covers lines 30 – 33, we scan t4 and distribute it by hash. We can see from the A-Rows column we found 14,700 rows and distributed them, and again we can see the match in v$pq_tqstat – slaves 2 and 3 produced 14,700 rows and distributed them to slaves 0 and 1. But there’s an oddity here, and things start to get messy: from the predicate section we can see that we applied a Bloom filter on the ID1 column on the data we got from the tablescan, and the plan itself shows a Bloom filter (:BF0002) being used at line 31, but that Bloom filter is created at line 24 of the plan and line 24 has been associated with table queue 3. Now I know (because I constructed the data) that a perfect filter has been created and used at that point because 14,700 rows is exactly the volume of data that should eventually join between tables t1 and t4.  It’s reasonable, I think, to say that the boundary between table queues 0 and 3 is a little blurred at lines 24/25 – the slaves that are going to populate table queue 3 are the ones that created the Bloom filter, but they’re not going to populate table queue 3 just yet.

So let’s move on to table queue 2. This covers lines 17-19 (looking at the TQ column) except I’m going to assume the same blurring of boundaries I claimed for table queue 0 – I’m going to say that table queue 2 expands into lines 15-19 (bringing in the PX RECEIVE and JOIN FILTER CREATE (:BF0001)). So our next step is to scan and distribute table t2, and build a Bloom filter from it. Again we look at v$pq_tqstat and see that in this case it’s slaves 0 and 1 which scan the table and distribute 3 rows to slaves 2 and 3, and we assume that slaves 2 and 3 will send a Bloom filter back to slaves 0 and 1.

Now we can move on to table queue 3: line 21 writes to table queue 3 by using lines 22, 23, 24, 25, and 29 according to the TQ column (but thanks to the blurring of the boundaries lines 24 and 25 were used “prematurely” to create the Bloom filter :BF0002 describing the results from table t1). So lines 24/25 read table queue 0 and built an in-memory hash table, simultaneously creating a Bloom filter and sending it back to slaves 2 and 3; then line 23 did a HASH JOIN BUFFERED, which means it copied the incoming data from table queue 1 (slaves 2 and 3, table t4) into a buffer and then used that buffer to probe its in-memory hash table and do the join; then line 22 applied a Bloom filter (:BF0001) to the result of the hash join although the filter won’t appear in the predicate section until version 12.1.0.1. Notice that line 23 (the join) produced 14,700 rows, demonstrating that our previous filter was a perfect filter, and then line 22 filtered out all but 632 rows. (Again, because I constructed the data I can tell you that the second Bloom filter has also worked with 100% accuracy – although v$pq_tqstat seems to show an extra 2 rows which I can’t account for and which don’t appear in the trace file).

So here’s another problem – we’re using another Bloom filter that we haven’t yet (apparently) created unless we accept my assumption of the blurring of the boundary at lines 15 and 16, where the plan shows two lines associated with table queue 5 even though I need them to be associated with table queue 2 so that they can produce the Bloom filter needed by table queue 3. Again, by the way, we can do the cross-check with TQ_ID 3 of v$pq_tqstat and see that slaves 0 and 1 produced 632 rows and sent them to slaves 2 and 3.

Before continuing, let’s rewrite the action so far as a series of bullet points:

  • Slaves 2,3 scan t1 and distribute to slaves 0,1
  • Slaves 0,1 build an in-memory hash table and a Bloom filter (:BF0002) for t1, and send the filter to slaves 2,3
  • Slaves 2,3 scan t4, use the Bloom filter (:BF0002) to eliminate data (luckily 100% perfectly) and distribute the remaining rows to slaves 0,1
  • Slaves 0,1 buffer the incoming data
  • Slaves 0,1 scan t2 and distribute to slaves 2,3
  • Slaves 2,3 build an in-memory hash table for the results from t2 and a Bloom filter (:BF0001) for t2, and send the filter to slaves 0,1
  • Slaves 0,1 use the buffered t4 to probe the in-memory hash of t1 to do the join, testing join results against the Bloom filter (:BF0001) for t2, and distributing the surviving rows to slaves 2,3

The pattern of the last four steps will then repeat for the next hash join – and for longer joins the pattern will repeat up to, but excluding, the last join.

  • Slaves 2,3 buffer the incoming data (the result of joining t4 and t1) – the buffering is implied by line 14 (which is labelled as an input for table queue 5)
  • Slaves 2,3 scan t3 and distribute to slaves 0,1 (reading lines 8,9,10 of the plan), cross-checking with TQ_ID 4 of v$pq_tqstat
  • Slaves 0,1 build an in-memory hash table for the results from t3 and a Bloom filter (:BF0000) for t3, and send the filter to slaves 2,3 (“sharing” lines 6 and 7 from table queue 6)
  • Slaves 2,3 use the buffered results from (t4/t1) to probe the in-memory hash of t2 to do the join, testing join results against the Bloom filter (:BF0000) for t3, and distributing the surviving rows to slaves 0,1.

Again, we can check row counts – the hash join buffered at line 14 shows 630 rows coming from the hash join (i.e. the previous Bloom filter was perfect), and line 13 shows 27 rows surviving the final Bloom filter. Again my knowledge of the data tells me that the Bloom filter was a perfect filter. Cross-checking to TQ_ID 5 of v$pq_tqstat we see slaves 2 and 3 producing 27 rows and slaves 0 and 1 consuming them.

So at this point slaves 0,1 have an in-memory hash table for t3, and are receiving the filtered results of the join between t4, t1, and t2; the slaves have to join and aggregate the two data sets before forwarding a result to the query co-ordinator. Since the aggregation is a blocking operation (i.e. slaves 0,1 can’t start to send data to the co-ordinator until they’ve finished emptying virtual table 5 and have aggregated all the incoming data) they don’t have to use the “hash join buffered” mechanism, so the pattern for the final part of the plan changes.

Lines 5, 6, 7, 11 show us the hash join (not buffered) with its two inputs (although lines 6 and 7 have, of course, been mentioned once already as the source of the Bloom filter used at line 13). Then line 4 shows slaves 0 and 1 aggregating their results; line 3 shows them forwarding the results to the query co-ordinator, line 2 shows the query co-ordinator receiving the results and line 1 shows it aggregating across the slave results ready to send to the end-user.

It’s a bit complicated, and the constant jumping back and forth through the execution plan lines (especially for the “shared” usage of the Bloom filter creation lines) makes it quite hard to follow, so I’ve drawn up a Powerpoint slide to capture the overall picture:

[Figure: px_plan]

I’ve put the slaves 0 and 1 at the top of the picture, slaves 2 and 3 at the bottom, with the query co-ordinator in the middle at the right hand side. Time reads across the page from left to right, and that gives you the order in which data moves through table queues (and back, for Bloom filters). The annotations give you some idea of what data is moving. Note that I’ve used B1 to refer to the Bloom filter on table T1 (and ignored the numbering on Oracle’s :BFnnn entries). I’ve used red to highlight the data sets that are buffered, and put in curved arrows to show where the buffered data is subsequently brought back into play. I did try to add the various plan line numbers to the picture, but the volume of text made the whole thing incomprehensible – so I’ve left it with what I think is the best compromise of textual information and graphical flow.

I’ll just leave one final warning – if you want to reproduce my results, you’ll have to be careful about versions. I stuck with 11.2.0.4 as that’s the latest version of the most popular general release. There are differences in 12.1.0.1, and there are differences again if you try to emulate 11.2.0.4 by setting the optimizer_features_enable parameter in 12.1.0.1 back to the earlier version.

March 5, 2014

12c pq_replicate

Filed under: 12c,Exadata,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 5:39 pm BST Mar 5,2014

Another day, another airport lounge – another quick note: one of the changes that appeared in 12c was a tweak to the “broadcast” distribution option of parallel queries. I mentioned this in a footnote to a longer article a couple of months ago; this note simply expands on that brief comment with an example. We’ll start with a simple two-table hash join – which I’ll first construct and demonstrate in 11.2.0.4:


create table t1
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1000
;

create table t2
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000 ;

-- collect stats, no histograms.
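-- a sketch of the sort of call implied by the comment above
-- (method_opt chosen to avoid histograms):
begin
	dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/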

select
  	/*+
  		leading(t1 t2)
 		parallel(t1 2)
 		parallel(t2 2)
 		use_hash(t2)
 	*/
 	t1.padding,
 	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10001 |  1000 |   707K|   135 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN            |          |  1000 |   707K|   135 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE          |          |  1000 |   207K|     4 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST  | :TQ10000 |  1000 |   207K|     4 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR |          |  1000 |   207K|     4 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR   |          | 20000 |     9M|   131 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS FULL  | T2       | 20000 |     9M|   131 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

In this plan slave set 2 scans table t1 in parallel and broadcasts the result set to slave set 1 (lines 5 – 7). The significance of the broadcast option is that each slave in slave set 2 sends all the rows it has read to every slave in slave set 1. For a fairly large table with a high degree of parallelism this could be a lot of inter-process communication; the total number of rows passing through the PX message pool is “DOP x number of rows filtered from t1” (in this example, 2 x 1,000 = 2,000 rows).

After a slave in slave set 1 has received the whole of the t1 result set it builds an in-memory hash table and starts scanning rowid ranges (PX BLOCK ITERATOR) from table t2, probing the in-memory hash table to effect the join (lines 3,4, 8,9). Since each slave has a copy of the whole result set from t1 it can scan any chunk of t2 and handle the contents locally. Moreover, because slave set 1 isn’t reading its second input from a virtual table it is able to write its output immediately to the virtual table (:TQ10001) that feeds the query coordinator with the result (lines 1,2) – we don’t have to do a “hash join buffered” operation and buffer the entire second input before starting to execute the join.

So how does 12c change things? With the same starting data and query, here’s the execution plan:

-----------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  1000 |   707K|   135 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  1000 |   707K|   135 |  Q1,00 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN          |          |  1000 |   707K|   135 |  Q1,00 | PCWP |            |
|   4 |     TABLE ACCESS FULL | T1       |  1000 |   207K|     4 |  Q1,00 | PCWP |            |
|   5 |     PX BLOCK ITERATOR |          | 20000 |     9M|   131 |  Q1,00 | PCWC |            |
|   6 |      TABLE ACCESS FULL| T2       | 20000 |     9M|   131 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------

Notice, in particular, that we only have one virtual table (or table queue :TQ10000) rather than two – and that’s from a parallel query slave set to the query co-ordinator, parallel to serial; the query only uses one set of parallel query slaves. Until you run the query with rowsource execution statistics enabled and look at the output from v$pq_tqstat it’s not going to be immediately obvious what has happened, but we should see that somehow Oracle is no longer broadcasting the first table even though it’s still doing something in parallel with both tables.

The run-time statistics confirm that we’ve only used one set of slaves, and each slave in the slave set has scanned the whole of table t1. This means each slave can build the full hash table and then go on to read rowid ranges from table t2. We’ve managed to get the benefit of broadcasting t1 (every slave has the whole of t1 so we don’t have to scan and distribute the big table t2 through the PX message pool) but we haven’t had to clone it multiple times through the PX message pool.

Clearly there’s a trade-off here that Oracle Corp. has decided is worth considering. I’m guessing it’s biased towards Exadata where you might run queries with a very high degree of parallelism. In that case the overhead of task switching as large numbers of messages are passed around may (and this is pure supposition) be greater than the added cost of loading the table into the buffer cache (of each instance) and having each slave scan it from there. (Reminder – 11g introduced two “opposite” changes to tablescans: “serial direct reads” and “parallel in-memory scans”.)

There’s one little oddity in this replication – there’s a pair of hints: pq_replicate and no_pq_replicate to control the effect if you think the optimizer is making the wrong choice. I would have guessed that in my example the hint would read: /*+ pq_replicate(t1) */ as it’s table t1 that is read by every single slave. Strangely, though, this is what the outline section of the execution plan showed:


  /*+
      BEGIN_OUTLINE_DATA
      PQ_REPLICATE(@"SEL$1" "T2"@"SEL$1")
      PQ_DISTRIBUTE(@"SEL$1" "T2"@"SEL$1" BROADCAST NONE)
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      OPT_PARAM('_optimizer_cost_model' 'io')
      DB_VERSION('12.1.0.1')
      OPTIMIZER_FEATURES_ENABLE('12.1.0.1')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Notice how the hint specifies table t2, not table t1!
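So if you wanted to block the replication for this query you would (counter-intuitively) have to aim the hint at t2. Here’s my guess at the simplest usage – an untested sketch rather than a verified example:

select
	/*+
		leading(t1 t2)
		use_hash(t2)
		no_pq_replicate(t2)
		parallel(t1 2)
		parallel(t2 2)
	*/
	t1.padding,
	t2.padding
from	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;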

Footnote

Here’s a little anomaly, and a generic warning about “optimizer_features_enable”: I found that if I used the hint /*+ optimizer_features_enable('11.2.0.4') */ in 12c I could still get the pq_replicate() hint to work. Unfortunately there are a few places where the hint (or parameter) isn’t guaranteed to take the optimizer code backwards the full 100%.

February 28, 2014

Empty Hash

Filed under: Bugs,CBO,Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:45 pm BST Feb 28,2014

A little while ago I highlighted a special case with the MINUS operator (that one of the commentators extended to include the INTERSECT operator) relating to the way the second subquery would take place even if the first subquery produced no rows. I’ve since had an email from an Oracle employee letting me know that the developers looked at this case and decided that it wasn’t feasible to address it because – taking a wider view point – if the query were to run parallel they would need a mechanism that allowed some synchronisation between slaves so that every slave could find out that none of the slaves had received any rows from the first subquery, and this was going to lead to hanging problems.

The email reminded me that there’s another issue of the same kind that I discovered several years ago (I thought I’d written it up, but maybe it was on a newsgroup or forum somewhere – I can’t find it on my blog or old website). The problem can be demonstrated by this example:

(more…)

February 26, 2014

Parallel Execution – 4

Filed under: Oracle,Parallel Execution — Jonathan Lewis @ 1:58 pm BST Feb 26,2014

I’m aware that in the previous article in this series I said I’d continue “in a few days” and it has now been more like 11 weeks – but finally I’ve got the time. In this article I’m going to talk primarily about Bloom filters and their impact on performance, but I’ll need to say something about the “virtual tables” and “parallel execution message size” before I begin. Take a look at this fragment of a parallel execution plan:

(more…)

December 8, 2013

Parallel Execution – 3

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 10:09 pm BST Dec 8,2013

It’s finally time to take a close look at the parallel versions of the execution plan I produced a little while ago for a four-table hash join. In this note I’ll examine the broadcast parallel distribution. First, here’s a list of the hints I’m going to use to get the effect I want:

	/*+
		leading(t4 t1 t2 t3)
		full(t4) parallel(t4, 2)
		use_hash(t1) swap_join_inputs(t1) pq_distribute(t1 none broadcast)
		full(t1) parallel(t1, 2)
		use_hash(t2) swap_join_inputs(t2) pq_distribute(t2 none broadcast)
		full(t2) parallel(t2, 2)
		use_hash(t3) swap_join_inputs(t3) pq_distribute(t3 none broadcast)
		full(t3) parallel(t3, 2)
		monitor
	*/

(more…)

November 5, 2013

Deadlock

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 6:14 am BST Nov 5,2013

There’s an interesting example of a deadlock on the OTN database forum:

DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]

Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
PS-00000001-00000011        92     423     S             33     128     S     X
BF-2ed08c01-00000000        33     128     S             92     423     S     X

(more…)

October 14, 2013

Parallel Execution – 2

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 6:29 pm BST Oct 14,2013

Since I’m going to write a couple of articles dissecting parallel execution plans, I thought I’d put up a reference post describing the set of tables I used to generate the plan from the previous post, and the query (with serial execution plan) that produced it. The setup is a simple star schema arrangement – which I’ve generated by creating three identical tables and then doing a Cartesian join across the three of them.

(more…)

October 13, 2013

Parallel Execution – 1

Filed under: Execution plans,Oracle,Parallel Execution — Jonathan Lewis @ 7:43 pm BST Oct 13,2013

When you read an execution plan you’re probably trying to identify the steps that Oracle went through to acquire the final result set so that you can decide whether or not there is a more efficient way of getting the same result.

For a serial execution plan this typically means you have to identify the join order, join methods and access methods together with the point at which each predicate was applied. Generally this isn’t too difficult, although subquery pushing (and a couple of presentation bugs relating to other uses of scalar subqueries) can cause a little confusion; and the difference between join order and the order of operation can be slightly obscured when considering hash joins.

Parallel execution plans are harder, though, because you really need to understand the impact of the order of operation, distribution mechanisms chosen, and (in recent versions of Oracle) the timing of the generation and use of Bloom filters. The topic is stunningly large and easy to describe badly; it’s also going to be easy to make generalisations that turn out to be untrue or (at least) sufficiently incomplete as to be misleading. Rather than attempting to cover the topic in one note, I think I’m going to end up writing two or three.

(more…)
