Here’s a question that came to mind while reading a recent posting on the OTN database forum. It’s a question to which I don’t know the answer and which, at present, I don’t really want to bother modelling, although if I were on a customer site and this looked like a probable explanation for a performance anomaly it’s the sort of thing I would create a model for.
If I have a query that runs parallel and does a “hash join buffered” (see, for example, this URL) it’s possible that the processes creating the build table will manage to build it entirely in memory without spilling to disc, but then find themselves unable to keep the incoming probe table data in memory, spilling it to disc and re-reading it later to do the join and forward the results to the parent process.
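For anyone who wants to experiment, here’s a minimal sketch of the type of query that can produce such a plan. The table names (t1, t2), the join column (id), and the degree of parallelism are my assumptions for illustration; the key detail is the pq_distribute() hint demanding a hash/hash distribution, which is the pattern that makes the “buffered” variant of the hash join appear:

select
        /*+
                leading(t1 t2)
                use_hash(t2)
                parallel(t1 2)
                parallel(t2 2)
                pq_distribute(t2 hash hash)
        */
        t1.id, t2.id
from
        t1, t2
where
        t2.id = t1.id
;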
Here’s the plan from the URL above, showing rowsource execution stats:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Time   |  TQ   |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads | Writes | OMem | 1Mem | O/1/M | Max-Tmp |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |          |       |      |            |  70101 |00:00:00.17 |      12 |     0 |      0 |      |      |       |         |
|   1 |  PX COORDINATOR         |          |      1 |          |       |      |            |  70101 |00:00:00.17 |      12 |     0 |      0 |      |      |       |         |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |      0 | 00:00:01 | Q1,02 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |     0 |      0 |      |      |       |         |
|*  3 |    HASH JOIN BUFFERED   |          |      2 | 00:00:01 | Q1,02 | PCWP |            |  70101 |00:00:00.28 |       0 |   310 |    310 | 5952K| 1953K| 2/0/0 |    2048 |
|   4 |     PX RECEIVE          |          |      2 | 00:00:01 | Q1,02 | PCWP |            |  70101 |00:00:00.08 |       0 |     0 |      0 |      |      |       |         |
|   5 |      PX SEND HASH       | :TQ10000 |      0 | 00:00:01 | Q1,00 | P->P | HASH       |      0 |00:00:00.01 |       0 |     0 |      0 |      |      |       |         |
|   6 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 | Q1,00 | PCWC |            |  70101 |00:00:00.05 |    1031 |  1005 |      0 |      |      |       |         |
|*  7 |        TABLE ACCESS FULL| T1       |     26 | 00:00:01 | Q1,00 | PCWP |            |  70101 |00:00:00.02 |    1031 |  1005 |      0 |      |      |       |         |
|   8 |     PX RECEIVE          |          |      2 | 00:00:01 | Q1,02 | PCWP |            |  70102 |00:00:00.06 |       0 |     0 |      0 |      |      |       |         |
|   9 |      PX SEND HASH       | :TQ10001 |      0 | 00:00:01 | Q1,01 | P->P | HASH       |      0 |00:00:00.01 |       0 |     0 |      0 |      |      |       |         |
|  10 |       PX BLOCK ITERATOR |          |      2 | 00:00:01 | Q1,01 | PCWC |            |  70102 |00:00:00.04 |    1031 |  1005 |      0 |      |      |       |         |
|* 11 |        TABLE ACCESS FULL| T2       |     26 | 00:00:01 | Q1,01 | PCWP |            |  70102 |00:00:00.01 |    1031 |  1005 |      0 |      |      |       |         |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Note that we have a “hash join buffered” at operation 3, which means the incoming data from t2 (the probe table) was buffered before being joined; but we can see 310 writes and 310 reads to temporary space attributed to that operation. That’s actually t2 spilling from the buffer to disc and then being read back from disc after the scan of t2 (at operation 11) has completed.
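As a reminder (and as a sketch if you want to produce this type of output for your own queries), the rowsource execution statistics come from enabling statistics collection and then calling dbms_xplan.display_cursor(); for a parallel query you want the “allstats all” format option rather than “allstats last”, since the latter reports only the query coordinator’s figures:

alter session set statistics_level = all;

-- execute the parallel query, then report the stats from its cursor:
select * from table(dbms_xplan.display_cursor(null, null, 'allstats all'));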
If this pattern of activity appears, how is it reflected in the statistics?
Because the build table was built completely in memory you could argue for an “optimal workarea execution”; but because the probe table was temporarily spilled to disc you could argue for a “one-pass workarea execution”. In this case I was able to say that the writes and reads for operation 3 were the t2 data spilling to disc because (a) I understand (mostly) how parallel hash joins work, (b) I had trace files I could check for timings, and (c) the O/1/M column of the output shows 2 optimal executions and no one-pass executions.
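If you don’t have trace files to hand, a cross-check is available from v$sql_workarea, which records the optimal/one-pass/multipass execution counts and the most recent temp segment size for each workarea operation of a cursor. A sketch, assuming you know the sql_id of the query:

select
        operation_id, operation_type, policy,
        optimal_executions, onepass_executions, multipasses_executions,
        last_tempseg_size
from
        v$sql_workarea
where
        sql_id = '&m_sql_id'
order by
        operation_id
;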
The question to which I don’t (yet) know the answer is this: is the I/O caused by this spill to disc captured in v$pgastat under the stats:
NAME                                              VALUE UNIT
-------------------------------------------- ---------- ------------
extra bytes read/written                              0 bytes
cache hit percentage                                100 percent
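(The figures above come from a query along the following lines; on a busy system, of course, you’d capture the values before and after the test and work with the difference:)

select
        name, value, unit
from
        v$pgastat
where
        name in (
                'extra bytes read/written',
                'cache hit percentage'
        )
;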
If so, then you end up with questions like the one on OTN:
How come OEM reports my PGA cache Hit Ratio at 76% when I’ve got so much more memory than seems necessary?
With the follow-up comment:
All my workarea executions were optimal
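For a sense of scale, and assuming the documented definition of the statistic, the ratio is derived as:

        cache hit percentage = 100 * bytes processed / (bytes processed + extra bytes read/written)

so a figure of 76% means the “extra bytes read/written” amount to roughly 31% of the bytes processed (0.24 / 0.76 is approximately 0.316), which is a lot of temporary I/O for a system that is supposedly not short of memory.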
Footnote
Though I haven’t examined it recently, I remember noting in the past that the buffer allocated for the incoming probe data was typically a small fraction (often 1/8th or 1/16th) of the size of the buffer allocated for the build table. This may help to explain how you could end up doing a lot of I/O without coming close to exhausting the available memory.
Comments and related questions are welcome.