Oracle Scratchpad

November 5, 2008

PX Buffer

Filed under: Execution plans,Parallel Execution,Performance,trace files,Troubleshooting — Jonathan Lewis @ 7:11 pm GMT Nov 5,2008

Here’s a surprising anomaly that showed up in a question on the OTN forums a little while ago. Consider a simple query that uses a hash join between two tables.

select
	/*+
		ordered
		use_hash(t2)
		parallel(t1 2)
		parallel(t2 2)
		pq_distribute(t2 hash hash)
	*/
	t1.padding,
	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;

When it runs serially the join completes in memory and the only I/O you see comes from the two tablescans. When the query runs parallel, something causes a spill to the temporary tablespace.

Here’s the code to build the tables, followed by the execution plans from 9.2.0.8 and 10.2.0.3:

alter session set workarea_size_policy = manual;
alter session set hash_area_size = 10495760;

create table t1
nologging		-- adjust as necessary
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	rownum				n1,
	lpad(rownum,6,'0')		small_vc,
	lpad(rownum,200,'0')		padding
from
	generator	v1,
	generator	v2
where
	rownum <= 10000
;

create table t2
nologging		-- adjust as necessary
as
with generator as (
	select	--+ materialize
		rownum 	id
	from	all_objects
	where	rownum <= 3000
)
select
	1 + mod(rownum,10000)			n1,
	lpad(1 + mod(rownum,10000),6,'0')	small_vc,
	lpad(rownum,500,'0')			padding
from
	generator	v1,
	generator	v2
where
	rownum <= 20000
;

-- gather statistics here
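
For anyone following along, the “gather statistics” step needs nothing more than a basic call – something along these lines (the method_opt shown is just a reasonable choice for this test, not something the results depend on):

begin
	dbms_stats.gather_table_stats(user, 'T1', method_opt => 'for all columns size 1');
	dbms_stats.gather_table_stats(user, 'T2', method_opt => 'for all columns size 1');
end;
/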

Execution plan: 9.2.0.8
-------------------------------------------------------------------
| Id| Operation          | Name| Rows |  TQ   |IN-OUT| PQ Distrib |
-------------------------------------------------------------------
|  0| SELECT STATEMENT   |     |    2 |       |      |            |
|* 1|  HASH JOIN         |     |    2 | 98,02 | P->S | QC (RAND)  |
|  2|   TABLE ACCESS FULL| T1  |10325 | 98,00 | P->P | HASH       |
|  3|   TABLE ACCESS FULL| T2  |20212 | 98,01 | P->P | HASH       |
-------------------------------------------------------------------

Execution plan 10.2.0.3
--------------------------------------------------------------------------------
| Id | Operation               | Name     | Rows  |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------
|  0 | SELECT STATEMENT        |          |     2 |        |      |            |
|  1 |  PX COORDINATOR         |          |       |        |      |            |
|  2 |   PX SEND QC (RANDOM)   | :TQ10002 |     2 |  Q1,02 | P->S | QC (RAND)  |
|* 3 |    HASH JOIN BUFFERED   |          |     2 |  Q1,02 | PCWP |            |
|  4 |     PX RECEIVE          |          | 10000 |  Q1,02 | PCWP |            |
|  5 |      PX SEND HASH       | :TQ10000 | 10000 |  Q1,00 | P->P | HASH       |
|  6 |       PX BLOCK ITERATOR |          | 10000 |  Q1,00 | PCWC |            |
|  7 |        TABLE ACCESS FULL| T1       | 10000 |  Q1,00 | PCWP |            |
|  8 |     PX RECEIVE          |          | 20000 |  Q1,02 | PCWP |            |
|  9 |      PX SEND HASH       | :TQ10001 | 20000 |  Q1,01 | P->P | HASH       |
| 10 |       PX BLOCK ITERATOR |          | 20000 |  Q1,01 | PCWC |            |
| 11 |        TABLE ACCESS FULL| T2       | 20000 |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------

I’ve created table t1 with 10,000 rows and a (unique) column n1 that ranges from 1 to 10,000. Table t2 then has 20,000 rows, again with an n1 column that ranges from 1 to 10,000 – in this case with two rows per value. When you join the two tables on their n1 columns, the result set will have 20,000 rows.
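
As a quick sanity check on that claim you can run the join serially and simply count the rows:

select
	count(*)
from	t1, t2
where	t2.n1       = t1.n1
and	t2.small_vc = t1.small_vc
;

-- expect 20,000: each t1 row matches exactly two rows in t2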

The table-creation code is more complex than it really needs to be, but I wrote it that way in case I needed to create a few million rows in each table.

You will notice that I have switched to manual workarea management so that I can set the hash_area_size (rather than letting Oracle pick a dynamic limit based on the pga_aggregate_target and current PGA usage). This is simply to make the test case reproducible. I’ve also included the pq_distribute() hint in the query to tell the optimizer to use the hash/hash distribution rather than broadcasting the first table (which could be forced by the hint /*+ pq_distribute(t2, broadcast, none) */). In my example, the optimizer took the hash/hash option automatically and I have just included the hint for reasons of reproducibility, and to demonstrate the hint syntax.

Mechanisms:

Checking the execution plans, we can build the following picture:

One set of PX slaves scans t1 (referenced as para_1 in the picture) and distributes the data, by hashing on the n1 column, to the second set of slaves by writing to virtual table (or table queue) Q0; then the same set of slaves scans table t2 (para_2) and distributes its data in the same way, using virtual table Q1.

The second set of slaves joins the two virtual tables by reading Q0 to build the hash table, then probing the hash table with each row from Q1 as it arrives, and passing each join result to the query coordinator by writing to virtual table Q2.

Update: in response to a question on the OTN DB Forum, I’ve been prompted to point out the significance of the “98,{N}” in the 9i plan and the “Q1,{N}” in the 10g plan. The first number identifies the “data flow operation” (DFO) – a single parallel flow of data all the way through the query; the second number is the virtual table number within that data flow operation. The two numbers appear in combination in the 10g “Name” column (e.g. :TQ10002) to identify completely the virtual table that an operation is populating. (If you see more than one data flow operation in a single query then something inefficient may be going on.)

Data Volume

So, I’ve got 10,000 rows of about 200 bytes each in t1 for a total of about 2MB, 20,000 rows of about 500 bytes each in t2 for a total of about 10MB, and a join result of 20,000 rows of about 700 bytes each for a total of about 14MB. Since we are running parallel 2, we might expect each PX slave to see about 5,000 rows (1MB) from t1, 10,000 rows (5MB) from t2, and 10,000 rows (7MB) of result.
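
If you want to cross-check those figures against the data dictionary once the statistics are in place, a simple query against user_tables will do it – the megabyte conversion below assumes an 8KB block size:

select
	table_name,
	num_rows,
	blocks,
	round(blocks * 8 / 1024, 1)	approx_mb	-- assumes 8KB blocks
from
	user_tables
where
	table_name in ('T1', 'T2')
;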

Since we set the hash_area_size to 10MB, we seem to have enough memory to hold the hash table completely in memory. We could even hold the hash table and all the data we use from t2 in memory – we could even hold the hash table and the result set in memory. So why do we see writes and reads to the temporary tablespace?

Analysis

The first thing to do, of course, is to find out where that I/O comes from. Enabling events 10046 (wait tracing) and 10104 (hash join tracing) is a good starting point.
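
For reference, both events can be set at session level before running the query; 10046 level 8 reports the wait events, and the 10104 level shown below is just a convenient choice rather than a significant one. Remember that the PX slaves are background processes, so their trace files should appear in the background dump destination.

alter session set events '10046 trace name context forever, level 8';	-- wait tracing
alter session set events '10104 trace name context forever, level 10';	-- hash join tracing

-- run the test query, then:
alter session set events '10046 trace name context off';
alter session set events '10104 trace name context off';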

The 10104 trace (9i version shown below) starts like this (the line numbers are my addition to the normal trace):

*** HASH JOIN STATISTICS (INITIALIZATION) ***
 1 Original memory: 10495760
 2 Memory after all overhead: 10342503
 3 Memory for slots: 9691136
 4 Calculated overhead for partitions and row/slot managers: 651367
 5 Hash-join fanout: 8
 6 Number of partitions: 9
 7 Number of slots: 13
 8 Multiblock IO: 91
 9 Block size(KB): 8
10 Cluster (slot) size(KB): 728
11 Minimum number of bytes per block: 8160
12 Bit vector memory allocation(KB): 512
13 Per partition bit vector length(KB): 64
14 Maximum possible row length: 1047
15 Estimated build size (KB): 2636
16 Estimated Row Length (includes overhead): 540
17 # Immutable Flags:
18   BUFFER the output of the join for Parallel Query

The trace reports 10MB of memory available for operating the hash join (line 1).

The optimizer has pre-empted the possibility of the hash table spilling to disc by allowing for 9 partitions (line 6) although it talks about the “fanout” being 8 (line 5). Now the number of partitions is “always” a power of 2 and “always” matches the fanout – so what’s that extra partition for ?

We see that the optimizer has decided to use units of 91 blocks for any I/O to the temporary tablespace (line 8) – this translates into a size of 728KB (line 10) as the unit I/O (“cluster”, or “slot”) size. (In terms of parameters, this setting can be dictated by the parameter hash_multiblock_io_count – which is hidden in newer versions of Oracle).

This “slot” size has a very important impact on the efficiency of the hash join. To start with, each partition needs a minimum of one slot to hold its data. This means that the minimum size in memory of the hash table will have to be at least 5.6MB (8 * 728KB = 5,824KB) because the hash table is going to be partitioned 8 ways. We can see this a little further down the trace file, as the hash table build completes.

*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 5029
   (used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 77792
### Partition Distribution ###
Partition:0    rows:577        clusters:1      slots:1      kept=1
Partition:1    rows:605        clusters:1      slots:1      kept=1
Partition:2    rows:634        clusters:1      slots:1      kept=1
Partition:3    rows:630        clusters:1      slots:1      kept=1
Partition:4    rows:635        clusters:1      slots:1      kept=1
Partition:5    rows:645        clusters:1      slots:1      kept=1
Partition:6    rows:665        clusters:1      slots:1      kept=1
Partition:7    rows:638        clusters:1      slots:1      kept=1

This means the entire hash table has been built completely in memory (kept = 1 for every partition) with a huge amount of space to spare. We have 5,029 rows in memory, and could handle an estimated 77,792 in the space allocated.

After the build completes, the 10046 trace details show the process reading data from the next virtual table (Q1); then, a bit further down the file, we get the following extra clues from the 10104 trace:

kxhfWrite: hash-join is spilling to disk
kxhfWrite: Writing dba=23177 slot=8 part=8
...
qerhjFetch: PQ Partition rows:10058      clusters:8      in-memory slots      4

We have been dumping partition 8 to disc – but the hash table is in partitions 0 to 7, so what are we dumping?

There are only two possibilities – either Oracle is copying the incoming probe table to partition 8 before it does the join, or it is writing the result of the join to partition 8 before re-reading it and forwarding it to the query coordinator. A couple of simple checks (e.g. adjusting the n1 values in t2 to ensure that the join returns no data) show that it’s the result set that is being dumped.
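
One way of doing the “no data” check – and this is just one possible variation rather than a definitive test – is to shift every t2.n1 outside the range covered by t1, so that the hash table is still built and probed but the join produces no rows:

update	t2 set n1 = n1 + 100000;	-- no value can now match t1.n1
commit;

-- re-gather stats on t2 and re-run the parallel query; with no result set to buffer,
-- the writes to the temporary tablespace should disappear.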

Looking back at lines 17 and 18 of the trace file, we can see that this was dictated by the “immutable flags”, which were set to “BUFFER the output of the join for Parallel Query”.

Why ?

This behaviour is astonishing – there is clearly no need to do things this way; it’s a total waste of effort and time. Why write, re-read, then forward the result set when it could simply be sent as it was created ? Alternatively, why doesn’t the PX slave write its virtual table (Q2) to the temporary tablespace and let the Query Co-ordinator read it ? The code seems to be doing something quite bizarre.

Here’s my guess. First, we’re looking at a special case; secondly, in the more general case a different read/write strategy could become quite complex – especially when the query started to run across multiple RAC instances.

Generically, a single DFO (data flow operation) may involve far more than two layers of parallel execution; but a single DFO can use only two sets of parallel execution slaves. So if a parallel execution plan has three “layers” of operations, what happens is this: slave set one does the first operation, slave set two does the second operation, and slave set one does the third operation.

But slave set two (in the middle) may have some output ready for the third operation (slave set one) before the first operation (also slave set one) has completed – which means slave set one won’t be in place to receive that output. If this happens, slave set two has to do something with its output – which explains why it needs some sort of buffering operation.

My guess is that it is this buffering operation that appears in our example, even though in this particular case it is “obviously” not needed.

If you accept the need for some sort of buffering mechanism, the next question is: why not make one layer of slaves write to the temporary tablespace and have the next layer read from it? This could cut out a huge amount of messaging between the two layers of slaves – the only information that would have to be sent through the SGA is a list of which blocks in the temporary segment should be read by which slaves.

This sounds reasonable – but the code needed to handle the generic case may be sufficiently complicated that the cost/risk/benefit analysis says that it’s not worth doing.

Think about a simple cycle where slave set 1 writes to the temporary tablespace for slave set 2 to read, and slave set 2 writes to the temporary tablespace for slave set 1. We don’t need a more complicated example because slave set 1 is not going to start reading for the third operation until after it has finished the first operation, so a double layer is as complex as things can get.

But, even though we have a limit of two layers, think about the number of communication channels. If we are running at degree N there are N slaves in slave set 1 which need to distribute data across N slaves in slave set 2, so we need to isolate N*N channels of communication; then we have to double that number because we could have two layers of communication live at once.

There is a Metalink note somewhere that points out that the amount of memory you should allocate for PX messaging in the large pool is something like 3 * parallel_execution_message_size * parallel_max_servers * parallel_max_servers / 4 (which allows 3 buffer pages per possible channel when one query runs at degree parallel_max_servers / 2).
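
If you want to put a number on that for your own system, a quick calculation from v$parameter will do it – this is just the formula above turned into SQL, not an official sizing method:

select
	3 * pems.value * pms.value * pms.value / 4	px_msg_pool_bytes
from
	(select to_number(value) value from v$parameter where name = 'parallel_execution_message_size') pems,
	(select to_number(value) value from v$parameter where name = 'parallel_max_servers')            pms
;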

If you tried to switch this messaging to the temporary tablespace, would you allocate one extent for each channel? Probably not, as that could need a huge, and sparsely used, temporary tablespace.

On the other hand, if you tried something like one private segment per query, or per operation per query, you would then have to write code that allows slaves in the same layer to negotiate with each other to ensure that they don’t overwrite each other’s blocks in that segment. This is the efficient, but complex, solution. And when you realise that this could include communication across a RAC interconnect, you can appreciate that Oracle Corp. might avoid rewriting the code unless it proved to be absolutely necessary for a number of really large customers.

So parallel queries may do far more I/O to the temporary tablespace than you expect (and this is just one of the reasons). The “HASH JOIN BUFFERED” operation that becomes visible in 10g (see line 3 of the plan above) may tell you where some of that I/O is coming from. (The same action appears in all versions I tested from 8i to 11g, by the way – it’s just that 10g is the first version to report the operation).

Damage limitation

If you do find that your parallel hash joins are working harder than you expect, check the plan, and check the distribution. It isn’t well known, but you can make a huge difference to the resource consumption by controlling exactly what Oracle does with parallel hash joins. In this particular case, switching to a “broadcast” distribution bypassed the need for any sort of buffering at all, and the I/O to the temporary tablespace disappeared completely. I’ll be writing more about that option some time in the future.
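
For reference, the broadcast version of the test query needs nothing more than a change to the pq_distribute() hint (using the syntax mentioned earlier); with t1 broadcast to the slaves scanning t2 the plan should lose its HASH JOIN BUFFERED operation, and the temporary tablespace I/O with it:

select
	/*+
		ordered
		use_hash(t2)
		parallel(t1 2)
		parallel(t2 2)
		pq_distribute(t2 broadcast none)
	*/
	t1.padding,
	t2.padding
from 	t1, t2
where	t2.n1 = t1.n1
and	t2.small_vc = t1.small_vc
;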

Footnote:

Although the hash join spilled to disc (direct path writes and reads to the temporary tablespace) it still appeared in v$sql_workarea as an optimal workarea operation, and the session statistics reported it (twice because of the parallelism) under “workarea executions – optimal”.
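
If you want to check this for yourself, queries like the following (10g column names – for 9i you would identify the statement by address and hash_value rather than sql_id) show the workarea figures and the session statistics; take a before/after snapshot of the statistics around the test to see the increment:

select
	operation_type, policy, last_execution, last_memory_used,
	optimal_executions, onepass_executions, multipasses_executions,
	last_tempseg_size
from
	v$sql_workarea
where
	sql_id = '&m_sql_id'		-- substitute the sql_id of the test query
;

select
	sn.name, ms.value
from
	v$statname	sn,
	v$mystat	ms
where
	ms.statistic# = sn.statistic#
and	sn.name like 'workarea executions%'
;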

12 Comments

  1. Hi Jonathan,

    One of our customers wants to have his temporary tablespaces put on RAM disks.
    It has been done in the past for other databases.
    We are a little reluctant to do so, but reading your article I conclude that they have a point there.
    Just setting pga_aggregate_target higher will not always reduce writing to the temporary tablespace, as I understand it.

    Regards Hans-Peter

    Comment by Hans-Peter Sloot — November 17, 2008 @ 11:03 am GMT Nov 17,2008

  2. Hans-Peter,

    As ever it’s a trade-off between how much risk it introduces (which is presumably why you are reluctant) and how much benefit they might get.

    The unexpected I/O from PX slaves might be sufficient to make this worth considering – on the other hand the difference in performance might be insignificant if they have a large SAN cache anyway.

    Comment by Jonathan Lewis — November 17, 2008 @ 10:34 pm GMT Nov 17,2008

  3. Jonathan,

    maybe this could be done also to cope with a slow client, or the risk of the client being slow. That is, “partition 8” (which buffers the result set) is dumped to the temp tablespace to enable the kernel to free all the RAM it allocated (which could be huge, of course). The result set rows are then read “slowly” as the client asks “slowly” for them by issuing FETCH db-calls (more precisely, as virtual table Q2 empties thanks to the client reading).

    I’m thinking of something similar to the mechanisms linked to the “sort_area_retained_size” parameter for serial sorts.

    It couldn’t be that bad when the client is fast enough to consume the result-set as it is produced: when the hash join completes, the buffer is empty, hence there will be nothing to dump.

    Comment by Alberto Dell'Era — December 22, 2008 @ 11:27 am GMT Dec 22,2008

  4. Jonathan,

    At a conference you said that until 10.2.0.2 Oracle used binary insertion trees to maintain a map for merges. What are the differences in the mechanism for sorting in 10.2.0.3, 10.2.0.4 (and 11g)?

    Regards.

    Comment by Ricardo — January 7, 2009 @ 2:33 pm GMT Jan 7,2009

  5. Ricardo,

    That was a description of the sorting mechanism – with the side effect that the memory needed for sorting N items of data required an overhead of roughly (N * 3 * pointer size) bytes – where a pointer is 4 bytes on 32-bit systems and 8 bytes on 64-bit systems.

    I don’t know the algorithm that is used for the so-called “V2” sort in 10.2, but it seems to need an overhead of just one pointer per item. Not only is it less memory intensive, the algorithm is also less CPU intensive.

    I think I suggested in my book that it might be a variation of the heap sort algorithm, but I recall an informal comment from an Oracle employee that it wasn’t.

    Comment by Jonathan Lewis — January 12, 2009 @ 3:15 pm GMT Jan 12,2009

  6. […] be confirmed: Does the ora_hash() function play any part in parallel query distribution by hash ? One quick test suggests not – but the method of use in this case may be more subtle than […]

    Pingback by ora_hash function « Oracle Scratchpad — November 23, 2009 @ 1:16 pm GMT Nov 23,2009

  7. Jonathan,

    in this thread, at this link:

    http://oracledoug.com/serendipity/index.php?/archives/774-Direct-Path-Reads.html

    The trick with the “direct path reads” is that Oracle manages to dispatch N buffers for read requests (probably 4), which is a CPU and memory intensive strategy, then comes back to the first buffer – which in this case requires a very short wait to fill. Then Oracle uses the data, dispatches a new read and moves on to the second buffer. In the interim the second read has nearly completed, but requires a few more microseconds… if Doug’s hardware were just a little faster, we wouldn’t see any direct path read waits.

    Jonathan, since parallel processing uses direct path reads: if my DOP is 3 and I have 3 producer processes producing data for the consumer processes, and the PGA is being used for reading data, in which area of the PGA is the data being buffered, and can we increase or decrease the size of that area?

    Comment by Amir Riaz — October 22, 2010 @ 10:08 am GMT Oct 22,2010

    • There are at least three different areas to consider.
      a) Buffer memory the process reserves for reading data blocks. This is process memory and is accounted for against the pga_aggregate_target; and I believe there is some scope for adjusting this by playing with the _db_file_direct_io_count (with the usual warning about hidden parameters, of course)

      b) Memory used by PX slaves to pass data to other PX slaves – this comes from the SGA, and possibly the only way to modify it would be to change the parallel_execution_message_size (originally the default was ca. 2KB).

      c) Memory used by a slave for the “PX Buffer” type of operation described in this note, and that comes out of a normal ‘pga workarea’ – with the usual limitations on how you can affect the amount of memory that can be used for workareas.

      Comment by Jonathan Lewis — October 22, 2010 @ 9:47 pm GMT Oct 22,2010

  8. Jonathan,

    >switching to a “broadcast” distribution bypassed the need for any sort of buffering at all, and the I/O to the temporary tablespace disappeared completely

    Documentation ( http://docs.oracle.com/cd/E11882_01/server.112/e26088/sql_elements006.htm#BABCJHAF Table 3-23 ) lays down a few rules about which distribution method should be used in which case (basically on the basis of the difference in size of the two tables being hash joined).

    Your comments on that ? Did you mean the same thing by suggesting “to switch to broadcast distribution in some cases” ?

    Regards,
    Amardeep Sidhu

    Comment by Amardeep Sidhu — July 27, 2012 @ 8:49 am GMT Jul 27,2012

    • Amardeep,

      Thank you for your comment – it has drawn my attention to the fact that 11g (possibly 11.2) has added a new feature to the pq_distribute hint that was not (or, at least, was not documented as) available in 10g: the “single distribution” option for dealing with data loading.

      The posting is simply an example of the type of investigation and level of knowledge required to make a decision of the type indicated by the manuals, so the quick answer to your last question is: “yes”.

      Comment by Jonathan Lewis — August 7, 2012 @ 6:04 pm GMT Aug 7,2012

  9. […] for Statistics in 11g, ending with Randolf Geist on Parallel Execution (including a reference to one of my older blog items which I now think could well be wrong – so I may have to spend Thursday reviewing it and […]

    Pingback by Catch-up « Oracle Scratchpad — December 11, 2012 @ 5:12 pm GMT Dec 11,2012


