Hash Joins

September 7, 2013

Hash Joins

Filed under: CBO,Execution plans,Hints,Oracle,Tuning — Jonathan Lewis @ 12:53 pm BST Sep 7,2013

I’ve written notes about the different join mechanisms in the past – but such things are always worth revisiting, so here’s an accumulated bundle of comments about hash joins.

A hash join takes two inputs that (in most of the Oracle literature) are referred to as the “build table” and the “probe table”. These rowsources may be extracts from real tables or indexes, or might be result sets from previous joins. Oracle uses the “build table” to build a hash table in memory, consuming and using the rowsource in a single call; it then consumes the “probe table” one row at a time, probing the in-memory hash table to find a match.

Access to the hash table is made efficient by use of a hashing function applied to the join columns – rows with the same value on the join column end up hashing to the same place in the hash table. It is possible for different input values to produce the same hash value (a hash collision) so Oracle still has to check the actual values once it has identified “probable” joins in the hash table. Because the comparison is based on a hashing mechanism, hash joins can only be used for join predicates that are equality predicates.

The hint to force a hash join is /*+ use_hash(rowsource_alias) */. This tells the optimizer that the join method to be used when “rowsource_alias” is the next rowsource in the join order should be a hash join; however it does not tell the optimizer whether that rowsource should be used as the build table or the probe table. To specify how the rowsource is used you need a second hint: no_swap_join_inputs(rowsource_alias) if you want Oracle to use the rowsource as the probe table, or swap_join_inputs(rowsource_alias) if you want Oracle to use it as the build table; for example, with a little cosmetic editing:


select  /*+ leading(table_1 table_2) use_hash(table_2) no_swap_join_inputs(table_2) */ *
from    t1 table_1, t2 table_2
where   table_2.n1 = table_1.n1
;

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 45000 |    16M|    44 |
|*  1 |  HASH JOIN         |      | 45000 |    16M|    44 |
|   2 |   TABLE ACCESS FULL| T1   |  3000 |   547K|    14 |
|   3 |   TABLE ACCESS FULL| T2   |  3000 |   547K|    14 |
-----------------------------------------------------------

select  /*+ leading(table_1 table_2) use_hash(table_2) swap_join_inputs(table_2) */ *
from    t1 table_1, t2 table_2
where   table_1.n1 = table_1.n1
;

-----------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost  |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      | 45000 |    16M|    44 |
|*  1 |  HASH JOIN         |      | 45000 |    16M|    44 |
|   2 |   TABLE ACCESS FULL| T2   |  3000 |   547K|    14 |
|   3 |   TABLE ACCESS FULL| T1   |  3000 |   547K|    14 |
-----------------------------------------------------------

If you were to look at the second execution plan without analysing the optimizer trace file you would probably assume that the join order (very specifically the thing that Oracle labels “join order” for the purposes of examining execution paths in a structured sequence) was (table_2, table_1) – it is important to remember that the “internal” join order and the apparent join order are not necessarily the same because the optimizer may have done some “side-swapping” for a plan using hash joins.

A common misunderstanding is that a hint of the form /*+ use_hash(table_1 table_2) */ is a directive to Oracle to do a hash join with table_1 as the build table and table_2 as the probe table. This is not the case; the hint in this form is simply a shorthand for a pair of single-table hints, viz: use_hash(table_1) use_hash(table_2). In the simplest case this translates as: “if table_1 is the second table in the join order then use a hash join to access it, if table_2 is the second table in the join order then use a hash join to access it”; given this interpretation it’s not too surprising that the hint appears to mean more than it really does.

There is a third hint that you can associate with hash joins, and this is used to dictate how one set of slaves should distribute data to the next set of slaves when running parallel queries – the pq_distribute() hint. The description in the manuals for this hint reports something like the following: pq_distribute( tablespec, inner_distribution, outer_distribution). I think this leaves some room for confusion – which table, for example, should be supplied in the tablespec, and how should you interpret inner and outer ? (The manual helpfully tells us that the outer_distribution is the distribution for the outer table, but that doesn’t really explain anything.) The “outer” table is the one specified in the hint, the “inner” table is the previous rowsource – and this doesn’t change even if you’ve used the swap_join_inputs() hint. Basically, when you hint a hash join for a table in a parallel query you need three hints to describe the hash join and for clarity you might as well make them three consecutive hints:


/*+
        use_hash(table_X)
        [no_]swap_join_inputs(table_X)
        pq_distribute(table_X {distribution for previous rowsource} {distribution for table_X})
*/

I won’t go into the details of the possible distribution methods and the effects they have. but here’s a little example demonstrating the hint. I’ve supplied an SQL statement (with the pq_distribute() hint commented out) and shown two plans – the first is the unhinted plan, the second shows the effect of my chosen distribution:

select  /*+
                leading(table_1 table_2)
                use_hash(table_2)
                swap_join_inputs(table_2)
--              pq_distribute(table_2 broadcast none)
        */ *
from    t1 table_1, t2 table_2
where   table_2.n1 = table_1.n1
;

-------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          | 45000 |    16M|    19 |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 | 45000 |    16M|    19 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED   |          | 45000 |    16M|    19 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |  3000 |   547K|     8 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T2       |  3000 |   547K|     8 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 |  3000 |   547K|     8 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL| T1       |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost  |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          | 45000 |    16M|    19 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 | 45000 |    16M|    19 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN             |          | 45000 |    16M|    19 |  Q1,01 | PCWP |            |
|   4 |     PX BLOCK ITERATOR    |          |  3000 |   547K|     8 |  Q1,01 | PCWC |            |
|   5 |      TABLE ACCESS FULL   | T2       |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
|   6 |     BUFFER SORT          |          |       |       |       |  Q1,01 | PCWC |            |
|   7 |      PX RECEIVE          |          |  3000 |   547K|     8 |  Q1,01 | PCWP |            |
|   8 |       PX SEND BROADCAST  | :TQ10000 |  3000 |   547K|     8 |  Q1,00 | P->P | BROADCAST  |
|   9 |        PX BLOCK ITERATOR |          |  3000 |   547K|     8 |  Q1,00 | PCWC |            |
|  10 |         TABLE ACCESS FULL| T1       |  3000 |   547K|     8 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------

I’ve dictated the join order (t1, t2) and a hash join into t2, but I’ve also told the optimizer to swap the join inputs so t2 becomes the build table and when you look at the execution plan the join order appears to be (t2, t1).

By default – i.e. in the first plan – the optimizer used the (hash hash) distribution – so the first set of parallel execution slaves scanned t2 (sharing the task of the full tablescan fairly, we hope) and distributed the data “pseudo-randomly” (the hashing algorithm on the join column) between the slaves in the second set; then it scanned t1 and distributed that data using the same hashing algorithm on the join column. The second set of slaves built and probed with the partial data sets they were given.

In the second plan the first set of slaves scanned its selected portion of t2 and built a hash table from it, after which [see footnote] the second set of slaves started scanning t1 – but each slave in the second set sent every row it had scanned to every slave in the first set. This example is a little unusual, I’ve chosen it simply to demonstrate the principle; normally you might expect to broadcast the hash table if it produced a “small” data set as this tends to minimise the number of PX messages passed between slaves.

It’s worth pointing out that there’s a lot more you can do to investigate parallelism and hash joins – especially if you’re working with the Exadata database machine – and I will be publishing further notes on the topic. I’ll leave you with one warning, though. If you change the join in the example to an outer join: “table_2.n1(+) = table_1.n1” the (broadcast none) distribution will become invalid so the pq_distribute() hint will be “ignored”, even though the use_hash() and swap_join_inputs() will still be followed – you have to be very careful when trying to engineer parallel queries to follow a specific path.

Footnote: taking a closer look at the set of trace files generated in the broadcast test I discovered that the first set of slaves start their parallel tablescan of t1 but stop after just one read from each slave, then the second set of slaves scans and builds the hash table from t2 before calling for further data from t1.

Comments (21)

21 Comments »

Hello Jonathan

You say:
If you were to look at the second execution plan without analysing the optimizer trace file you would probably assume that the join order (very specifically the thing that Oracle labels “join order” for the purposes of examining execution paths in a structured sequence) was (table_2, table_1)

Is this then only valid for the second execution plan? If so why is this not true for the first execution plan?

Regards Hans-Peter

Comment by Hans-Peter — September 10, 2013 @ 7:20 am BST Sep 10,2013 | Reply
- Hans-Peter,
  I’m not sure I understand the thinking behind your question.
  
  In the serial execution plans, the first plan “looks” as if the join order examined by the optimizer was [t1, t2] – although, technically the trace file might show that the plan came from considering the join order [t2. t1]. Since we can see the hint, and the plan looks as if it is obeying we might not consider reviewing the trace file.
  
  The second plan “looks” as if the join order examined by the optimizer was [t2, t1], even though we have a leading() hint that requires the optimizer to consider only the join order [t1, t2] – so, if we didn’t examine the trace file, we might conclude that the optimizer had ignored the leading() hint if we weren’t aware of the significance of the swap_join_inputs() hint.
  
  Comment by Jonathan Lewis — September 10, 2013 @ 8:03 am BST Sep 10,2013 | Reply
  - Hi Jonathan,
    
    This article is quite nice.
    
    However, I am not able to understand the second line of the second paragraph of your response to Hans-Peter’s query – “although, technically the trace file might show that the plan came from considering the join order [t2. t1]. ”
    
    My question is from where the join order [t2. t1] is coming in picture as far as the first SQL is concerned ?
    
    Regards,
    Sudipta.
    
    Comment by sudiptabiswas2 — October 5, 2013 @ 9:52 am BST Oct 5,2013 | Reply
    - Sudipta,
      
      If you read the whole comment again, paying most attention to the second half, you will see that I’ve said about the SECOND plan that it looks as if Oracle has produced it by examining the join order (t2, t1) even though the hint told the optimizer to consider only the join order (t1, t2) – the point is that when considering the join order (X, Y) for a hash join the optimizer is allowed to decide which of X and Y should be the build table and which the hash unless explicitly hinted.
      
      If you apply the same thinking to the FIRST plan, we can see that it appears to be from the join order (t1, t2) — but you cannot see in the plan whether or not it appeared because Oracle was considering join order (t1, t2) or join order (t2, t1). The point is that the join order is NOT visible in the picture, and you have to check the trace file to discover the actual join order (or, in this particular case, believe that the hints are correct).
      
      Comment by Jonathan Lewis — October 5, 2013 @ 10:27 am BST Oct 5,2013
Hi,
You have understood my correctly.
Thanks for the clarification.
Regards Hans-Peter

Comment by Hans-Peter — September 11, 2013 @ 11:29 am BST Sep 11,2013 | Reply
Thanks Jonathan for the reply. But again I have a question which stems from your reply.

Consider the first line “If you apply the same thinking to the FIRST plan, we can see that it appears to be from the join order (t1, t2) — but you cannot see in the plan whether or not it appeared because Oracle was considering join order (t1, t2) or join order (t2, t1)” of the 2nd paragraph in your reply (10:27 am BST Oct 5,2013)

My question here is : The 1st plan clearly shows the join order is (t1,t2). Then why have you said “but you cannot see in the plan whether or not it appeared because Oracle was considering join order (t1, t2) or join order (t2, t1)” ? If Oracle was considering a join order of (t2,t1) then why in the plan will a join order of (t1,t2) appear ?

Regards,
Sudipta.

Comment by sudiptabiswas2 — October 5, 2013 @ 12:29 pm BST Oct 5,2013 | Reply
- Sudipta,
  
  Note this fragment in the article:
- “… (very specifically the thing that Oracle labels “join order” for the purposes of examining execution paths in a structured sequence) …”

Oracle Scratchpad

September 7, 2013

Hash Joins

21 Comments »

Comments and related questions are welcome. Cancel reply

Click on the motto below

Everything Changes Eventually

Search this blog

Categories

Special Links

Recent Posts

Recent Comments

Popular articles

Popular References

Archives

Blogroll

Posts by RSS

Oracle Scratchpad

September 7, 2013

Hash Joins

Share this:

Related

21 Comments »

Comments and related questions are welcome. Cancel reply

Click on the motto below

Everything Changes Eventually

Search this blog

Categories

Special Links

Recent Posts

Recent Comments

Popular articles

Popular References

Archives

Blogroll

Posts by RSS