Oracle Scratchpad

August 10, 2010

Joins – HJ

Filed under: CBO,Execution plans,Performance — Jonathan Lewis @ 6:43 pm BST Aug 10,2010

In the second note on my thesis that “all joins are nested loop joins with different startup costs” I want to look at hash joins, and I’ll start by going back to the execution plan I posted on “Joins – NLJ”. (For a quick reference list of URLs to all three articles in turn, see: Joins.)

---------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |              |     1 |   191 |     3   (0)| 00:00:01 |
|   1 |  NESTED LOOPS                |              |     1 |   191 |     3   (0)| 00:00:01 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1           |     2 |   152 |     3   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN          | T1_PK        |     2 |       |     2   (0)| 00:00:01 |
|*  4 |   TABLE ACCESS HASH          | HASHED_TABLE |     1 |   115 |            |          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("T1"."ID">=1 AND "T1"."ID"<=2)
   4 - access("HT"."ID"="T1"."N1")
       filter("HT"."SMALL_VC"='a')

In this query I created my second table as a “single table hash cluster” so, if I’ve done a perfect job (and been a bit lucky), for each row selected from the first table (t1) Oracle will hit one “cache buffers chains” latch to find the correct block, go directly to the correct row in that block, read the required columns, and drop the latch. This action is reported as a “consistent get – examination” and is probably the most efficient way of getting data from an “unpinned” block in the buffer cache. (For a few more esoteric details, see this note on my old website.)

So this particular path for a simple nested loop is about as efficient as it can be. But what performance threats does it still have, and can we make it better ?

What if it’s a very big single table hash cluster – perhaps every row is 500 bytes long and our query is only interested in a couple of columns that are 10 bytes long: in this case we may find that the table is so big that we can’t keep the blocks we want in memory as the query runs and have to keep repeating random I/Os to get at them. Even if we can keep the blocks in memory there may be other users who are also constantly looking at the same table and blocks: so we may find that we are constantly competing for the “cache buffers chains” latches that protect the blocks in that table. (Of course most tables are simple heap tables with b-tree indexes, in which case we could be walking through the root, branch, and leaf blocks of an index before hitting the table block – which means we could compete for several latches for each row we want to see from the second table in our nested loop.)

Can we do better than the nested loop into a single table hash cluster ? Yes, we can, if we really need to. We can copy just the bits of the table we need, just once, into a private, in-memory, single table hash cluster (or “hash table” as it’s usually called) before we start running the join. But if we do that we call it a hash join and the tables appear to be “back to front”.
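To make that idea concrete, here is a small conceptual sketch in Python (the names, columns and rows are invented for illustration; this shows the general build/probe pattern, not Oracle’s internal implementation):

```python
# Conceptual sketch of a hash join: build a private in-memory hash table
# (the "single table hash cluster in memory") from the smaller row source,
# then loop over the other row source doing one cheap lookup per row --
# i.e. a nested loop whose inner access is a hash probe.
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: one pass over the build table, keeping only the
    # columns we need. This is the "startup cost" of the join.
    hash_table = defaultdict(list)
    for row in build_rows:
        hash_table[row[build_key]].append(row)

    # Probe phase: one direct in-memory lookup per probe row -- no index
    # root/branch/leaf walk, no repeated buffer-cache latching per row.
    joined = []
    for row in probe_rows:
        for match in hash_table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

ht = [{"hash_value": 1, "ht_vc": "a"}, {"hash_value": 2, "ht_vc": "b"}]
t1 = [{"join_value": 1, "t1_vc": "x"}, {"join_value": 1, "t1_vc": "y"},
      {"join_value": 3, "t1_vc": "z"}]

joined = hash_join(ht, t1, "hash_value", "join_value")
# join_value 1 matches twice, join_value 3 not at all: 2 result rows.
```

The probe loop is still a nested loop; the point is that each iteration has become a single in-memory lookup rather than a set of buffer visits.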

Here’s an example, running under 11.1.0.6, to demonstrate the principle:


create table hash_target(
	hash_value	number,
	status		varchar2(2),
	small_vc	varchar2(10),
	padding		varchar2(100)
)
;

insert into hash_target 
select
	rownum,
	dbms_random.string('U',2),
	lpad(rownum,10),
	rpad('x',100)
from
	all_objects
where
	rownum <= 1000
;

create index ht_i1 on hash_target(status);
create unique index ht_i2 on hash_target(hash_value);

create table t1(
	join_value	number,
	status		varchar2(1),
	small_vc	varchar2(10),
	padding		varchar2(100)
)
;

insert into t1 
select
	mod(rownum,1000),
	dbms_random.string('U',1),
	lpad(rownum,10),
	rpad('x',100)
from
	all_objects
where
	rownum <= 20000
;

create index t1_i1 on t1(status);

-- gather stats at this point.

set autotrace traceonly explain

select
	t1.small_vc,
	ht.small_vc
from
	t1,
	hash_target	ht
where
	t1.status = 'X'
and	ht.hash_value = t1.join_value
and	ht.status = 'XX'
;

set autotrace off

The way I’ve written this query, a possible execution path would be a nested loop using an indexed access (or tablescan) into t1 to collect the data for t1.status = ‘X’, and for each row an indexed access path on the unique index into hash_target. Here’s a possible execution plan (with the 11g “nlj_batching” disabled to make the plan look more like the traditional 9i/10g plan):

---------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost  |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     3 |   105 |   827 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| HASH_TARGET |     1 |    18 |     1 |
|   2 |   NESTED LOOPS              |             |     3 |   105 |   827 |
|*  3 |    TABLE ACCESS FULL        | T1          |   769 | 13073 |    58 |
|*  4 |    INDEX UNIQUE SCAN        | HT_I2       |     1 |       |       |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("HT"."STATUS"='XX')
   3 - filter("T1"."STATUS"='X')
   4 - access("HT"."HASH_VALUE"="T1"."JOIN_VALUE")

This isn’t very efficient, of course, because we will pick up several hundred rows from t1, and index into hash_target several hundred times. We could improve efficiency in this case if hash_target were a single table hash cluster – but it isn’t, so Oracle fakes it. We pay the overhead of Oracle collecting all the relevant data (the select and join columns where ht.status = ‘XX’) and building this subset of the data as a “single table hash cluster in memory”, and then Oracle can do the nested loop join with “table access hash” into this in-memory table. But since the first physical operation was the access into hash_target that’s the table the optimizer reports as the first child of the join operation, and since it’s building an in-memory single table hash cluster it calls the operation a hash join.

----------------------------------------------------------------------------
| Id  | Operation                    | Name        | Rows  | Bytes | Cost  |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |             |     3 |   105 |    62 |
|*  1 |  HASH JOIN                   |             |     3 |   105 |    62 |
|   2 |   TABLE ACCESS BY INDEX ROWID| HASH_TARGET |     2 |    36 |     3 |
|*  3 |    INDEX RANGE SCAN          | HT_I1       |     2 |       |     1 |
|*  4 |   TABLE ACCESS FULL          | T1          |   769 | 13073 |    58 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("HT"."HASH_VALUE"="T1"."JOIN_VALUE")
   3 - access("HT"."STATUS"='XX')
   4 - filter("T1"."STATUS"='X')

So that’s it: a hash join is a nested loop join where Oracle has extracted a subset of the data from the thing that was the “inner” table and recreated it as an in-memory single table hash cluster. The startup cost is the cost of getting the data once and building the hash cluster in memory; the saving comes from eliminating the repeated buffer visits and latch competition that we would see if Oracle left the data in the buffer cache and accessed it through the typical indexed access mechanisms.

Inevitably the cost/saving equation becomes difficult when the “in-memory hash cluster” can’t fit in memory and spills to disc, increasing the physical I/O count … but if that’s the case then maybe the “nested loop by indexed access” might have been doing a lot of random disk I/O as well. But if you spend at least a little time thinking of hash joins as a variant of nested loops, it becomes easier to focus on what bottleneck you are trying to avoid and what efficiencies you are trying to achieve.

To be continued …

*** Footnote: for a nested loop join the first child table (or rowsource, if you want to be fussy) is usually called the “outer” table, and the second the “inner” table. For a hash join the first child table is called the “build” table – from which we build the in-memory hash cluster – and the second child table is called the “probe” table – which we use to probe the in-memory hash cluster.

26 Comments »

  1. Hi Jonathan,

    I am trying to understand the CBO concepts and variant join operations like Hash Join, FTS, Nested Loop, and the other items and Oracle hints. When i want to join two partitioned or non-partitioned tables together on their common column, most of the times i saw “nested loops”. When i force it to use the hash join (/*+ USE_HASH(A,B) */) it gets better and faster. I don’t want to generalize saying “hash joins are always better”. Do you have any comments on that issue and where to use the hash join or where to use the nested loops? One thing that i can add is when i want to join two same tables together to have the delta values, the nested loop is better with an index.

    Thank you for sharing, by the way; nice article.

    Ogan

    Comment by Ogan Ozdogan — August 10, 2010 @ 6:53 pm BST Aug 10,2010 | Reply

    • Ogan,

      I think you need to work through a copy of “Cost Based Oracle – Fundamentals” to get a solid grounding in how the optimizer handles basic mechanisms, otherwise my answer probably won’t help you understand what’s going on.

      “most of the times i saw “nested loops”. When i force it to use the hash join (/*+ USE_HASH(A,B) */) it gets better and faster”

      The choice of join is basically a consequence of predicted cardinality, precision of available indexes, apparent scattering of data (as indicated by the clustering_factor of the considered indexes) and the model the optimizer has of the I/O characteristics of the hardware (system statistics).

      In your case – especially with the partitioned join – there is a reasonable chance that the cardinality calculations have produced estimates that are too low, encouraging Oracle to use an index-driven path because it thinks that it won’t have to run through the index access many times.

      For more ideas on how to work out what constitutes a good path, and why Oracle may be choosing a bad path, you could read the article I wrote for Simple-Talk or listen to the web presentation I did for Embarcadero some time ago.

      Comment by Jonathan Lewis — August 12, 2010 @ 8:05 am BST Aug 12,2010 | Reply

      • Yes, I have to say, your book “Cost Based Oracle – Fundamentals” is so good, that I say to customers wanting to migrate to 11g: Don’t! For 10g we have JL’s book on the cost based optimizer! For 11g we don’t! (Please write a follow-up…)

        Comment by Xenofon Grigoriadis — August 23, 2010 @ 2:12 am BST Aug 23,2010 | Reply

  2. @Ogan — It could be due to the amount of data you are dealing with. If the smaller row set is really small, NL works out better as it avoids “startup” cost. But, if the data sets to join are large enough, HJ could be better. The reason you may be seeing NL could be due to various reasons: a) stats issues (table, index stats) b) bind values used by the optimizer to generate the plan were NL-favouring, while the majority of real values are HJ-conducive, etc.

    Also – you need to quantify your statement – “it gets better and faster” – is this just elapsed time, or does it also include LIO and CPU as well?

    Comment by Sachin — August 11, 2010 @ 6:07 pm BST Aug 11,2010 | Reply

    • Sachin,

      “If the smaller row set is really small, NL works out better as it avoids “startup” cost.”

      This statement seems to be ignoring the comments I made in the article – perhaps you were only thinking of it in a specific context which you didn’t state clearly.

      If the “larger row set” is (say) 10,000,000 rows, then copying 10 rows into a hash table and probing the hash table could be avoiding (potentially) 40,000,000 gets on cache buffers chains latches. The “startup” cost would be well worth that saving.
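      The 40,000,000 is simple back-of-envelope arithmetic; as a sketch (assuming roughly four latch gets per probed row, one each for the index root, branch and leaf blocks and the table block):

```python
# Rough arithmetic behind the 40M ball-park figure above. The 4 gets/row
# is an approximation (root + branch + leaf + table block, each buffer
# visit needing at least one "cache buffers chains" latch get).
probe_rows = 10_000_000
latch_gets_per_row = 4

nested_loop_latch_gets = probe_rows * latch_gets_per_row
print(nested_loop_latch_gets)  # -> 40000000
```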

      Comment by Jonathan Lewis — August 12, 2010 @ 7:57 am BST Aug 12,2010 | Reply

      • Jonathan,

        Help me clear this doubt.

        We have 2 data-sets, one v.large and one v.small – think of one table having a PK (with say 100 rows) and another child table having an FK (referencing the PK of the 1st table) with 10,000,000 rows, and I’m joining these 2 data-sets. Wouldn’t NL be the logical choice, as the smaller data-set will be my driving data-set and the larger data-set my driven data-set? In this case, how can HJ be better when there will be a startup cost involved to build the hash cluster in memory?

        How will this cause cache buffers chains latch contention? Yes, if there’s a lot of concurrent access the root block of the driven (large) data-set may get hammered, due to which we may see cache buffers chains latch waits. But in the case of normal (not high) concurrent access, why will NL cause this wait?

        Comment by Vipin — August 12, 2010 @ 7:37 pm BST Aug 12,2010 | Reply

      • “If the “larger row set” is (say) 10,000,000 rows, then copying 10 rows into a hash table and probing the hash table could be avoiding (potentially) 40,000,000 gets on cache buffers chains latches. The “startup” cost would be well worth that saving”

        Could you explain this with an example?

        I feel that can only happen when the larger table (10 million rows) is the driving table (for NL) – which is a bleak possibility. If the smaller table/row-source is say 10 rows, I’m scanning the bigger table (with index) 10 times –
        effective LIO = gets from smaller rowset + 10*3 (root+branch+leaf) + gets from larger table (guessing 10).

        How can I get 40,000,000 gets?

        In the case of HJ — Oracle needs to scan the big data source against the hash table – that may be a cost to incur?

        Comment by Sachin — August 16, 2010 @ 7:17 pm BST Aug 16,2010 | Reply

        • Sachin,

          If you’re expecting to join 10 rows from a small table to 10 rows from a large table, you don’t have a “small data set” and a “large data set” – you have two small data sets and the topic is irrelevant.

          If you have Vipin’s data sets: (100 rows in table with PK, 10,000,000 rows in table with FK) then do you want to join them with:

          NLJ from small to big
          NLJ from big to small
          NLJ from big to small after (permanently) recreating small as a single table hash cluster
          HJ with small as build table
          MJ – but we’ll ignore that one for now

          Now you can probably see the sense of comparing NLJ with HJ.

          In passing, I was talking about latch gets, not buffer visits – some buffer visits require two latch gets. Depending on version, infrastructure, data patterns, and environment I could probably make the actual number vary from a few tens of thousands to an arbitrarily large number. The 40M is a ball-park figure to indicate order of magnitude in a fairly bland boring system joining a couple of heap tables with indexes.

          Comment by Jonathan Lewis — August 17, 2010 @ 11:32 am BST Aug 17,2010

  3. How does creating a temporary store table get around possible data change in the “parent” table, as a CREATE TABLE performs a silent COMMIT? Or am I missing the point?

    Comment by Nigel — August 12, 2010 @ 8:22 am BST Aug 12,2010 | Reply

    • Nigel,

      The sample code is a model helping to demonstrate the internal mechanism used by Oracle, and the creation of the “temporary store table” is an internal operation (think subquery factoring and materializing). I have modified the wording to try and make this more obvious.

      Of course, the idea of keeping a subset (by row and column) of some critical data in a copy table with a different structure from the base table is one that I have used successfully in the past to improve query performance, it’s just one of many variations on the theme of materialized views … and there’s always scope for trading performance and correctness when using materialized views.

      Comment by Jonathan Lewis — August 12, 2010 @ 8:48 am BST Aug 12,2010 | Reply

  4. Hi Jonathan,
    What is NLJ_BATCHING? Can’t find any sane info on the internet.

    Comment by Alexandr — August 12, 2010 @ 12:32 pm BST Aug 12,2010 | Reply

    • Alexandr,

      There are currently three shapes of execution plans you can get for a nested loop join (as of 11g) – the style that we usually saw in earlier versions of Oracle; pre-fetching, which appeared in 9i (with the hint (no_)nlj_prefetch in 11g); and batching, which appeared in 11g (with the hint (no_)nlj_batching).

      I have some ideas about pre-fetching, but nothing yet about nlj_batching. When I have something to say, I’ll publish it here.

      Comment by Jonathan Lewis — August 16, 2010 @ 3:52 pm BST Aug 16,2010 | Reply

    • Alexandr,
      Dion has written a nice post about NLJ_BATCHING

      http://dioncho.wordpress.com/2010/08/16/batching-nlj-optimization-and-ordering/

      Comment by radino — August 16, 2010 @ 5:13 pm BST Aug 16,2010 | Reply

      • Radino,
        Thanks for that – it’s a useful reference that gives us an important clue to where the “batching” appears.

        I like his use of the “repeatable test case”.

        Comment by Jonathan Lewis — August 17, 2010 @ 11:15 am BST Aug 17,2010 | Reply

      • Radino,

        I was just browsing some notes by Alex Fatkulin, and it crossed my mind that perhaps NLJ_BATCHING may be a mechanism that allows the “consistent gets (fastpath)” to appear more frequently.

        (Just another thing to add to the list of ideas when I get around to looking at it properly.)

        Comment by Jonathan Lewis — August 19, 2010 @ 6:58 pm BST Aug 19,2010 | Reply

  5. “Of course most tables are simple heap table with b-tree indexes, in which we could be walking through the root, branch, and leaf blocks of an index before hitting the table block – which means we could compete for several latches for each row we want to see from the second table in our nested loop”

    That is key (pardon the pun !). Before I get to the table block, I may be reading 3 or 4 Index blocks, each time ?!

    Hemant K Chitale

    Comment by Hemant K Chitale — August 13, 2010 @ 9:51 am BST Aug 13,2010 | Reply

  6. So – is this conclusive to say that HJ is always better than NL (except when there is almost no concurrency)?

    Comment by Vipin — August 13, 2010 @ 10:27 am BST Aug 13,2010 | Reply

    • Take note of the paragraph that starts: “What if …”

      This is a very important phrase – it helps you to identify the cases when an idea might be a good one, and when it might be a bad one.

      Consider the following, for example: “What if there is no cheap efficient way to extract the small data set to build the hash table?”

      With that question in front of you, can you answer your own question ? Can you think of a few more cases where the answer to your question is “No”.

      Comment by Jonathan Lewis — August 16, 2010 @ 4:06 pm BST Aug 16,2010 | Reply

  7. Hi Jonathan,

    Thanks for the wonderful article, this was very helpful.

    I have a few doubts though; maybe my Oracle knowledge isn’t good enough :(

    Could you please elaborate on this statement from the explanation above?

    a hash join is a nested loop join where Oracle has extracted a subset of the data from the thing that was the “inner” table.

    I just want to know how Oracle does this: is it that we create the hash table and then fire the select query? Kindly clarify this.

    Thanks,
    Kartik

    Comment by kartik — August 19, 2010 @ 9:45 am BST Aug 19,2010 | Reply

  8. Jonathan,

    I am a bit confused. For hashing you need a hash function, which Oracle calculates. Secondly, if I have two tables, T1 containing 10 unique rows and T2 containing 1000 non-unique rows, will Oracle place T1 in PGA memory or T2? Or will it create an in-memory hash table with T2, get T1 from the buffer cache and, for each row of T1, get the subset of matching rows from T2, nested-loop them and return them to us? Will you please elaborate?

    regards
    Amir Riaz

    Comment by Amir Riaz — October 17, 2010 @ 7:30 pm BST Oct 17,2010 | Reply

  9. Amir Riaz,

    I do not know the hashing function used by Oracle. However, the in-memory hash table will have a number of buckets that is a power of 2 (in fact it might even be a power of 4) – and Oracle has a couple of hash functions based on generating a value between 1 and 2^N that could be used.

    Based on your description there is no guarantee that Oracle would choose to do a hash join – but if it did, and if its predicted cardinalities were realistic assessments of the figures you have given me then it would probably build the hash table from the 10 rows from t1 and probe it with the 1,000 rows from t2.

    Comment by Jonathan Lewis — October 19, 2010 @ 11:09 am BST Oct 19,2010 | Reply

  10. So which table should completely fit in the PGA’s hash_area_size memory? In the above example, since T1 is the parent table and T2 is the child, T2 was bigger. In simple terms, if Oracle creates the hash table from T1 (10 rows) and probes it with T2: the entire probe input is scanned one row at a time, and for each probe row the hash key’s value is computed, the corresponding hash bucket is scanned, and the matches are produced. Since T1 is the in-memory hash table, the hash buckets hold T1’s rows, and in this case each bucket will hold only one row.

    In the case where we create the hash table from T2’s non-unique rows, which sit in the hash buckets, and probe it with T1 (i.e. take a row from T1, calculate its hash key, and then use the hash key to reach the hash bucket of the in-memory table T2), this theory looks more appealing to me.

    Comment by Amir Riaz — October 20, 2010 @ 8:04 pm BST Oct 20,2010 | Reply

    • Amir,
      You still haven’t described your scenario with sufficient accuracy.

      Consider: if your 10 “unique” rows in T1 are 32KB each and your 1,000 “non-unique” rows in T2 are 12 bytes each, then Oracle might choose to build the hash table from T1.

      When you say that the rows in T1 are unique, do you mean that they are unique as far as the join column(s) are concerned – and are the 1,000 “non-unique” rows distributed so that there are roughly 100 rows from T2 to join to each row in T1 ?

      As far as the appeal of your hypothesis is concerned, you seem to be thinking only of the speed with which the hash value can be calculated. Have you considered the extra cost (and mechanics) of creating a hash table that contains more data ? Have you considered what happens if the larger data set can’t fit into memory ? Have you considered the cost of “probing a hash bucket” and finding that you have to follow a linked list to another 99 matching values in that bucket – if that 100 to 1 join is the thought behind your example.

      Comment by Jonathan Lewis — October 20, 2010 @ 8:54 pm BST Oct 20,2010 | Reply

  11. Great,

    I was not able to see that aspect. Truly impressed with your way of thinking.

    Thanks

    Comment by Amir Riaz — October 21, 2010 @ 6:22 pm BST Oct 21,2010 | Reply

  12. […] are only three join mechanisms used by Oracle: merge join, hash join and nested loop […]

    Pingback by Joins | Oracle Scratchpad — June 5, 2014 @ 8:02 am BST Jun 5,2014 | Reply

  13. […] For more information check out Jonathan Lewis-Joins – HJ […]

    Pingback by Hash Joins | jagdeepsangwan — June 23, 2014 @ 10:31 am BST Jun 23,2014 | Reply

