Oracle Scratchpad

March 1, 2019

Hash Optimisation-

Filed under: Oracle,Parallel Execution,Performance,trace files — Jonathan Lewis @ 4:39 pm BST Mar 1,2019

Franck Pachot did an interesting presentation at the OBUG (Belgium user group) Tech Days showing how to use one of the O/S debug/trace tools to step through the function calls that Oracle made during different types of joins. This prompted me to ask him a question about a possible optimisation of hash joins as follows:

The hash join operation creates an in-memory hash table from the rowsource produced by its first child operation then probes the hash table with rows from the row source produced by the second child operation; but if there are no rows in the first row source then there’s no need to acquire rows from the second row source, so Oracle doesn’t call the second child operation.

Does the hash join adopt a “symmetrical” strategy – starting the second child operation prematurely to find the first couple of rows before executing the first child operation and building the in-memory hash table ?

Franck got back to me before the end of the event with the answer – which was “no”. (This makes sense, of course, because if there’s no data in the second rowsource – which is supposed to be the “big” rowsource – you’ve probably done a lot of work which you may not have needed to do if you’d checked the first rowsource properly first.  (Thought – might Oracle look for the first row from the first row source, then check the second row source for a first row, then finish the first rowsource and build? How difficult do you want to make your life?)

So Franck got me an answer by looking at function calls – but I’d only asked him because I thought he might have done the experiment already and might be able to give me an answer immediately rather than having to waste his valuable Belgian Beer Drinking time to do some tests for me. (And I’ve never got around to playing with strace, ptrace, dtrace, truss, et. al. so I didn’t fancy doing the tracing for myself, of course). But this particular question doesn’t need any examination of function calls to get an answer: an extended trace file would be sufficient. Here’s how to start:


rem
rem     Script:         hash_join_opt.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2019
rem
rem     Last tested
rem             11.2.0.4
rem

create table t1
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,20)                  n20,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create table t2
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        mod(rownum,10)                  n10,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
end;
/

set serveroutput off

alter system flush buffer_cache;
execute dbms_lock.sleep(2)

alter session set statistics_level = all;
alter session set events '10046 trace name context forever, level 8';

select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last'));
alter session set events '10046 trace name context off';
alter session set statistics_level = typical;

I’ve created a pair of tables with similar volumes of data and then executed a query that should do a hash join between the tables. The t2 table will return no rows since the predicate on n10 is asking for a value that doesn’t exist but the optimizer will think it will supply 10% of the table since I’ve avoided creating a histogram on the column and the num_distinct for the n10 column is 10. You’ll notice that I’ve flushed the buffer_cache (with a couple of seconds pause, just in case) before executing the query. Here’s the execution plan with rowsource execution stats:


SQL_ID  bscu1r7cu36ur, child number 0
-------------------------------------
select  t1.v1, t2.v1 from  t1, t2 where  t1.n20 = 19 and t2.id = t1.id
and t2.n10 = 7.5

Plan hash value: 1838229974

-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |      0 |00:00:00.04 |     352 |    348 |       |       |          |
|*  1 |  HASH JOIN         |      |      1 |    500 |      0 |00:00:00.04 |     352 |    348 |  1519K|  1519K| 1487K (0)|
|*  2 |   TABLE ACCESS FULL| T1   |      1 |    500 |    500 |00:00:00.03 |     176 |    174 |       |       |          |
|*  3 |   TABLE ACCESS FULL| T2   |      1 |   1000 |      0 |00:00:00.01 |     176 |    174 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("T2"."ID"="T1"."ID")
   2 - filter("T1"."N20"=19)
   3 - filter("T2"."N10"=7.5)

As you can see, the tablescan of t2 returned no rows so the hash join returned no rows, but the stats show 174 blocks read from both t1 and t2. The fact that we read 174 blocks from t2 doesn’t mean we didn’t do a pre-emptive check, of course, as those reads could (by the hypothesis) have been the reads required by the pre-emptive check. The fact that we did a complete scan both tables, though, tells us that the reads on t2 (probably) weren’t from a pre-emptive check (if they were then Oracle shouldn’t have scanned the whole of t1).

We can get extra corroborative evidence though when we check the trace file – which shows us the following waits:

PARSING IN CURSOR #139927916187824 len=88 dep=0 uid=62 oct=3 lid=62 tim=1551445144496821 hv=3651246935 ad='97353e10' sqlid='bscu1r7cu36ur'
select
        t1.v1, t2.v1
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7.5
END OF STMT

PARSE #139927916187824:c=3000,e=2760,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=1838229974,tim=1551445144496811
EXEC #139927916187824:c=0,e=78,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144497155
WAIT #139927916187824: nam='SQL*Net message to client' ela= 8 driver id=1650815232 #bytes=1 p3=0 obj#=351199 tim=1551445144497224

WAIT #139927916187824: nam='db file sequential read' ela= 634 file#=5 block#=130 blocks=1 obj#=351198 tim=1551445144498386
WAIT #139927916187824: nam='db file scattered read' ela= 397 file#=5 block#=131 blocks=5 obj#=351198 tim=1551445144498971
WAIT #139927916187824: nam='db file scattered read' ela= 1752 file#=5 block#=136 blocks=8 obj#=351198 tim=1551445144502411
WAIT #139927916187824: nam='db file scattered read' ela= 426 file#=5 block#=145 blocks=7 obj#=351198 tim=1551445144504056
WAIT #139927916187824: nam='db file scattered read' ela= 590 file#=5 block#=152 blocks=8 obj#=351198 tim=1551445144505655
WAIT #139927916187824: nam='db file scattered read' ela= 8063 file#=5 block#=161 blocks=7 obj#=351198 tim=1551445144514574
WAIT #139927916187824: nam='db file scattered read' ela= 402 file#=5 block#=168 blocks=8 obj#=351198 tim=1551445144515741
WAIT #139927916187824: nam='db file scattered read' ela= 803 file#=5 block#=177 blocks=7 obj#=351198 tim=1551445144517495
WAIT #139927916187824: nam='db file scattered read' ela= 423 file#=5 block#=184 blocks=8 obj#=351198 tim=1551445144518743
WAIT #139927916187824: nam='db file scattered read' ela= 422 file#=5 block#=193 blocks=7 obj#=351198 tim=1551445144520696
WAIT #139927916187824: nam='db file scattered read' ela= 266 file#=5 block#=200 blocks=8 obj#=351198 tim=1551445144521460
WAIT #139927916187824: nam='db file scattered read' ela= 307 file#=5 block#=209 blocks=7 obj#=351198 tim=1551445144522249
WAIT #139927916187824: nam='db file scattered read' ela= 313 file#=5 block#=216 blocks=8 obj#=351198 tim=1551445144523166
WAIT #139927916187824: nam='db file scattered read' ela= 247 file#=5 block#=225 blocks=7 obj#=351198 tim=1551445144523927
WAIT #139927916187824: nam='db file scattered read' ela= 315 file#=5 block#=232 blocks=8 obj#=351198 tim=1551445144524646
WAIT #139927916187824: nam='db file scattered read' ela= 326 file#=5 block#=241 blocks=7 obj#=351198 tim=1551445144525565
WAIT #139927916187824: nam='db file scattered read' ela= 186 file#=5 block#=248 blocks=8 obj#=351198 tim=1551445144526255
WAIT #139927916187824: nam='db file scattered read' ela= 341 file#=5 block#=258 blocks=55 obj#=351198 tim=1551445144527363
----------------
WAIT #139927916187824: nam='db file sequential read' ela= 201 file#=5 block#=386 blocks=1 obj#=351199 tim=1551445144531165
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=387 blocks=5 obj#=351199 tim=1551445144531492
WAIT #139927916187824: nam='db file scattered read' ela= 194 file#=5 block#=392 blocks=8 obj#=351199 tim=1551445144531750
WAIT #139927916187824: nam='db file scattered read' ela= 301 file#=5 block#=401 blocks=7 obj#=351199 tim=1551445144532271
WAIT #139927916187824: nam='db file scattered read' ela= 272 file#=5 block#=408 blocks=8 obj#=351199 tim=1551445144532756
WAIT #139927916187824: nam='db file scattered read' ela= 258 file#=5 block#=417 blocks=7 obj#=351199 tim=1551445144533218
WAIT #139927916187824: nam='db file scattered read' ela= 242 file#=5 block#=424 blocks=8 obj#=351199 tim=1551445144533704
WAIT #139927916187824: nam='db file scattered read' ela= 232 file#=5 block#=433 blocks=7 obj#=351199 tim=1551445144534125
WAIT #139927916187824: nam='db file scattered read' ela= 213 file#=5 block#=440 blocks=8 obj#=351199 tim=1551445144534506
WAIT #139927916187824: nam='db file scattered read' ela= 241 file#=5 block#=449 blocks=7 obj#=351199 tim=1551445144534914
WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=456 blocks=8 obj#=351199 tim=1551445144535258
WAIT #139927916187824: nam='db file scattered read' ela= 311 file#=5 block#=465 blocks=7 obj#=351199 tim=1551445144536518
WAIT #139927916187824: nam='db file scattered read' ela= 215 file#=5 block#=472 blocks=8 obj#=351199 tim=1551445144536906
WAIT #139927916187824: nam='db file scattered read' ela= 164 file#=5 block#=481 blocks=7 obj#=351199 tim=1551445144537343
WAIT #139927916187824: nam='db file scattered read' ela= 160 file#=5 block#=488 blocks=8 obj#=351199 tim=1551445144537624
WAIT #139927916187824: nam='db file scattered read' ela= 253 file#=5 block#=497 blocks=7 obj#=351199 tim=1551445144538041
WAIT #139927916187824: nam='db file scattered read' ela= 252 file#=5 block#=504 blocks=8 obj#=351199 tim=1551445144538456
WAIT #139927916187824: nam='db file scattered read' ela= 947 file#=5 block#=514 blocks=55 obj#=351199 tim=1551445144539690

FETCH #139927916187824:c=25996,e=43227,p=348,cr=352,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144540502

I’ve inserted a few gaps into this section of the trace file but haven’t deleted any lines. As you can see there’s a series of reads for obj# 351198 (table t1), followed by a series of reads of obj# 351199. We don’t do anything subtle like peeking at t1, switching to peek at t2, then continuing with t1: it’s a simple end to end brute force scan of each table in turn.

But there’s more …

Before I created the test above I decided to check whether I had already done the test some time in the past and written about it. A search through my blog didn’t turn up any notes about this question, but it did uncover the following footnote to an earlier posting about hash joins:

taking a closer look at the set of trace files generated in the broadcast test I discovered that the first set of slaves start their parallel tablescan of t1 but stop after just one read from each slave, then the second set of slaves scans and builds the hash table from t2 before calling for further data from t1.

Naturally I included a link to the footnote in the draft notes for this blog entry and promptly forgot about writing the blog note until a couple of days ago when Chinar Aliyev produced a note on the hash join buffered operation in which he had included an execution plan similar to the broadcast plan from my earlier blog note – so I thought I’d take another look at it, because it looks like some versions of the parallel hash join can do exactly the pre-emptive test that the serial execution plan doesn’t.

So, here’s a query to run against the same data set, and the resulting execution plan – pulled from memory after executing the query:


select
        /*+
                parallel(2)
                leading(t1 t2)
                use_hash(t2)
                pq_distribute(t2 none broadcast)
        */
        t1.v1, t2.v1
--      t1.v1, count(*)
from
        t1, t2
where
        t1.n20 = 19
and     t2.id = t1.id
and     t2.n10 = 7
;

----------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows |    TQ  |IN-OUT| PQ Distrib | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |        |      |            |      0 |00:00:00.22 |      10 |      2 |       |       |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10001 |      0 |    500 |  Q1,01 | P->S | QC (RAND)  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      1 |    500 |  Q1,01 | PCWP |            |      0 |00:00:00.12 |     128 |    173 |  1483K|  1483K|     2/0/0|
|   4 |     PX BLOCK ITERATOR    |          |      2 |    500 |  Q1,01 | PCWC |            |    500 |00:00:00.03 |     248 |    173 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |     25 |    500 |  Q1,01 | PCWP |            |    500 |00:00:00.02 |     248 |    173 |       |       |          |
|   6 |     BUFFER SORT          |          |      2 |        |  Q1,01 | PCWC |            |   2000 |00:00:00.07 |       0 |      0 |   108K|   108K|     2/0/0|
|   7 |      PX RECEIVE          |          |      2 |   1000 |  Q1,01 | PCWP |            |   2000 |00:00:00.06 |       0 |      0 |       |       |          |
|   8 |       PX SEND BROADCAST  | :TQ10000 |      0 |   1000 |  Q1,00 | P->P | BROADCAST  |      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |   1000 |  Q1,00 | PCWC |            |   1000 |00:00:00.05 |     248 |    173 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     25 |   1000 |  Q1,00 | PCWP |            |   1000 |00:00:00.04 |     248 |    173 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Now the rule for reading a DFO tree in a parallel plan is to follow the table queues (a.k.a. DFO numbers / TQ numbers) – but in this case we have a bit of a mixture as the table scan of t1 isn’t used as the input to a distribution so it’s hard to know whether the first physical activity is supposed to be the tablescan of t2 (feeding the zeroth table queue) or the tablescan of t1 which doesn’t feed a table queue until after the hash join.

Given that t2 is being broadcast we should probably expect to see the first set of parallel query slaves starting first with a scan of “random” sections of t1 to build an in-memory hash table, and then see the second set of parallel query slaves scanning t2 and broadcasting the results to the first set of slaves (every slave gets a copy of every row) to do the probe. We can try to confirm this by looking at the trace files. First we extract “db file scattered read” lines from the four trace files and then we sort them by timestamp.

test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 677 file#=5 block#=220 blocks=4 obj#=351218 tim=1551450113700076
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 214 file#=5 block#=225 blocks=3 obj#=351218 tim=1551450113701131
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 238 file#=5 block#=161 blocks=7 obj#=351218 tim=1551450113702026
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 318 file#=5 block#=177 blocks=6 obj#=351218 tim=1551450113703464
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 371 file#=5 block#=190 blocks=2 obj#=351218 tim=1551450113705168
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 227 file#=5 block#=193 blocks=5 obj#=351218 tim=1551450113705690
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 225 file#=5 block#=138 blocks=6 obj#=351218 tim=1551450113706850
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 487 file#=5 block#=131 blocks=5 obj#=351218 tim=1551450113708512
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 228 file#=5 block#=136 blocks=2 obj#=351218 tim=1551450113709161
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 219 file#=5 block#=294 blocks=7 obj#=351218 tim=1551450113710099
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 150 file#=5 block#=184 blocks=6 obj#=351218 tim=1551450113711398
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 229 file#=5 block#=273 blocks=7 obj#=351218 tim=1551450113712456
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 223 file#=5 block#=280 blocks=7 obj#=351218 tim=1551450113713829
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 253 file#=5 block#=168 blocks=7 obj#=351218 tim=1551450113715124
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 1014 file#=5 block#=308 blocks=5 obj#=351218 tim=1551450113717046
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 266 file#=5 block#=235 blocks=5 obj#=351218 tim=1551450113718684
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 208 file#=5 block#=241 blocks=2 obj#=351218 tim=1551450113719321
----
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 326 file#=5 block#=287 blocks=7 obj#=351218 tim=1551450113700003
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 147 file#=5 block#=198 blocks=2 obj#=351218 tim=1551450113701394
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 72 file#=5 block#=200 blocks=5 obj#=351218 tim=1551450113701840
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 459 file#=5 block#=153 blocks=7 obj#=351218 tim=1551450113703442
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 308 file#=5 block#=259 blocks=7 obj#=351218 tim=1551450113705132
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 313 file#=5 block#=266 blocks=7 obj#=351218 tim=1551450113706540
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 304 file#=5 block#=243 blocks=5 obj#=351218 tim=1551450113707925
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 219 file#=5 block#=248 blocks=2 obj#=351218 tim=1551450113708505
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 590 file#=5 block#=228 blocks=4 obj#=351218 tim=1551450113709705
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 178 file#=5 block#=232 blocks=3 obj#=351218 tim=1551450113710270
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 142 file#=5 block#=205 blocks=3 obj#=351218 tim=1551450113711046
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 98 file#=5 block#=209 blocks=4 obj#=351218 tim=1551450113711457
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 179 file#=5 block#=213 blocks=3 obj#=351218 tim=1551450113712308
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 296 file#=5 block#=216 blocks=4 obj#=351218 tim=1551450113712967
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 188 file#=5 block#=301 blocks=7 obj#=351218 tim=1551450113714238
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 207 file#=5 block#=250 blocks=6 obj#=351218 tim=1551450113715393
test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 567 file#=5 block#=146 blocks=6 obj#=351218 tim=1551450113717686
----
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 196 file#=5 block#=522 blocks=7 obj#=351219 tim=1551450113722013
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 145 file#=5 block#=564 blocks=5 obj#=351219 tim=1551450113723239
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 317 file#=5 block#=491 blocks=5 obj#=351219 tim=1551450113724578
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 128 file#=5 block#=497 blocks=2 obj#=351219 tim=1551450113725217
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 495 file#=5 block#=402 blocks=6 obj#=351219 tim=1551450113726250
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 168 file#=5 block#=394 blocks=6 obj#=351219 tim=1551450113727928
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 249 file#=5 block#=424 blocks=7 obj#=351219 tim=1551450113730384
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 120 file#=5 block#=476 blocks=4 obj#=351219 tim=1551450113731225
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 485 file#=5 block#=481 blocks=3 obj#=351219 tim=1551450113732064
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 594 file#=5 block#=387 blocks=5 obj#=351219 tim=1551450113734168
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 323 file#=5 block#=392 blocks=2 obj#=351219 tim=1551450113735040
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 121 file#=5 block#=506 blocks=6 obj#=351219 tim=1551450113736253
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 285 file#=5 block#=440 blocks=6 obj#=351219 tim=1551450113739084
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 345 file#=5 block#=433 blocks=6 obj#=351219 tim=1551450113740940
----
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 155 file#=5 block#=484 blocks=4 obj#=351219 tim=1551450113721738
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 131 file#=5 block#=488 blocks=3 obj#=351219 tim=1551450113722260
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 143 file#=5 block#=557 blocks=7 obj#=351219 tim=1551450113723245
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 138 file#=5 block#=454 blocks=2 obj#=351219 tim=1551450113724545
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 124 file#=5 block#=456 blocks=5 obj#=351219 tim=1551450113724954
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 243 file#=5 block#=469 blocks=3 obj#=351219 tim=1551450113725947
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 523 file#=5 block#=472 blocks=4 obj#=351219 tim=1551450113726745
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 178 file#=5 block#=409 blocks=7 obj#=351219 tim=1551450113727781
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 230 file#=5 block#=543 blocks=7 obj#=351219 tim=1551450113729470
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 488 file#=5 block#=515 blocks=7 obj#=351219 tim=1551450113731618
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 356 file#=5 block#=461 blocks=3 obj#=351219 tim=1551450113733840
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 185 file#=5 block#=465 blocks=4 obj#=351219 tim=1551450113735195
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 284 file#=5 block#=536 blocks=7 obj#=351219 tim=1551450113736172
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 204 file#=5 block#=499 blocks=5 obj#=351219 tim=1551450113737957
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 306 file#=5 block#=504 blocks=2 obj#=351219 tim=1551450113738871
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 281 file#=5 block#=446 blocks=2 obj#=351219 tim=1551450113739710
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 158 file#=5 block#=449 blocks=5 obj#=351219 tim=1551450113740159
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 99 file#=5 block#=417 blocks=7 obj#=351219 tim=1551450113741709


tim=1551450113698677 obj#=351219
tim=1551450113698687 obj#=351219
----
tim=1551450113700003 obj#=351218
tim=1551450113700076 obj#=351218
tim=1551450113701131 obj#=351218
tim=1551450113701394 obj#=351218
tim=1551450113701840 obj#=351218
tim=1551450113702026 obj#=351218
tim=1551450113703442 obj#=351218
tim=1551450113703464 obj#=351218
tim=1551450113705132 obj#=351218
tim=1551450113705168 obj#=351218
tim=1551450113705690 obj#=351218
tim=1551450113706540 obj#=351218
tim=1551450113706850 obj#=351218
tim=1551450113707925 obj#=351218
tim=1551450113708505 obj#=351218
tim=1551450113708512 obj#=351218
tim=1551450113709161 obj#=351218
tim=1551450113709705 obj#=351218
tim=1551450113710099 obj#=351218
tim=1551450113710270 obj#=351218
tim=1551450113711046 obj#=351218
tim=1551450113711398 obj#=351218
tim=1551450113711457 obj#=351218
tim=1551450113712308 obj#=351218
tim=1551450113712456 obj#=351218
tim=1551450113712967 obj#=351218
tim=1551450113713829 obj#=351218
tim=1551450113714238 obj#=351218
tim=1551450113715124 obj#=351218
tim=1551450113715393 obj#=351218
tim=1551450113717046 obj#=351218
tim=1551450113717686 obj#=351218
tim=1551450113718684 obj#=351218
tim=1551450113719321 obj#=351218
--
tim=1551450113721738 obj#=351219
tim=1551450113722013 obj#=351219
tim=1551450113722260 obj#=351219
tim=1551450113723239 obj#=351219
tim=1551450113723245 obj#=351219
tim=1551450113724545 obj#=351219
tim=1551450113724578 obj#=351219
tim=1551450113724954 obj#=351219
tim=1551450113725217 obj#=351219
tim=1551450113725947 obj#=351219
tim=1551450113726250 obj#=351219
tim=1551450113726745 obj#=351219
tim=1551450113727781 obj#=351219
tim=1551450113727928 obj#=351219
tim=1551450113729470 obj#=351219
tim=1551450113730384 obj#=351219
tim=1551450113731225 obj#=351219
tim=1551450113731618 obj#=351219
tim=1551450113732064 obj#=351219
tim=1551450113733840 obj#=351219
tim=1551450113734168 obj#=351219
tim=1551450113735040 obj#=351219
tim=1551450113735195 obj#=351219
tim=1551450113736172 obj#=351219
tim=1551450113736253 obj#=351219
tim=1551450113737957 obj#=351219
tim=1551450113738871 obj#=351219
tim=1551450113739084 obj#=351219
tim=1551450113739710 obj#=351219
tim=1551450113740159 obj#=351219
tim=1551450113740940 obj#=351219
tim=1551450113741709 obj#=351219

I’ve got four slave processes p000 through to p0003, and in the top part of the output you can see that p000/p001 scan object 351218 (after several experiments the object_id for t1 has climbed a bit), and p002/p003 scan object 351219 (t2). Sorting by the timestamps (tim=), though, we can see that there are two reads for t2 (351219) followed by many reads of t1 (351218), finishing with a load of reads of t2. Cross-checking the timestamps on the sorted output with the original extract we can see that the two initial reads of t2 are the first reads by p002 and p003 respectively.

tim=1551450113698677 obj#=351219
test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677

tim=1551450113698687 obj#=351219
test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687

It really does look as if at run time Oracle does the scan and build for t1 first but starts with a quick peek at t2 to check that it’s going to return some data. We can be a little more fussy with this test – let’s change some of the data in t2 so that Oracle doesn’t find any rows in the first two scattered reads and see if anything changes.

This is going to take a little fiddling but in my case I can see that the first reads by slaves p002/p003 are looking at file 5, blocks 529 – 535 and blocks 550 – 556; and broadly speaking the scattered reads are working backwards down the table – so I’m going to update a load of data that appears in the second half of the table on the next run of the test with a simple command to put the n10 column value out of range:


update t2 set n10 = 0 where n10 = 7 and id > 750;
commit;

-- test the query

update t2 set n10 = 0 where n10 = 7;
commit;

-- test the query

After the first update the pattern of reads (in time order) changed do:

  • 22 scattered reads of t2
  • 34 scattered reads of t1
  • 12 scattered reads of t2

This really does look like Oracle checking whether t2 will have any data before doing the full scan and build from t1.

Unfortunately when I did the final update that eliminated all the interesting data from t2 the pattern (in time order) turned into:

  • 34 scattered reads of t2
  • 34 scattered reads of t1

In other words – having discovered NO data in t2 that could join Oracle still scanned and built the hash table from t1. So maybe there’s a reason other than a sanity check for peeking at the second table, or maybe there’s a bug in the extreme case where the second table produces no data at all. (For reference you might also want to see an older note on how a parallel hash join doesn’t skip the probe table when the build table is empty.)

Just as a quick check that I wasn’t chasing an old bug at this point I repeated the tests on 18.3.0.0 – the same patterns emerged.

 

1 Comment »

  1. My tweet announcing this post resulted in some interesting discussion about how Postgres handles hash joins (including the painful discovery that Oracle and Postgres present the build and hash tables the opposite way up in execution plans). The thread starts here.

    The post also resulted in a private email conversation which focused on the boundary cases of the optimisation – starting with the question:

    What if you spend a huge amount of effort scanning the probe table before finding any data, and then scan the build table and it takes no time to find that the hash table will be empty ?

    The guide to perfect design says you should work out all the boundary conditions and special cases before starting to create the program. (Though an alternative guide to perfect programming says you should code (and then destroy) the first approximation to find out how hard the simple bit is going to be before you try to cater for the complete requirement.) In the case of the hash join the underlying question is how to avoid doing work that could have been avoided if only you’d known in advance there was no data.

    So – the boundary conditions are:

    • The build table has no data (in which case you could avoid the work of searching for probe data – Oracle already does this for serial hash joins, but not for the parallel case.)
    • The probe table has no data (in which case you could avoid the work of searching for build data and building the hassh table).

    An important point to remember when reviewing the problem is that the (Oracle) optimizer will select the build and probe base on volume of data, the decision isn’t based on comparing the costs of acquiring the individual rowsources. (And having said that I wonder how many more cases I ought to test to see if this is a simplification of a more subtle rule.)

    So a more careful strategy for getting as lucky as possible seems to be:

    • Start scanning the lower cost table until you find the first row – if you complete the scan without finding data there’s no need to worry about the second table. (I’m going to ignore variations for dealing with left, right, and full outer joins in this comment.)
    • If you’ve found a row in the lower cost table, start scanning the higher cost table until you find a row – if you complete the scan without finding any data there’s no need to continue with the hash join.
    • If you’ve found a row in the higher cost table, continue scanning the table that’s been nominated as the build table, build the hash table, then continue with the table that’s been nominated as the probe table.

    I suppose a refinement on this would be to consider calculating the cost of finding the first row from each table and choosing the first scan based on the first_rows(1) costing – but the I’ve been saying “table” when really the two child operations could be complex sub-plans that would require a lot of work to calculate first row costs.

    Comment by Jonathan Lewis — March 4, 2019 @ 10:47 am BST Mar 4,2019 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.