Franck Pachot did an interesting presentation at the OBUG (Belgium user group) Tech Days showing how to use one of the O/S debug/trace tools to step through the function calls that Oracle made during different types of joins. This prompted me to ask him a question about a possible optimisation of hash joins as follows:
The hash join operation creates an in-memory hash table from the rowsource produced by its first child operation then probes the hash table with rows from the row source produced by the second child operation; but if there are no rows in the first row source then there’s no need to acquire rows from the second row source, so Oracle doesn’t call the second child operation.
Does the hash join adopt a “symmetrical” strategy – starting the second child operation prematurely to find the first couple of rows before executing the first child operation and building the in-memory hash table ?
Franck got back to me before the end of the event with the answer – which was “no”. (This makes sense, of course, because if there’s no data in the second rowsource – which is supposed to be the “big” rowsource – you’ve probably done a lot of work which you may not have needed to do if you’d checked the first rowsource properly first. (Thought – might Oracle look for the first row from the first row source, then check the second row source for a first row, then finish the first rowsource and build? How difficult do you want to make your life?)
So Franck got me an answer by looking at function calls – but I’d only asked him because I thought he might have done the experiment already and might be able to give me an answer immediately rather than having to waste his valuable Belgian Beer Drinking time to do some tests for me. (And I’ve never got around to playing with strace, ptrace, dtrace, truss, et. al. so I didn’t fancy doing the tracing for myself, of course). But this particular question doesn’t need any examination of function calls to get an answer: an extended trace file would be sufficient. Here’s how to start:
rem rem Script: hash_join_opt.sql rem Author: Jonathan Lewis rem Dated: Feb 2019 rem rem Last tested rem 11.2.0.4 rem create table t1 segment creation immediate nologging as with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum,20) n20, lpad(rownum,10,'0') v1, lpad('x',100,'x') padding from generator v1, generator v2 where rownum <= 1e4 -- > comment to avoid WordPress format issue ; create table t2 segment creation immediate nologging as with generator as ( select rownum id from dual connect by level <= 1e4 -- > comment to avoid WordPress format issue ) select rownum id, mod(rownum,10) n10, lpad(rownum,10,'0') v1, lpad('x',100,'x') padding from generator v1, generator v2 where rownum <= 1e4 -- > comment to avoid WordPress format issue ; begin dbms_stats.gather_table_stats( ownname => null, tabname => 'T1', method_opt => 'for all columns size 1' ); dbms_stats.gather_table_stats( ownname => null, tabname => 'T2', method_opt => 'for all columns size 1' ); end; / set serveroutput off alter system flush buffer_cache; execute dbms_lock.sleep(2) alter session set statistics_level = all; alter session set events '10046 trace name context forever, level 8'; select t1.v1, t2.v1 from t1, t2 where t1.n20 = 19 and t2.id = t1.id and t2.n10 = 7.5 ; select * from table(dbms_xplan.display_cursor(null,null,'allstats last')); alter session set events '10046 trace name context off'; alter session set statistics_level = typical;
I’ve created a pair of tables with similar volumes of data and then executed a query that should do a hash join between the tables. The t2 table will return no rows since the predicate on n10 is asking for a value that doesn’t exist but the optimizer will think it will supply 10% of the table since I’ve avoided creating a histogram on the column and the num_distinct for the n10 column is 10. You’ll notice that I’ve flushed the buffer_cache (with a couple of seconds pause, just in case) before executing the query. Here’s the execution plan with rowsource execution stats:
SQL_ID bscu1r7cu36ur, child number 0 ------------------------------------- select t1.v1, t2.v1 from t1, t2 where t1.n20 = 19 and t2.id = t1.id and t2.n10 = 7.5 Plan hash value: 1838229974 ------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-Mem | ------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 0 |00:00:00.04 | 352 | 348 | | | | |* 1 | HASH JOIN | | 1 | 500 | 0 |00:00:00.04 | 352 | 348 | 1519K| 1519K| 1487K (0)| |* 2 | TABLE ACCESS FULL| T1 | 1 | 500 | 500 |00:00:00.03 | 176 | 174 | | | | |* 3 | TABLE ACCESS FULL| T2 | 1 | 1000 | 0 |00:00:00.01 | 176 | 174 | | | | ------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - access("T2"."ID"="T1"."ID") 2 - filter("T1"."N20"=19) 3 - filter("T2"."N10"=7.5)
As you can see, the tablescan of t2 returned no rows so the hash join returned no rows, but the stats show 174 blocks read from both t1 and t2. The fact that we read 174 blocks from t2 doesn’t mean we didn’t do a pre-emptive check, of course, as those reads could (by the hypothesis) have been the reads required by the pre-emptive check. The fact that we did a complete scan both tables, though, tells us that the reads on t2 (probably) weren’t from a pre-emptive check (if they were then Oracle shouldn’t have scanned the whole of t1).
We can get extra corroborative evidence though when we check the trace file – which shows us the following waits:
PARSING IN CURSOR #139927916187824 len=88 dep=0 uid=62 oct=3 lid=62 tim=1551445144496821 hv=3651246935 ad='97353e10' sqlid='bscu1r7cu36ur' select t1.v1, t2.v1 from t1, t2 where t1.n20 = 19 and t2.id = t1.id and t2.n10 = 7.5 END OF STMT PARSE #139927916187824:c=3000,e=2760,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=1838229974,tim=1551445144496811 EXEC #139927916187824:c=0,e=78,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144497155 WAIT #139927916187824: nam='SQL*Net message to client' ela= 8 driver id=1650815232 #bytes=1 p3=0 obj#=351199 tim=1551445144497224 WAIT #139927916187824: nam='db file sequential read' ela= 634 file#=5 block#=130 blocks=1 obj#=351198 tim=1551445144498386 WAIT #139927916187824: nam='db file scattered read' ela= 397 file#=5 block#=131 blocks=5 obj#=351198 tim=1551445144498971 WAIT #139927916187824: nam='db file scattered read' ela= 1752 file#=5 block#=136 blocks=8 obj#=351198 tim=1551445144502411 WAIT #139927916187824: nam='db file scattered read' ela= 426 file#=5 block#=145 blocks=7 obj#=351198 tim=1551445144504056 WAIT #139927916187824: nam='db file scattered read' ela= 590 file#=5 block#=152 blocks=8 obj#=351198 tim=1551445144505655 WAIT #139927916187824: nam='db file scattered read' ela= 8063 file#=5 block#=161 blocks=7 obj#=351198 tim=1551445144514574 WAIT #139927916187824: nam='db file scattered read' ela= 402 file#=5 block#=168 blocks=8 obj#=351198 tim=1551445144515741 WAIT #139927916187824: nam='db file scattered read' ela= 803 file#=5 block#=177 blocks=7 obj#=351198 tim=1551445144517495 WAIT #139927916187824: nam='db file scattered read' ela= 423 file#=5 block#=184 blocks=8 obj#=351198 tim=1551445144518743 WAIT #139927916187824: nam='db file scattered read' ela= 422 file#=5 block#=193 blocks=7 obj#=351198 tim=1551445144520696 WAIT #139927916187824: nam='db file scattered read' ela= 266 file#=5 block#=200 blocks=8 obj#=351198 tim=1551445144521460 WAIT #139927916187824: nam='db file scattered read' ela= 307 file#=5 block#=209 blocks=7 obj#=351198 tim=1551445144522249 WAIT #139927916187824: nam='db file scattered read' ela= 313 file#=5 block#=216 blocks=8 obj#=351198 tim=1551445144523166 WAIT #139927916187824: nam='db file scattered read' ela= 247 file#=5 block#=225 blocks=7 obj#=351198 tim=1551445144523927 WAIT #139927916187824: nam='db file scattered read' ela= 315 file#=5 block#=232 blocks=8 obj#=351198 tim=1551445144524646 WAIT #139927916187824: nam='db file scattered read' ela= 326 file#=5 block#=241 blocks=7 obj#=351198 tim=1551445144525565 WAIT #139927916187824: nam='db file scattered read' ela= 186 file#=5 block#=248 blocks=8 obj#=351198 tim=1551445144526255 WAIT #139927916187824: nam='db file scattered read' ela= 341 file#=5 block#=258 blocks=55 obj#=351198 tim=1551445144527363 ---------------- WAIT #139927916187824: nam='db file sequential read' ela= 201 file#=5 block#=386 blocks=1 obj#=351199 tim=1551445144531165 WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=387 blocks=5 obj#=351199 tim=1551445144531492 WAIT #139927916187824: nam='db file scattered read' ela= 194 file#=5 block#=392 blocks=8 obj#=351199 tim=1551445144531750 WAIT #139927916187824: nam='db file scattered read' ela= 301 file#=5 block#=401 blocks=7 obj#=351199 tim=1551445144532271 WAIT #139927916187824: nam='db file scattered read' ela= 272 file#=5 block#=408 blocks=8 obj#=351199 tim=1551445144532756 WAIT #139927916187824: nam='db file scattered read' ela= 258 file#=5 block#=417 blocks=7 obj#=351199 tim=1551445144533218 WAIT #139927916187824: nam='db file scattered read' ela= 242 file#=5 block#=424 blocks=8 obj#=351199 tim=1551445144533704 WAIT #139927916187824: nam='db file scattered read' ela= 232 file#=5 block#=433 blocks=7 obj#=351199 tim=1551445144534125 WAIT #139927916187824: nam='db file scattered read' ela= 213 file#=5 block#=440 blocks=8 obj#=351199 tim=1551445144534506 WAIT #139927916187824: nam='db file scattered read' ela= 241 file#=5 block#=449 blocks=7 obj#=351199 tim=1551445144534914 WAIT #139927916187824: nam='db file scattered read' ela= 221 file#=5 block#=456 blocks=8 obj#=351199 tim=1551445144535258 WAIT #139927916187824: nam='db file scattered read' ela= 311 file#=5 block#=465 blocks=7 obj#=351199 tim=1551445144536518 WAIT #139927916187824: nam='db file scattered read' ela= 215 file#=5 block#=472 blocks=8 obj#=351199 tim=1551445144536906 WAIT #139927916187824: nam='db file scattered read' ela= 164 file#=5 block#=481 blocks=7 obj#=351199 tim=1551445144537343 WAIT #139927916187824: nam='db file scattered read' ela= 160 file#=5 block#=488 blocks=8 obj#=351199 tim=1551445144537624 WAIT #139927916187824: nam='db file scattered read' ela= 253 file#=5 block#=497 blocks=7 obj#=351199 tim=1551445144538041 WAIT #139927916187824: nam='db file scattered read' ela= 252 file#=5 block#=504 blocks=8 obj#=351199 tim=1551445144538456 WAIT #139927916187824: nam='db file scattered read' ela= 947 file#=5 block#=514 blocks=55 obj#=351199 tim=1551445144539690 FETCH #139927916187824:c=25996,e=43227,p=348,cr=352,cu=0,mis=0,r=0,dep=0,og=1,plh=1838229974,tim=1551445144540502
I’ve inserted a few gaps into this section of the trace file but haven’t deleted any lines. As you can see there’s a series of reads for obj# 351198 (table t1), followed by a series of reads of obj# 351199. We don’t do anything subtle like peeking at t1, switching to peek at t2, then continuing with t1: it’s a simple end to end brute force scan of each table in turn.
But there’s more …
Before I created the test above I decided to check whether I had already done the test some time in the past and written about it. A search through my blog didn’t turn up any notes about this question, but it did uncover the following footnote to an earlier posting about hash joins:
taking a closer look at the set of trace files generated in the broadcast test I discovered that the first set of slaves start their parallel tablescan of t1 but stop after just one read from each slave, then the second set of slaves scans and builds the hash table from t2 before calling for further data from t1.
Naturally I included a link to the footnote in the draft notes for this blog entry and promptly forgot about writing the blog note until a couple of days ago when Chinar Aliyev produced a note on the hash join buffered operation in which he had included an execution plan similar to the broadcast plan from my earlier blog note – so I thought I’d take another look at it, because it looks like some versions of the parallel hash join can do exactly the pre-emptive test that the serial execution plan doesn’t.
So, here’s a query to run against the same data set, and the resulting execution plan – pulled from memory after executing the query:
select /*+ parallel(2) leading(t1 t2) use_hash(t2) pq_distribute(t2 none broadcast) */ t1.v1, t2.v1 -- t1.v1, count(*) from t1, t2 where t1.n20 = 19 and t2.id = t1.id and t2.n10 = 7 ; ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | O/1/M | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | | | | 0 |00:00:00.22 | 10 | 2 | | | | | 1 | PX COORDINATOR | | 1 | | | | | 0 |00:00:00.22 | 10 | 2 | | | | | 2 | PX SEND QC (RANDOM) | :TQ10001 | 0 | 500 | Q1,01 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | | | | |* 3 | HASH JOIN | | 1 | 500 | Q1,01 | PCWP | | 0 |00:00:00.12 | 128 | 173 | 1483K| 1483K| 2/0/0| | 4 | PX BLOCK ITERATOR | | 2 | 500 | Q1,01 | PCWC | | 500 |00:00:00.03 | 248 | 173 | | | | |* 5 | TABLE ACCESS FULL | T1 | 25 | 500 | Q1,01 | PCWP | | 500 |00:00:00.02 | 248 | 173 | | | | | 6 | BUFFER SORT | | 2 | | Q1,01 | PCWC | | 2000 |00:00:00.07 | 0 | 0 | 108K| 108K| 2/0/0| | 7 | PX RECEIVE | | 2 | 1000 | Q1,01 | PCWP | | 2000 |00:00:00.06 | 0 | 0 | | | | | 8 | PX SEND BROADCAST | :TQ10000 | 0 | 1000 | Q1,00 | P->P | BROADCAST | 0 |00:00:00.01 | 0 | 0 | | | | | 9 | PX BLOCK ITERATOR | | 2 | 1000 | Q1,00 | PCWC | | 1000 |00:00:00.05 | 248 | 173 | | | | |* 10 | TABLE ACCESS FULL| T2 | 25 | 1000 | Q1,00 | PCWP | | 1000 |00:00:00.04 | 248 | 173 | | | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------
Now the rule for reading a DFO tree in a parallel plan is to follow the table queues (a.k.a. DFO numbers / TQ numbers) – but in this case we have a bit of a mixture as the table scan of t1 isn’t used as the input to a distribution so it’s hard to know whether the first physical activity is supposed to be the tablescan of t2 (feeding the zeroth table queue) or the tablescan of t1 which doesn’t feed a table queue until after the hash join.
Given that t2 is being broadcast we should probably expect to see the first set of parallel query slaves starting first with a scan of “random” sections of t1 to build an in-memory hash table, and then see the second set of parallel query slaves scanning t2 and broadcasting the results to the first set of slaves (every slave gets a copy of every row) to do the probe. We can try to confirm this by looking at the trace files. First we extract “db file scattered read” lines from the four trace files and then we sort them by timestamp.
test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 677 file#=5 block#=220 blocks=4 obj#=351218 tim=1551450113700076 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 214 file#=5 block#=225 blocks=3 obj#=351218 tim=1551450113701131 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 238 file#=5 block#=161 blocks=7 obj#=351218 tim=1551450113702026 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 318 file#=5 block#=177 blocks=6 obj#=351218 tim=1551450113703464 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 371 file#=5 block#=190 blocks=2 obj#=351218 tim=1551450113705168 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 227 file#=5 block#=193 blocks=5 obj#=351218 tim=1551450113705690 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 225 file#=5 block#=138 blocks=6 obj#=351218 tim=1551450113706850 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 487 file#=5 block#=131 blocks=5 obj#=351218 tim=1551450113708512 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 228 file#=5 block#=136 blocks=2 obj#=351218 tim=1551450113709161 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 219 file#=5 block#=294 blocks=7 obj#=351218 tim=1551450113710099 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 150 file#=5 block#=184 blocks=6 obj#=351218 tim=1551450113711398 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 229 file#=5 block#=273 blocks=7 obj#=351218 tim=1551450113712456 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 223 file#=5 block#=280 blocks=7 obj#=351218 tim=1551450113713829 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 253 file#=5 block#=168 blocks=7 obj#=351218 tim=1551450113715124 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 1014 file#=5 block#=308 blocks=5 obj#=351218 tim=1551450113717046 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 266 file#=5 block#=235 blocks=5 obj#=351218 tim=1551450113718684 test_p000_5963.trc:WAIT #140673176773096: nam='db file scattered read' ela= 208 file#=5 block#=241 blocks=2 obj#=351218 tim=1551450113719321 ---- test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 326 file#=5 block#=287 blocks=7 obj#=351218 tim=1551450113700003 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 147 file#=5 block#=198 blocks=2 obj#=351218 tim=1551450113701394 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 72 file#=5 block#=200 blocks=5 obj#=351218 tim=1551450113701840 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 459 file#=5 block#=153 blocks=7 obj#=351218 tim=1551450113703442 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 308 file#=5 block#=259 blocks=7 obj#=351218 tim=1551450113705132 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 313 file#=5 block#=266 blocks=7 obj#=351218 tim=1551450113706540 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 304 file#=5 block#=243 blocks=5 obj#=351218 tim=1551450113707925 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 219 file#=5 block#=248 blocks=2 obj#=351218 tim=1551450113708505 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 590 file#=5 block#=228 blocks=4 obj#=351218 tim=1551450113709705 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 178 file#=5 block#=232 blocks=3 obj#=351218 tim=1551450113710270 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 142 file#=5 block#=205 blocks=3 obj#=351218 tim=1551450113711046 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 98 file#=5 block#=209 blocks=4 obj#=351218 tim=1551450113711457 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 179 file#=5 block#=213 blocks=3 obj#=351218 tim=1551450113712308 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 296 file#=5 block#=216 blocks=4 obj#=351218 tim=1551450113712967 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 188 file#=5 block#=301 blocks=7 obj#=351218 tim=1551450113714238 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 207 file#=5 block#=250 blocks=6 obj#=351218 tim=1551450113715393 test_p001_5967.trc:WAIT #140483692405224: nam='db file scattered read' ela= 567 file#=5 block#=146 blocks=6 obj#=351218 tim=1551450113717686 ---- test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 196 file#=5 block#=522 blocks=7 obj#=351219 tim=1551450113722013 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 145 file#=5 block#=564 blocks=5 obj#=351219 tim=1551450113723239 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 317 file#=5 block#=491 blocks=5 obj#=351219 tim=1551450113724578 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 128 file#=5 block#=497 blocks=2 obj#=351219 tim=1551450113725217 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 495 file#=5 block#=402 blocks=6 obj#=351219 tim=1551450113726250 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 168 file#=5 block#=394 blocks=6 obj#=351219 tim=1551450113727928 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 249 file#=5 block#=424 blocks=7 obj#=351219 tim=1551450113730384 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 120 file#=5 block#=476 blocks=4 obj#=351219 tim=1551450113731225 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 485 file#=5 block#=481 blocks=3 obj#=351219 tim=1551450113732064 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 594 file#=5 block#=387 blocks=5 obj#=351219 tim=1551450113734168 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 323 file#=5 block#=392 blocks=2 obj#=351219 tim=1551450113735040 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 121 file#=5 block#=506 blocks=6 obj#=351219 tim=1551450113736253 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 285 file#=5 block#=440 blocks=6 obj#=351219 tim=1551450113739084 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 345 file#=5 block#=433 blocks=6 obj#=351219 tim=1551450113740940 ---- test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 155 file#=5 block#=484 blocks=4 obj#=351219 tim=1551450113721738 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 131 file#=5 block#=488 blocks=3 obj#=351219 tim=1551450113722260 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 143 file#=5 block#=557 blocks=7 obj#=351219 tim=1551450113723245 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 138 file#=5 block#=454 blocks=2 obj#=351219 tim=1551450113724545 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 124 file#=5 block#=456 blocks=5 obj#=351219 tim=1551450113724954 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 243 file#=5 block#=469 blocks=3 obj#=351219 tim=1551450113725947 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 523 file#=5 block#=472 blocks=4 obj#=351219 tim=1551450113726745 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 178 file#=5 block#=409 blocks=7 obj#=351219 tim=1551450113727781 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 230 file#=5 block#=543 blocks=7 obj#=351219 tim=1551450113729470 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 488 file#=5 block#=515 blocks=7 obj#=351219 tim=1551450113731618 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 356 file#=5 block#=461 blocks=3 obj#=351219 tim=1551450113733840 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 185 file#=5 block#=465 blocks=4 obj#=351219 tim=1551450113735195 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 284 file#=5 block#=536 blocks=7 obj#=351219 tim=1551450113736172 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 204 file#=5 block#=499 blocks=5 obj#=351219 tim=1551450113737957 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 306 file#=5 block#=504 blocks=2 obj#=351219 tim=1551450113738871 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 281 file#=5 block#=446 blocks=2 obj#=351219 tim=1551450113739710 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 158 file#=5 block#=449 blocks=5 obj#=351219 tim=1551450113740159 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 99 file#=5 block#=417 blocks=7 obj#=351219 tim=1551450113741709 tim=1551450113698677 obj#=351219 tim=1551450113698687 obj#=351219 ---- tim=1551450113700003 obj#=351218 tim=1551450113700076 obj#=351218 tim=1551450113701131 obj#=351218 tim=1551450113701394 obj#=351218 tim=1551450113701840 obj#=351218 tim=1551450113702026 obj#=351218 tim=1551450113703442 obj#=351218 tim=1551450113703464 obj#=351218 tim=1551450113705132 obj#=351218 tim=1551450113705168 obj#=351218 tim=1551450113705690 obj#=351218 tim=1551450113706540 obj#=351218 tim=1551450113706850 obj#=351218 tim=1551450113707925 obj#=351218 tim=1551450113708505 obj#=351218 tim=1551450113708512 obj#=351218 tim=1551450113709161 obj#=351218 tim=1551450113709705 obj#=351218 tim=1551450113710099 obj#=351218 tim=1551450113710270 obj#=351218 tim=1551450113711046 obj#=351218 tim=1551450113711398 obj#=351218 tim=1551450113711457 obj#=351218 tim=1551450113712308 obj#=351218 tim=1551450113712456 obj#=351218 tim=1551450113712967 obj#=351218 tim=1551450113713829 obj#=351218 tim=1551450113714238 obj#=351218 tim=1551450113715124 obj#=351218 tim=1551450113715393 obj#=351218 tim=1551450113717046 obj#=351218 tim=1551450113717686 obj#=351218 tim=1551450113718684 obj#=351218 tim=1551450113719321 obj#=351218 -- tim=1551450113721738 obj#=351219 tim=1551450113722013 obj#=351219 tim=1551450113722260 obj#=351219 tim=1551450113723239 obj#=351219 tim=1551450113723245 obj#=351219 tim=1551450113724545 obj#=351219 tim=1551450113724578 obj#=351219 tim=1551450113724954 obj#=351219 tim=1551450113725217 obj#=351219 tim=1551450113725947 obj#=351219 tim=1551450113726250 obj#=351219 tim=1551450113726745 obj#=351219 tim=1551450113727781 obj#=351219 tim=1551450113727928 obj#=351219 tim=1551450113729470 obj#=351219 tim=1551450113730384 obj#=351219 tim=1551450113731225 obj#=351219 tim=1551450113731618 obj#=351219 tim=1551450113732064 obj#=351219 tim=1551450113733840 obj#=351219 tim=1551450113734168 obj#=351219 tim=1551450113735040 obj#=351219 tim=1551450113735195 obj#=351219 tim=1551450113736172 obj#=351219 tim=1551450113736253 obj#=351219 tim=1551450113737957 obj#=351219 tim=1551450113738871 obj#=351219 tim=1551450113739084 obj#=351219 tim=1551450113739710 obj#=351219 tim=1551450113740159 obj#=351219 tim=1551450113740940 obj#=351219 tim=1551450113741709 obj#=351219
I’ve got four slave processes p000 through to p0003, and in the top part of the output you can see that p000/p001 scan object 351218 (after several experiments the object_id for t1 has climbed a bit), and p002/p003 scan object 351219 (t2). Sorting by the timestamps (tim=), though, we can see that there are two reads for t2 (351219) followed by many reads of t1 (351218), finishing with a load of reads of t2. Cross-checking the timestamps on the sorted output with the original extract we can see that the two initial reads of t2 are the first reads by p002 and p003 respectively.
tim=1551450113698677 obj#=351219 test_p002_5971.trc:WAIT #139968162111976: nam='db file scattered read' ela= 260 file#=5 block#=529 blocks=7 obj#=351219 tim=1551450113698677 tim=1551450113698687 obj#=351219 test_p003_5975.trc:WAIT #140601219817960: nam='db file scattered read' ela= 261 file#=5 block#=550 blocks=7 obj#=351219 tim=1551450113698687
It really does look as if at run time Oracle does the scan and build for t1 first but starts with a quick peek at t2 to check that it’s going to return some data. We can be a little more fussy with this test – let’s change some of the data in t2 so that Oracle doesn’t find any rows in the first two scattered reads and see if anything changes.
This is going to take a little fiddling but in my case I can see that the first reads by slaves p002/p003 are looking at file 5, blocks 529 – 535 and blocks 550 – 556; and broadly speaking the scattered reads are working backwards down the table – so I’m going to update a load of data that appears in the second half of the table on the next run of the test with a simple command to put the n10 column value out of range:
update t2 set n10 = 0 where n10 = 7 and id > 750; commit; -- test the query update t2 set n10 = 0 where n10 = 7; commit; -- test the query
After the first update the pattern of reads (in time order) changed do:
- 22 scattered reads of t2
- 34 scattered reads of t1
- 12 scattered reads of t2
This really does look like Oracle checking whether t2 will have any data before doing the full scan and build from t1.
Unfortunately when I did the final update that eliminated all the interesting data from t2 the pattern (in time order) turned into:
- 34 scattered reads of t2
- 34 scattered reads of t1
In other words – having discovered NO data in t2 that could join Oracle still scanned and built the hash table from t1. So maybe there’s a reason other than a sanity check for peeking at the second table, or maybe there’s a bug in the extreme case where the second table produces no data at all. (For reference you might also want to see an older note on how a parallel hash join doesn’t skip the probe table when the build table is empty.)
Just as a quick check that I wasn’t chasing an old bug at this point I repeated the tests on 18.3.0.0 – the same patterns emerged.
My tweet announcing this post resulted in some interesting discussion about how Postgres handles hash joins (including the painful discovery that Oracle and Postgres present the build and hash tables the opposite way up in execution plans). The thread starts here.
The post also resulted in a private email conversation which focused on the boundary cases of the optimisation – starting with the question:
The guide to perfect design says you should work out all the boundary conditions and special cases before starting to create the program. (Though an alternative guide to perfect programming says you should code (and then destroy) the first approximation to find out how hard the simple bit is going to be before you try to cater for the complete requirement.) In the case of the hash join the underlying question is how to avoid doing work that could have been avoided if only you’d known in advance there was no data.
So – the boundary conditions are:
An important point to remember when reviewing the problem is that the (Oracle) optimizer will select the build and probe base on volume of data, the decision isn’t based on comparing the costs of acquiring the individual rowsources. (And having said that I wonder how many more cases I ought to test to see if this is a simplification of a more subtle rule.)
So a more careful strategy for getting as lucky as possible seems to be:
I suppose a refinement on this would be to consider calculating the cost of finding the first row from each table and choosing the first scan based on the first_rows(1) costing – but the I’ve been saying “table” when really the two child operations could be complex sub-plans that would require a lot of work to calculate first row costs.
Comment by Jonathan Lewis — March 4, 2019 @ 10:47 am GMT Mar 4,2019 |