Here’s a quirky little set of results from trace files. First a 10391 at level 64 – this shows the way in which an object was broken into granules for a parallel tablescan. I’ve just listed the first few granules:
Distributors associated to that generator: dist:0 nbgra:52 nid:65535 nbslv:4 size:5131 filedist:0 filno:0 nbgra:0 filedist:1 filno:6 nbgra:52 gnum:0 sz:99 pnum:1048576 rid:(file:6 blk:12..130) gnum:1 sz:99 pnum:1048576 rid:(file:6 blk:131..235) gnum:2 sz:99 pnum:1048576 rid:(file:6 blk:236..341) gnum:3 sz:99 pnum:1048576 rid:(file:6 blk:342..446)
Now, with just a little cosmetic enhancement, the 10046 trace at level 8 of the parallel server process that was given the first granule to process. You can tell from the parameter naming that this is a 10g trace file. Note especially the order in which the blocks are visited.
WAIT #1: nam='PX Deq: Execution Msg' ela= 24415 sleeptime/senderid=268566527 passes=1 p3=0 obj#=-1 tim=1730298979 WAIT #1: nam='direct path read' ela= 18913 file number=6 first dba= 12 block cnt= 5 obj#=59196 tim=1730318924 WAIT #1: nam='direct path read' ela= 5431 file number=6 first dba= 37 block cnt= 4 obj#=59196 tim=1730324753 WAIT #1: nam='direct path read' ela= 5017 file number=6 first dba= 49 block cnt= 8 obj#=59196 tim=1730330528 WAIT #1: nam='direct path read' ela= 6651 file number=6 first dba= 82 block cnt= 7 obj#=59196 tim=1730338312 WAIT #1: nam='direct path read' ela= 5033 file number=6 first dba= 97 block cnt= 8 obj#=59196 tim=1730343998 WAIT #1: nam='direct path read' ela= 2711 file number=6 first dba=114 block cnt= 7 obj#=59196 tim=1730348009 WAIT #1: nam='direct path read' ela= 8746 file number=6 first dba=129 block cnt= 2 obj#=59196 tim=1730357846 WAIT #1: nam='direct path read' ela= 66938 file number=6 first dba= 69 block cnt=12 obj#=59196 tim=1730425271 WAIT #1: nam='direct path read' ela= 80266 file number=6 first dba= 90 block cnt= 7 obj#=59196 tim=1730506969 WAIT #1: nam='direct path read' ela= 4849 file number=6 first dba=105 block cnt= 8 obj#=59196 tim=1730512609 WAIT #1: nam='direct path read' ela= 2831 file number=6 first dba=122 block cnt= 7 obj#=59196 tim=1730516537 WAIT #1: nam='direct path read' ela= 150077 file number=6 first dba= 29 block cnt= 4 obj#=59196 tim=1730667253 WAIT #1: nam='direct path read' ela= 47274 file number=6 first dba= 41 block cnt= 8 obj#=59196 tim=1730715604 WAIT #1: nam='direct path read' ela= 6 file number=6 first dba= 21 block cnt= 4 obj#=59196 tim=1730716822 WAIT #1: nam='direct path read' ela= 42396 file number=6 first dba= 57 block cnt= 8 obj#=59196 tim=1730759816 WAIT #1: nam='PX Deq: Execution Msg' ela= 103 sleeptime/senderid=268566527 passes=1 p3=0 obj#=59196 tim=1730760814
As you can see, this stream of waits corresponds to a PX slave waiting to be told to process the first granule (PX Deq: Execution Msg), and ends with the slave waiting to be told which granule to process next.
But why are the direct path reads so small and in such a funny order? After all, the slave was given a stream of 99 consecutive blocks – and even that’s a little odd because if you count from block 12 to block 130 you get rather more than 99 blocks. Moreover, there seem to be some blocks that didn’t get read.
The answer lies in the title to this piece. I have created the table in a tablespace with system managed extents (the auto allocation type for locally managed tablespaces). In a “clean” datafile, the effect of this is that if you create a table and start to insert data into it, the first 16 extents will be allocated at 64KB, the next 63 at 1MB, followed by 120 at 8MB, before Oracle finally switches to 64MB extents. (Current observations, not documented by Oracle).
However, if you run a parallel “insert /*+ append */”, every parallel execution slave thinks it is populating its own private data segment, so each slave will generate 16 extents of 64KB, then 63 extents of 1MB, and so on. Moreover, if you use 3 separate “insert /*+ append */” statements to load the table, each statement starts the cycle all over again.
On top of that, when you use autoallocate, the last stage in the insertion process does extent trimming to release the trailing free space from the last extent allocated by each slave.
Oracle tries very hard to avoid problems with this extent trimming – it’s done to the nearest “unit size”, i.e. to the next 64KB, 1MB, or 8MB boundary above the actual limit of the data. This has the effect of avoiding randomly sized holes all over the tablespace but it will still leave several holes of varying sizes.
Now assume you have a very large tablespace, are running with 32 CPUs and default parallelism, frequently drop, re-create, and populate objects, even dropping and recreating several objects concurrently. If you do this you will probably find that you gradually accumulate lots of little holes of all sorts of sizes all over the tablespace.
Oracle does try to use up these holes as fast as possible – but only by allocating extents of the legal unit sizes, starting from the start of the file and working along it in order. So if you have lots of little holes scattered through the tablespace, and lots of parallel slaves busily soaking them up, you end up with situations like the above: 108 consecutive blocks in one file consisting of 14 chunks of 8 blocks allocated by 14 different parallel execution slaves, and read in order of extent_id when a parallel tablescan starts.
Mixing very large autoallocate tablespaces with repeated parallel create/insert at high degrees of parallelism can result in objects acquiring a very large number of small extents – which can have a surprising impact on subsequent parallel queries.
Update August 2009:
Christian Antognini reports an enhancement arriving (silently) in 126.96.36.199 that addresses this issue.