Oracle Scratchpad

August 24, 2016

Basicfile LOBs 4

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 7:02 pm GMT Aug 24,2016

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks to a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and the worst threats from misuse of the feature. So let’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same freepool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like something you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem.
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data, you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks ? In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20 (the sort of storage clause sketched after this list). If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single-threaded deletion strategy has just brought one of them back.
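
For reference, the sort of storage clause all these bullets assume looks something like the following – purely a sketch: the table, column and segment names are arbitrary, the parameter values are illustrative, and freepools/retention need an ASSM tablespace.

create table t1 (
        id      number,
        c1      clob
)
lob (c1) store as basicfile c1_lob (
        disable storage in row
        nocache logging
        retention
        freepools 20
)
;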

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty” (nrow=81, rrow=0) leaf blocks at the low end of all the other pools. After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):


----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
---
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
---
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
---
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
---
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump
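
(For anyone who wants to reproduce this type of dump – it comes from the standard treedump event, passing the object_id of the LOB index; something like the following, where the index name is a placeholder you would have to look up for your own table.)

column object_id new_value m_object_id

select  object_id
from    user_objects
where   object_name = 'SYS_IL0000012345C00002$$'        -- placeholder: name of the LOBINDEX
;

alter session set events 'immediate trace name treedump level &m_object_id';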

Highlights

The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but, apart from the top freepool, no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leaf blocks that are now empty (rrow=0) but were only holding a few index entries (nrow=1-4 rather than the 75 – 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising that the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used ? If it’s a fluke will a similar fluke also reappear (eventually) on the other freepools ? Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but, before we make it, let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s easy to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.
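
(If you want to see the root block for yourself, the data block address on the “branch” line of the treedump converts easily to a file and block number – a sketch, assuming you have the privileges needed to dump blocks:)

rem     25166340 is the decimal form of the root branch address 0x1800204

select
        dbms_utility.data_block_address_file(25166340)          file_no,
        dbms_utility.data_block_address_block(25166340)         block_no
from    dual
;

alter system dump datafile 6 block 516;         -- the values returned by the query above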

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single-threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details: the amount of time spent in contention (typically TX waits of some sort) could tell us whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193
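
(For reference, raw figures of this kind come straight from the dynamic performance views; the following is only a sketch of the sort of queries involved, not the snapshot code I actually used – you have to capture the values before and after the test and report the differences, and the &m_sid substitution variable is a placeholder for each session’s SID.)

select  event, total_waits, total_timeouts,
        time_waited, average_wait, max_wait
from    v$session_event
where   sid = &m_sid
and     event in ('enq: HW - contention', 'db file sequential read')
;

select  sn.name, st.value
from    v$sesstat st, v$statname sn
where   st.sid        = &m_sid
and     sn.statistic# = st.statistic#
and     sn.name in (
                'physical reads', 'physical reads direct',
                'session logical reads',
                'data blocks consistent reads - undo records applied'
        )
;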

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first surprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads ? The LOB was declared as nocache so any reads ought to be direct path reads, and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats, which show us that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded as “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8,192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle it was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other sessions.
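
(The space check comes from the dbms_space package – a call along the following lines is the sort of thing involved, though this is only a sketch: the segment name is a placeholder and, depending on exactly which figures you want, dbms_space.unused_space is the other obvious procedure to call.)

declare
        m_unformatted_blocks    number;         m_unformatted_bytes     number;
        m_fs1_blocks            number;         m_fs1_bytes             number;
        m_fs2_blocks            number;         m_fs2_bytes             number;
        m_fs3_blocks            number;         m_fs3_bytes             number;
        m_fs4_blocks            number;         m_fs4_bytes             number;
        m_full_blocks           number;         m_full_bytes            number;
begin
        dbms_space.space_usage(
                user, 'C1_LOB', 'LOB',          -- owner, LOB segment name (placeholder), segment type
                m_unformatted_blocks, m_unformatted_bytes,
                m_fs1_blocks, m_fs1_bytes,
                m_fs2_blocks, m_fs2_bytes,
                m_fs3_blocks, m_fs3_bytes,
                m_fs4_blocks, m_fs4_bytes,
                m_full_blocks, m_full_bytes
        );
        dbms_output.put_line('Full blocks:        ' || m_full_blocks);
        dbms_output.put_line('Unformatted blocks: ' || m_unformatted_blocks);
end;
/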

Was this a one-off, or a repeatable event ? How bad could it get ?

TO BE CONTINUED.

Post-script

Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile ? You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.

Update

A thought about knowing the freepool – you could always add the freepool number as a column to the base table and have a “before row insert” trigger (or some other coding strategy) that populated the column with the freepool id. This would require the application to know a little too much about the implementation and about the internals, and would require a code change somewhere if anyone decided to change freepools. (Possibly VPD/RLS/FGAC could be used to hide the details from the outside world). The code to delete LOBs could then delete only those rows that matched its current freepool. WARNING – this is an idle thought, probably not relevant to (most) people, and there may be better strategies to make the effort irrelevant.
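
(Purely as a sketch of that idle thought – and leaning on the unverified assumption that the freepool a session uses is simply its Oracle process number taken modulo the number of freepools – the trigger might look something like this; the table name, column name and the literal 4 are all placeholders:)

alter table t1 add freepool_id number;

create or replace trigger t1_bri
before insert on t1
for each row
declare
        m_pid   number;
begin
        --  needs select granted on v_$session and v_$process
        select  p.pid
        into    m_pid
        from    v$session s, v$process p
        where   s.sid  = sys_context('userenv','sid')
        and     p.addr = s.paddr
        ;

        --  ASSUMPTION: freepool = mod(process number, number of freepools)
        :new.freepool_id := mod(m_pid, 4);
end;
/

If this were ever used in anger you’d probably want to work out the value once per session (a logon trigger, or a packaged variable) rather than querying v$process for every row inserted.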

Following a little accident with another test I noticed a pattern that suggested that you would only reuse space on your own freepool and might, at the same time, move some of your freepool into segment freespace. After a couple of experiments, though, it looks as if you use the reusable space in your freepool; if you have no reusable space left in your freepool you start using the segment freespace (ignoring the reusable space in other freepools); if there is no free space in the segment you raid EVERY OTHER FREEPOOL and transfer all their reusable space to segment freespace – which is bad news if there are 200,000 reusable chunks in their freepools. Unfortunately, at the same time, you add an extent to the segment and insert your new LOB into that extent. There’s still plenty of scope for refining details here; I’ve only tried pushing one session into raiding the other freepools, so I don’t know what happens if there are multiple sessions trying to create LOBs while the raid is going on, who blocks whom, for how long, and what happens about new extents etc. Clearly, though, there’s plenty of scope for nasty things to happen.
