Oracle Scratchpad

August 24, 2016

Basicfile LOBs 4

Filed under: Infrastructure,LOBs,Oracle,Uncategorized — Jonathan Lewis @ 7:02 pm BST Aug 24,2016

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks into a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and the worst threats from misuse of the feature. So let’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same free pool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like something you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem.
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB data (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks ? In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20. If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single thread deletion strategy has just brought one of them back.

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty” (nrow=81, rrow=0) leaf blocks at the low end of all the other pools. After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):


----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
---
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
---
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
---
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
---
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump

Highlights

The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but apart from the top freepool no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leaf blocks that are now empty (rrow=0) but were only holding a few index entries (nrow=1-4 rather than the 75 – 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising that the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used; if it’s a fluke will a similar fluke also reappear (eventually) on the other freepools ? Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but before we make it let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s easy to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details, the amount of time spent in contention (typically TX waits of some sort) could tell us whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193
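
I haven’t included the snapshot scripts themselves; as a rough substitute, here’s a minimal sketch of the kind of query that picks up the same figures for the current session (run it before and after the test and subtract the two sets of results – the event and statistic names are the ones reported above):

rem
rem     Sketch only - not the original snapshot code
rem

select  event, total_waits, total_timeouts, time_waited
from    v$session_event
where   sid = (select sid from v$mystat where rownum = 1)
and     event in ('enq: HW - contention', 'db file sequential read')
;

select  sn.name, ms.value
from    v$mystat ms, v$statname sn
where   sn.statistic# = ms.statistic#
and     sn.name in (
                'physical reads',
                'physical reads direct',
                'session logical reads',
                'data blocks consistent reads - undo records applied'
        )
order by
        sn.statistic#
;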

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first surprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads: the LOB was declared as nocache so any reads ought to be direct path reads and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats, which show us that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded as “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle it was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other sessions.
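
I haven’t reproduced the dbms_space call here, but a minimal sketch of the idea – using dbms_space.space_usage (the ASSM-friendly call) against the LOB segment name recorded in user_lobs; exactly how the various “freespace” buckets should be interpreted for a LOB segment is something to check for yourself – would be:

declare
        m_segment_name  varchar2(128);
        m_unf           number;         m_unf_b         number;
        m_fs1           number;         m_fs1_b         number;
        m_fs2           number;         m_fs2_b         number;
        m_fs3           number;         m_fs3_b         number;
        m_fs4           number;         m_fs4_b         number;
        m_full          number;         m_full_b        number;
begin
        select  segment_name
        into    m_segment_name
        from    user_lobs
        where   table_name  = 'T1'
        and     column_name = 'C1'
        ;

        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => m_segment_name,
                segment_type            => 'LOB',
                unformatted_blocks      => m_unf,
                unformatted_bytes       => m_unf_b,
                fs1_blocks              => m_fs1,  fs1_bytes => m_fs1_b,
                fs2_blocks              => m_fs2,  fs2_bytes => m_fs2_b,
                fs3_blocks              => m_fs3,  fs3_bytes => m_fs3_b,
                fs4_blocks              => m_fs4,  fs4_bytes => m_fs4_b,
                full_blocks             => m_full, full_bytes => m_full_b
        );

        dbms_output.put_line('Unformatted blocks : ' || m_unf);
        dbms_output.put_line('FS1 blocks         : ' || m_fs1);
        dbms_output.put_line('FS2 blocks         : ' || m_fs2);
        dbms_output.put_line('FS3 blocks         : ' || m_fs3);
        dbms_output.put_line('FS4 blocks         : ' || m_fs4);
        dbms_output.put_line('Full blocks        : ' || m_full);
end;
/

(user_segments will give you the total block and extent counts for the segment at the same time.)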

Was this a one-off, or a repeatable event ? How bad could it get ?

TO BE CONTINUED.

Post-script

Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile ? You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.

Update

A thought about knowing the freepool – you could always add the freepool number as a column to the base table and have a “before row insert” trigger (or some other coding strategy) that populated the column with the freepool id. This would require the application to know a little too much about the implementation and about the internals, and would require a code change somewhere if anyone decided to change freepools. (Possibly VPD/RLS/FGAC could be used to hide the details from the outside world). The code to delete LOBs could then delete only those rows that matched its current freepool. WARNING – this is an idle thought, probably not relevant to (most) people, and there may be better strategies to make the effort irrelevant.
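
Just to flesh out the idle thought a little, here’s a minimal sketch (not production code): the column name is invented, the “4” has to match the declared number of freepools, it uses the same mod(pid,4) proxy for the freepool that I used when setting up the concurrent test sessions, and the trigger owner needs direct grants on v_$process, v_$session and v_$mystat for the trigger to compile:

alter table t1 add freepool_id number;

create or replace trigger t1_bri
before insert on t1
for each row
begin
        select  mod(prc.pid, 4)
        into    :new.freepool_id
        from    v$process prc, v$session ses
        where   prc.addr = ses.paddr
        and     ses.sid  = (select sid from v$mystat where rownum = 1)
        ;
end;
/

The overnight delete could then add “and freepool_id = …” (using its own mod(pid,4) value) to its where clause so that each delete process clears out only the LOBs attached to its own freepool.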

Following a little accident with another test I noticed a pattern that suggested that you would only reuse space on your own freepool and might, at the same time, move some of your freepool into segment freespace. After a couple of experiments though it looks as if you use the reusable space in your freepool; if you have no reusable space left in your freepool you start using the segment freespace (ignoring the reusable space in other freepools); if there is no free space in the segment you raid EVERY OTHER FREEPOOL and transfer all their reusable space to segment freespace – which is bad news if there are 200,000 reusable chunks in their freepools. Unfortunately, at the same time, you add an extent to the segment and insert your new LOB into that extent. There’s still plenty of scope for refining details here; I’ve only tried pushing one session into raiding the other freepools, so I don’t know what happens if there are multiple sessions trying to create LOBs while the raid is going on, who blocks whom, for how long, and what happens about new extents etc. Clearly, though, there’s plenty of scope for nasty things to happen.


August 22, 2016

Basicfile LOBS 3

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 5:56 pm BST Aug 22,2016

In the previous article in this mini-series I described how the option for setting freepools N when defining Basicfile LOBs was a feature aimed at giving you improved concurrency for inserts and deletes that worked by splitting the LOBINDEX into 2N sections: N sections to index the current LOB chunks by LOB id, alternating with N sections to map the reusable LOB chunks by deletion time.

In this article we’ll look a little further into the lifecycle of the LOB segment but before getting into the details I’ll just throw out a couple of consequences of the basic behaviour of LOBs that might let you pick the best match for the workload you have to deal with.

  • If you have enabled storage in row the first 12 chunks of a lob will be identified by the LOB Locator stored in the row, so if all your LOBs are sized between 4KB and 96KB (approximately) the LOB Index will consist only of entries for the reusable LOB space due to deleted LOBs even though the LOBs themselves will be stored out of line. This makes it look like a good idea to enable storage in row even when you expect all of your (smallish) LOBs to be stored out of row.
  • It’s quite nice to cache LOBs (at least for a little while) if your pattern of use means you access a specific LOB for a little while before it ceases to be interesting; but LOBs can swamp a large fraction of the buffer cache if you’re not careful. If you expect to follow this pattern of behaviour you might define a RECYCLE cache and then assign the LOB to that cache so that you get the benefits of caching while still protecting the main volume of your buffer cache.
  • Depending on the expected size of your LOBs you may have a good justification for creating a tablespace of a non-standard size for the LOB segment so that it takes fewer block reads to read the entire LOB. If (for example) you have a LOB which is always in the range of 62KB then a tablespace with a blocksize of 32KB would be a good choice because the LOB could be read with just two block reads. A fringe benefit of the non-standard block size, of course, is that you have to define a non-standard cache, which separates the LOB activity from the rest of the buffer cache. (Note: Oracle reads LOBs one chunk at a time, so the number of LOB reads – as opposed to block reads – for a 32KB chunk is the same whether the block size is 8KB or 32KB) There’s a small sketch of this setup just after this list.
  • If you’re going to be logging your LOBs then remember that nocache LOBs will write entire chunks into the redo logs – think about how much extra redo this might generate: it might be better to have a small recycle cache and cache your LOBs, since cached LOBs are logged at the byte level. (You don’t want a 32KB block size, nocache, logging if your LOBs are all 33KB).
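
Here’s the sketch promised above for the non-standard blocksize idea – the names and sizes are purely illustrative, and the matching non-standard buffer cache has to exist before the tablespace can be created:

alter system set db_32k_cache_size = 64m;

create tablespace test_32k_assm
        datafile '/u01/oradata/test_32k_assm_01.dbf' size 1g
        blocksize 32k
        extent management local
        segment space management auto
;

The LOB could then be declared with (or moved to) “tablespace test_32k_assm chunk 32k” in its storage clause.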

The LOB lifetime

Before deciding on the suitability of a feature the first thing to do is define what you’re trying to achieve so that you can think realistically about where the threats may be and what tests are going to be important – so I’m going to describe a scenario, then talk about what threats might appear based on the current details I’ve given about Basicfile LOBs and freepools.

  • We have many processes inserting “small” (16KB to 24KB) LOBs concurrently in bursts during the day.
  • Typically we peak at about 20 processes inserting at the same moment, and we end up with about 100K new LOBs per day though this varies between 50K and 200K.
  • The inserts are all “insert one row; commit”.
  • The LOBs have to be kept for 28 days, after which they (the rows that hold them) are deleted by an overnight batch job.
  • The LOBs have to be logged and the database is running in archivelog mode.

As soon as you see the “aged 28 days” you might immediately think “partitioning” (though perhaps your first thought might be that restaurant in Cincinnati airport that hangs its beef to air-dry for 28 days before cooking). Unfortunately not everyone has licensed the partitioning option, so what do you have to worry about when you start to design for this requirement ? (We’re also going to assume that securefiles are going to be saved for another blog mini-series).

Clearly we should make use of multiple freepools to avoid the insert contention on the LOBINDEX. With about 20 concurrent processes we might immediately go for freepools 20, but we might decide that a smaller number like 4 or 8 is sufficient. We probably ought to do some tests to see if we can discover any penalties for larger numbers of freepools, and to see what sort of contention we get with a smaller number of freepools.

We got a hint from the previous article that when a process deletes a LOB it indexes the reusable chunks in the same freepool as it inserts LOBs – at least, that’s what seemed to happen in our little test case in the previous article. Does Oracle always follow this pattern, or will a multi-row delete, or a large number of single “delete;commit;” cycles spread the reusable chunks evenly across all the available freepools ? If you do a single large delete do you end up with all the reusable space in one freepool – if so, does it matter or should we have multiple processes do our “big batch delete” ?

On second thoughts, my little demo showed that when you insert a LOB into freepool X and then delete it the reusable space goes into freepool X. Maybe I’ve misinterpreted the test and need to do a better test; maybe the reusable space goes into the freepool that the LOB was originally attached to, not into the freepool dictated by the process id. That would mean that a bulk delete would tend to spread the LOBs across all the freepools – which means if you used multiple processes to delete data they might cause contention on the “reusable” segments of the LOBINDEX.

If we do a single large delete and all the reusable chunks go into the same freepool what happens when we start inserting new LOBs ? If the LOB segment is “full” is it only the processes associated with that one freepool that can use the reusable space, or will EVERY process start to raid the freepool that has the only reusable space ? If the latter, have we done nothing by using multiple freepools except postpone (by roughly 28 days) the moment when we start to get contention on our LOBINDEX ?

Fortunately if we’ve made some poor choices in the original design Oracle does allow us to “rebuild freepools”, and even change the number of freepools:


alter table t1 modify lob (c1) (rebuild freepools);
alter table t1 modify lob (c1) (freepools (3));

Mind you, there is a little note on MoS that rebuilding freepools “may take some time” and locks the table in exclusive mode while it’s going on. So perhaps we should check to see how the rebuild works, and try to figure out how long it might take. A maxim for dealing with very large objects is that you really want to get it right first time because it’s hard to test the effects of change especially since you probably end up wanting to do your final tests on a backup copy of the production system.

Getting Started

I’ve specified 100K LOBs per day, sized between 16KB and 24KB, kept for 28 days – that’s about 50 GB, and I don’t really want to sit waiting for Oracle to build that much data while running 20 concurrent processes that are logging and generating archived redo log. (Especially since I may want to repeat the exercise two or three times with different numbers of freepools.) I’m going to start small and grow the scale when it’s necessary.

I’ll start with 4 concurrent processes inserting 1,000 LOBs each, sized at 12KB, with freepools 4, and I’ll rig the system very carefully so that each process uses a different freepool. After that I’ll run a single batch delete to delete the first 3,000 LOBs – I’ll pick a process that ought to use freepool 1 or 2 (i.e. not 0 or 3, the “end” freepools); then I’ll repeat the insert cycle but insert just 750 LOBs per process. At various points in this sequence of events I’ll stop and dump some index blocks and look at some stats to see if I can spot any important patterns emerging.

Once I’ve got through that cycle I’ll decide what to do next – the first set of results may produce some important new questions – but I’m guessing that I’ll probably end up repeating the “delete / insert” cycle at least one more time.

Here’s a little code to create a suitable table:


create sequence s1 cache 10000;

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile
    text_lob(
            disable storage in row
            chunk 8k
            retention
            nocache
            freepools 4
            tablespace test_8k_assm
)
;

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 0..0 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/

I’ve inserted a row to make sure that all the objects appear in all the right places. The code I’ve used to do this insert is a version of the code that I’m going to use for the concurrency testing but restricted to insert one row with an id of zero. In the concurrency test I’ll make use of the sequence I’ve created to act as the primary key to the table.

Having created the table I then start four more sessions, carefully ensuring that they will each pick a different freepool. To make sure I had one session per freepool I just kept connecting sessions and running a silly little check for each session’s process id (pid) until I had four that returned each of the values from 0 to 3:

select mod(pid,4) from v$process where addr = (
        select paddr from v$session where sid = (
                select sid from v$mystat where rownum = 1
        )
)
;

Once I had the four extra sessions set up, I issued a simple “lock table t1 in exclusive mode” from my original session then started the following script in each of the other four:


spool temp&1

declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..1000 loop
                insert into t1 values (s1.nextval, m_v1);
                commit;
        end loop;
end;
/

spool off

(I supplied A, B, C, and D as the first parameter to the script so that I got four sets of output, but I haven’t included the code I used to get a snapshot of the session stats, session waits, and system enqueues recorded by each session.)

First check – did I get all four freepools evenly used (which is what I had assumed would happen when I chose the 4 process ids so carefully) ? I can check this by doing a block dump of the LOBINDEX root block because with 4,001 entries I’m (almost certainly) going to get a root block, no further branch levels, and a few dozen leaf blocks.
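
(A reminder of the mechanics: the data block address reported for the root block in a treedump – 25166340, which is 0x1800204, in my case – can be translated into a file and block number with a couple of calls to dbms_utility before you do the dump.)

select
        dbms_utility.data_block_address_file(25166340)          file_no,
        dbms_utility.data_block_address_block(25166340)         block_no
from
        dual
;

alter system dump datafile 6 block 516;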

As with all B-tree indexes the “row directory” of the root block will contain a list of “truncated” key values that allow Oracle to search down to the correct block in the next layer of the index so I’m going to extract just the key values, and only the first column of those keys in the same way that I did with the previous article. This means every line in the following output shows you, in order, the first LOB id (with a few of them truncated) in each leaf block:

col 0; len 10; (10):  00 00 00 01 00 00 09 da fe a7
col 0; len 10; (10):  00 00 00 01 00 00 09 db 00 24
col 0; len 10; (10):  00 00 00 01 00 00 09 db 01 6f
col 0; len 10; (10):  00 00 00 01 00 00 09 db 02 ec
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 04
col 0; len 10; (10):  00 00 00 01 00 00 09 db 05 7c
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 2b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 07 e0
col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
col 0; len  9; ( 9):  00 00 00 01 00 00 09 db 0b
col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc

col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
col 0; len 10; (10):  00 02 00 01 00 00 09 da fe 81
col 0; len 10; (10):  00 02 00 01 00 00 09 db 00 62
col 0; len 10; (10):  00 02 00 01 00 00 09 db 01 ad
col 0; len 10; (10):  00 02 00 01 00 00 09 db 02 94
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 04 f8
col 0; len 10; (10):  00 02 00 01 00 00 09 db 06 11
col 0; len 10; (10):  00 02 00 01 00 00 09 db 07 f2
col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
col 0; len  9; ( 9):  00 02 00 01 00 00 09 db 0b

col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
col 0; len 10; (10):  00 04 00 01 00 00 09 da fe 4c
col 0; len 10; (10):  00 04 00 01 00 00 09 da ff c9
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 01
col 0; len 10; (10):  00 04 00 01 00 00 09 db 01 f8
col 0; len 10; (10):  00 04 00 01 00 00 09 db 03 75
col 0; len 10; (10):  00 04 00 01 00 00 09 db 04 5c
col 0; len 10; (10):  00 04 00 01 00 00 09 db 06 3d
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 07
col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
col 0; len  9; ( 9):  00 04 00 01 00 00 09 db 0b

col 0; len  2; ( 2):  00 06
col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
col 0; len  9; ( 9):  00 06 00 01 00 00 09 db 0a
col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a

As you can see, we have the expected pattern (for 4 freepools) of entries starting with (00 00), (00 02), (00 04), and (00 06); but you might wonder why there are 11 leaf blocks for 00, 12 leaf blocks for 02 and 04, and only 8 leaf blocks for 06. We can answer the 11/12 anomaly by remembering that any branch blocks will have a “leftmost child” entry that won’t appear in the row directory – so the 12th leaf (or rather the 1st leaf) block for 00 is being pointed to by the “LMC”. But what about the missing blocks for 06 ? A treedump shows the answer:


branch: 0x1800204 25166340 (0: nrow: 44, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 81)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 81)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 75)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 81)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 132)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 81)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 81)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 81)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 81)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 81)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 81)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 81)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 81)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 111)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 78)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 81)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 81)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 81)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 81)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 76)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 81)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 81)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 107 rrow: 107)

   leaf: 0x180022b 25166379 (35: nrow: 78 rrow: 78)
   leaf: 0x1800218 25166360 (36: nrow: 152 rrow: 152)
   leaf: 0x180021c 25166364 (37: nrow: 152 rrow: 152)
   leaf: 0x1800220 25166368 (38: nrow: 152 rrow: 152)
   leaf: 0x1800224 25166372 (39: nrow: 152 rrow: 152)
   leaf: 0x1800228 25166376 (40: nrow: 152 rrow: 152)
   leaf: 0x180022c 25166380 (41: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (42: nrow: 11 rrow: 11)

Although there are little glitches along the way the leaf blocks for freepools 0, 1, and 2 have 81 entries per leaf block, the leaf blocks for freepool 3 have 152 – it’s the difference between inserting rows at the end of an index and getting a “90/10” split compared with inserting somewhere in the middle and getting a “50/50” split. In fact, although Oracle uses the idgen1$ sequence to drive the generation of LOB ids, the way it “batches” IDs (the sequence cache size is 50) means that LOB ids can appear out of order when generated by multiple sessions – even in the same freepool: it is possible for the top freepool to suffer 50/50 splits though these are likely to appear relatively rarely – at least, that is, until I start deleting LOBs when a “00 07” section of reusable chunks may appear. (As a side effect, the LOB ids in my table and the id column on the underlying table are out of order with respect to each other.)
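
(If you want to check the sequence for yourself it’s owned by SYS, and its cache size is visible in dba_sequences:)

select  sequence_name, cache_size, last_number
from    dba_sequences
where   sequence_owner = 'SYS'
and     sequence_name  = 'IDGEN1$'
;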

You’ll notice that I’ve left blank lines in the treedump list corresponding to the breaks between the free pools (remember there is a “-1” entry in the treedump for the “leftmost child” that doesn’t appear in the row directory). This helps to explain the few leaf blocks with unusual row counts – they’re the ones at (or very near) the boundaries between freepools.

Bulk Delete

After loading some data in with concurrent inserts and multiple freepools I’m going to do a single big delete from the table to get rid of a lot of “old” data. In fact all I’ll do is delete the rows where id <= 3000. The big question is this – will a single delete put all the reusable blocks into a single freepool, or will it put the reusable space for each LOB into the freepool that the LOB was originally in, or will it find some other way to spread the reusable space evenly across all the free pools ? One freepool or many – both options have good points, both options have bad points.
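
In other words the delete is just the obvious single statement (followed by a commit):

delete from t1 where id <= 3000;
commit;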

Here’s what I got as the treedump after the delete:


branch: 0x1800204 25166340 (0: nrow: 72, level: 1)
   leaf: 0x1800225 25166373 (-1: nrow: 81 rrow: 0)
   leaf: 0x180022d 25166381 (0: nrow: 81 rrow: 0)
   leaf: 0x1800231 25166385 (1: nrow: 81 rrow: 0)
   leaf: 0x1800235 25166389 (2: nrow: 81 rrow: 0)
   leaf: 0x1800239 25166393 (3: nrow: 75 rrow: 0)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 0)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 0)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 0)
   leaf: 0x180020e 25166350 (7: nrow: 81 rrow: 22)
   leaf: 0x1800212 25166354 (8: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (9: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (10: nrow: 132 rrow: 120)

   leaf: 0x1800226 25166374 (11: nrow: 81 rrow: 0)
   leaf: 0x180022a 25166378 (12: nrow: 81 rrow: 0)
   leaf: 0x180022e 25166382 (13: nrow: 81 rrow: 0)
   leaf: 0x1800232 25166386 (14: nrow: 81 rrow: 0)
   leaf: 0x1800236 25166390 (15: nrow: 81 rrow: 0)
   leaf: 0x180023a 25166394 (16: nrow: 81 rrow: 0)
   leaf: 0x180023e 25166398 (17: nrow: 81 rrow: 0)
   leaf: 0x1800207 25166343 (18: nrow: 81 rrow: 0)
   leaf: 0x180020b 25166347 (19: nrow: 81 rrow: 0)
   leaf: 0x180020f 25166351 (20: nrow: 81 rrow: 64)
   leaf: 0x1800213 25166355 (21: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (22: nrow: 111 rrow: 101)

   leaf: 0x1800229 25166377 (23: nrow: 81 rrow: 0)
   leaf: 0x180022f 25166383 (24: nrow: 81 rrow: 0)
   leaf: 0x1800233 25166387 (25: nrow: 78 rrow: 0)
   leaf: 0x1800237 25166391 (26: nrow: 81 rrow: 0)
   leaf: 0x180023b 25166395 (27: nrow: 81 rrow: 0)
   leaf: 0x180023f 25166399 (28: nrow: 81 rrow: 0)
   leaf: 0x1800208 25166344 (29: nrow: 81 rrow: 0)
   leaf: 0x180020c 25166348 (30: nrow: 76 rrow: 0)
   leaf: 0x1800210 25166352 (31: nrow: 81 rrow: 0)
   leaf: 0x1800214 25166356 (32: nrow: 81 rrow: 36)
   leaf: 0x1800230 25166384 (33: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (34: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (35: nrow: 139 rrow: 139)

   leaf: 0x1800227 25166375 (36: nrow: 138 rrow: 138)
   leaf: 0x1800205 25166341 (37: nrow: 126 rrow: 126)
   leaf: 0x1800219 25166361 (38: nrow: 82 rrow: 82)
   leaf: 0x1800272 25166450 (39: nrow: 95 rrow: 95)
   leaf: 0x1800209 25166345 (40: nrow: 118 rrow: 118)
   leaf: 0x180021f 25166367 (41: nrow: 143 rrow: 143)
   leaf: 0x180020d 25166349 (42: nrow: 81 rrow: 81)
   leaf: 0x1800243 25166403 (43: nrow: 90 rrow: 90)
   leaf: 0x1800222 25166370 (44: nrow: 147 rrow: 147)
   leaf: 0x1800211 25166353 (45: nrow: 81 rrow: 81)
   leaf: 0x1800247 25166407 (46: nrow: 73 rrow: 73)
   leaf: 0x1800223 25166371 (47: nrow: 98 rrow: 98)
   leaf: 0x180026a 25166442 (48: nrow: 98 rrow: 98)
   leaf: 0x180021d 25166365 (49: nrow: 127 rrow: 127)
   leaf: 0x1800266 25166438 (50: nrow: 131 rrow: 131)
   leaf: 0x1800215 25166357 (51: nrow: 133 rrow: 133)
   leaf: 0x180026e 25166446 (52: nrow: 141 rrow: 141)
   leaf: 0x180021b 25166363 (53: nrow: 82 rrow: 82)
   leaf: 0x180024b 25166411 (54: nrow: 93 rrow: 93)
   leaf: 0x1800276 25166454 (55: nrow: 109 rrow: 109)
   leaf: 0x180024f 25166415 (56: nrow: 77 rrow: 77)
   leaf: 0x180021e 25166366 (57: nrow: 143 rrow: 143)
   leaf: 0x180027e 25166462 (58: nrow: 126 rrow: 126)
   leaf: 0x1800221 25166369 (59: nrow: 93 rrow: 93)
   leaf: 0x1800253 25166419 (60: nrow: 82 rrow: 82)
   leaf: 0x180027a 25166458 (61: nrow: 97 rrow: 97)
   leaf: 0x1800257 25166423 (62: nrow: 84 rrow: 84)

   leaf: 0x180022b 25166379 (63: nrow: 78 rrow: 0)
   leaf: 0x1800218 25166360 (64: nrow: 152 rrow: 0)
   leaf: 0x180021c 25166364 (65: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (66: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (67: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (68: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (69: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (70: nrow: 11 rrow: 11)


The number of leaf blocks has gone up from 44 to 72 (but that shouldn’t be too much of a surprise – index leaf block space can’t be reused until after the commit, so we were bound to grow the index to insert the entries for reusable chunks).

As before I’ve inserted a few blank lines to break the list into the separate index sections, and you can see that the first few blocks in each of the first three freepools have nrow = 81 and (typically) rrow = 0. These are the leaf blocks where all the LOB entries have been marked as deleted. There are a couple of variations – leaf block 10, for example, shows nrow = 132, rrow = 120: this is the leaf block where freepool 0 (LOB section) overlapped with freepool 1 (LOB section), and the first 10 LOBs in freepool 1 have been marked as deleted. The LOB section for the last freepool (freepool 3) follows the same sort of pattern, though nrow = 152 in most of the blocks.

The important detail is in leaf blocks 36 to 62 – which show nrow = rrow throughout, but with a degree of randomness as to the actual number of index entries. These are the leaf blocks that record the “reusable chunks”, and they’ve all been associated with freepool 2 (counting from zero). There are several details that combine to explain why the numbers of entries per leaf block vary so much, but I don’t want to get too distracted by them now; remember, though, that I pointed out that the LOB ids and table id column weren’t in synch with each other so part of what you’re seeing here is 50/50 leaf node splits followed by a little back-filling.

Again I’ve extracted the “col 0” values from the block dump of the root block – I won’t show all of them, I’ll just show you entries 35 to 63 so that you can see the leaf block pointers for the “reusable” section of freepool 2, and the LOB section of freepools 2 and 3:


col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d

col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

col 0; len 02; (02):  00 06

The starting “00 05” tells us that these are the reusable chunks associated with freepool 2 (2 * 2 + 1 = 5), and the fact that the next four bytes are identical across the entries tells you that I managed to delete my 3,000 LOBs in less than one second.
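
(If you want to turn that key prefix back into something readable, here’s a little sketch of the arithmetic – reversing the “2 * freepool + 1” rule on the first two bytes and treating the next four bytes as seconds since 1st Jan 1970, ignoring any timezone adjustment:)

select
        (to_number('0005','xxxx') - 1) / 2                              freepool,
        date '1970-01-01' + to_number('57b4d37d','xxxxxxxx') / 86400    delete_time
from
        dual
;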

After seeing the effect of this monolithic delete you should now be asking yourself a few questions, such as:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

TO BE CONTINUED …


August 18, 2016

Basicfile LOBs 2

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 1:21 pm BST Aug 18,2016

There are probably quite a lot of people still using Basicfile LOBs, although Oracle Corp. would like everyone to migrate to the (now default) Securefile LOBs. If you’re on Basicfile, though, and don’t want (or aren’t allowed) to change just yet here are a few notes that may help you understand some of the odd performance and storage effects.

Of course, there are a lot of variations in how you declare the LOB – pctversion vs. retention, cache vs. nocache, logging vs. nologging, enable vs. disable storage in row, and I can’t cover all the combinations – so what I’ll be looking at is a general strategy for handling a large number of small LOBs that are being inserted into the database at a fairly high degree of concurrency, and then being deleted a few weeks later so, in theory, the LOB segment should end up at a steady state with “N” days worth of data stored. The driver behind this modelling is a problem I was asked to examine a little while ago.

Some background details on Basicfile LOBs

If the LOB column is defined as “enable storage in row” then a very small LOB (up to 3,960 bytes) will be stored almost as if it were an ordinary column in the row; if the size of a LOB is a little larger than this limit then the LOB will be stored in chunks in the LOB segment and pointers to the first 12 chunks will be stored in the row, with pointers for further chunks stored in the LOBINDEX. The chunk size defined for a LOB column can be up to 32KB – though the default chunk size is the block size for the containing tablespace –  so it’s possible to store a LOB of nearly 384KB before Oracle needs to create index entries in the LOBINDEX, though most people use 8KB blocks and will start using the LOBINDEX when a LOB gets close to 96KB.

If the LOB column is defined as “disable storage in row” then, no matter how small it really is, it will always take up at least one chunk in the LOB segment and will have a corresponding index entry in the LOBINDEX.  For reasons of efficiency an entry in LOBINDEX always “carries” 32 bytes of pointer data, allowing it to list up to 8 chunks.
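
(For reference, the easiest way to see how a LOB column has been declared is a query against user_lobs – a minimal sketch, using the t1/c1 names from the test case that appears below:)

select  column_name, segment_name, index_name,
        chunk, in_row, cache, logging, pctversion, retention, freepools
from    user_lobs
where   table_name = 'T1'
;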

When a LOB is deleted (replaced by a null, an empty_lob(), or a new LOB value) the previous state of the base table row and the LOBINDEX will be preserved in the undo segment in the ordinary manner but the previous version of the LOB data itself is simply left in the segment (and a new version of the LOB created if the operation is an “update”). The chunks making up the old version are added to the LOBINDEX with a key based on the time (seconds since 1st Jan 1970) the delete took place – this means that when Oracle wants to re-use space in the LOB segment it can walk the LOBINDEX in order to find the chunks that were marked as available for reuse the greatest time into the past. (It also means that the LOBINDEX is one of the strangest in the Oracle pantheon – part of it indexes “reusable chunks keyed by time”, part of it indexes “current chunks keyed by LOB id”.)

There are two options for how long old versions of LOBs will be kept: PCTVERSION specifies the percentage of space below the segment’s highwater mark that may be used to keep old versions, and (until 12c, where things change) RETENTION specifies that Oracle should try to keep old versions for the length of time given by the system parameter undo_retention (which defaults to 900 seconds). If enough versions of LOBs have been kept Oracle can create a read-consistent version of a given LOB by using the normal undo mechanisms to take the base table row and LOBINDEX back to the correct point in time which will then ensure that the LOB pointers will be pointing to the correct chunks.  (If the LOB chunks have been over-written this is the point where you will get an Oracle error: “ORA-22924 Snapshot too old”, followed by a misleading “ORA-01555 Snapshot too old ….”)
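
(The two options are mutually exclusive; you can switch an existing column from one to the other with a “modify lob” – again a sketch using the t1/c1 names from the test case below:)

alter table t1 modify lob (c1) (pctversion 10);
alter table t1 modify lob (c1) (retention);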

One final interesting point from a performance perspective is that if you define the LOB to be “nocache”, which means that typical reads and writes of the lob will use direct path, and then specified “nologging” then reads and writes of the LOB will generate tiny amounts of redo log.  Two special points to go with this, though: if you specify “nocache logging” the direct path writes will be logged, but the log content will be by chunk – so if you store 4,000 bytes of data in a LOB with a 32KB chunk size you will write 32KB of redo log; secondly if you are testing the effects of logging and nologging, make sure your test database is running in archivelog mode if your production database is going to be archiving – otherwise Oracle will fool you by taking a short cut and NOT logging a nocache LOB even if you specify logging! The LOBINDEX is always cached and logged, by the way, and even if the LOB is defined as nocache there are circumstances where LOB blocks are read into the buffer cache (remember my previous note describing how we saw 6 billion buffer gets on a nocache LOB).

The last detail I want to mention is the FREEPOOLS parameter. The developers’ guide for 11.2 describes this as: “Specifies the number of FREELIST groups for BasicFiles LOBs, if the database is in automatic undo mode.” Unfortunately freelists and freelist groups are things that happen in manual segment space management so this definition requires an alternative meaning for the expression “FREELIST groups”. The purpose of FREEPOOLS is to help deal with concurrency problems but there’s not much information around to help you understand the mechanisms and pitfalls of freepools and the available documents on MoS don’t really do anything to clarify the position – and that’s what this article is (finally) going to talk about.

Basicfile FREEPOOLs – the truth is out there

When you specify FREEPOOLs you affect the way Oracle uses the LOBINDEX – not the space management information about the segment holding the index but the actual content of (in fact the KEY values held by) the index.

You can do a treedump of a LOBINDEX by object_id in the standard way that you do a treedump of any B-tree (or bitmap) index, and you can dump blocks from a LOBINDEX in the same way you dump any other data block in the database, by file number and block number (or block range), so it’s easy to see what happens in a LOBINDEX when you start using multiple freepools. I’ve created a table holding a LOB defined with “disable storage in row” so that I always use the LOBINDEX, inserted three rows then deleted one of them and dumped the one index block (which happens to be both the root and a leaf). Here’s the SQL to create the table and do the data handling:


create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        nocache
        tablespace test_8k_assm
)
;


declare
        m_v1 varchar2(32767) := rpad('x',12000,'x');
begin
        for i in 1..3 loop
                insert into t1 values (i, m_v1);
                commit;
        end loop;
end;
/


delete from t1 where id = 1;
commit;

alter system flush buffer_cache;

I’ve ended by flushing the buffer cache so that I don’t get a huge trace file when I try to dump the index to disc. Here’s the next bit of processing:


SQL> select object_id from user_objects where object_type = 'INDEX' and object_name like 'SYS_IL%';

 OBJECT_ID
----------
    241599

SQL> alter session set events 'immediate trace name treedump level 241599';

----- begin tree dump
leaf: 0x1800204 25166340 (0: nrow: 4 rrow: 3)
----- end tree dump

SQL> alter system dump datafile 6 block 516;

I’ve included in the above the treedump that I extracted from the tracefile and this shows that the index consists of a single leaf block (0x1800204 = file 6 block 516) with 4 row directory entries of which one has been deleted. Here’s the row dump from that leaf block – the first three entries are the index entries identifying the three LOBs I created (and, as shown by the flag value “—D–“, the first has been marked as deleted) the fourth entry points to a set of free chunks (corresponding to the chunks that will become available for re-use after a delay corresponding to the undo retention time).


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 a6 01 80 01 aa 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 ae 01 80 01 b2 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
 00 20 03 00 00 00 00 01 0f 1c 00 00 00 00 00 01 01 80 01 b6 01 80 01 ba 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
 01 80 01 a6 01 80 01 aa 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00
col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00
col 1; len 4; (4):  01 80 01 a6

If you examine the entries closely you will see that, despite the common structure of all four of them, there are two patterns. Ignoring the “data (32):” portion and looking at just “col 0”, the last few bytes of the first three entries hold consecutive numbers which are actually the LOB Ids for the three LOBs (derived from the sequence idgen1$). The fourth entry breaks that pattern: if you examine bytes 3 to 6 you will find that they hold (approximately, by the time I publish this article) the number of seconds since 1st Jan 1970.
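
If you want to check that conversion for yourself, here’s a quick bit of SQL using the four bytes 57 b3 32 9b from my dump (the arithmetic is the point, the actual bytes will obviously vary):

select
        to_number('57b3329b','xxxxxxxx')                        seconds_since_epoch,
        to_date('1970-01-01','yyyy-mm-dd')
                + to_number('57b3329b','xxxxxxxx') / 86400      approx_delete_time
from
        dual
;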

To a large degree you need only examine “col 0” to get a good idea of how Oracle handles the LOBINDEX, but I will say just a few things about the rest of the entry.  For the “reusable space” index entries “col 1” is the first of a list of up to 8 chunks that were released from the same LOB at that moment, and the “data(32)” is the full list of those 8 chunks – each chunk is identified by the block address of the first block of the chunk. If I had created and deleted a LOB of roughly 128KB I would have used 16 chunks to create it and made 16 chunks available on deletion, so there would have been two index entries with the same “col 0” value, each identifying 8 of the chunks – hence the re-appearance of the first chunk as “col 1”.  (You can’t help wondering why Oracle doesn’t squeeze 9 chunks per index entry rather than repeating the first of the list – maybe there’s a peripheral effect that makes 8 easier, maybe it’s simply a good sanity check mechanism.)

For index entries about current LOBs “col 1” is a counter for the set of index entries that identify a single LOB. Our LOBs were all very small so we only needed one index entry each (which Oracle starts counting from zero).  The “data (32)” entry for the “zeroth” entry starts with 16 bytes of metadata then holds up to 4 pointers to chunks; subsequent entries don’t need the metadata and can hold up to 8 pointers each, and “col 1” stores the chunk number that the index entry starts with, so “col 1” in consecutive index entries for a given LOB id will have values 0, 4, 12, 20, etc.
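
I haven’t demonstrated the multi-entry case here, but a simple way of producing it with the table above (a sketch only – I haven’t reproduced its output) would be to insert a single LOB of 40-odd chunks, then repeat the treedump and block dump and look for several index entries with the same LOB id and “col 1” values following the 0, 4, 12, 20 … pattern:

declare
        m_big   clob;
begin
        for i in 1..40 loop
                m_big := m_big || rpad('x',8000,'x');
        end loop;
        insert into t1 values (100, m_big);
        commit;
end;
/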

You might note, by the way, that my LOBs are not made up of consecutive blocks even though my chunk size is exactly one block. This is a side effect of ASSM (automatic segment space management) and nothing specifically to do with LOBs.

With this sketch in place you now have some idea of how a LOBINDEX works. Apart from the convenience of knowing roughly what information is stored in the index, and how it has this strange dual purpose, you can now view it just like any other B-tree index in Oracle. When you insert a LOB you insert some index entries into the middle of the index (the high-value point of the LOB id section); when you delete a LOB you mark some (consecutive) index entries as deleted and insert some index entries at the high end of the index (the high-value point of the “reusable chunks” section) because each delete is the most recent delete.

As soon as you’ve got that far you realise that if you have some degree of concurrency of inserts and deletes then you have two hot spots in the index – the mid point where you’re going to get lots of 50/50 splits as LOBs are inserted and the end point where all the re-usable chunks are indexed. So how do you configure Oracle, and what does Oracle do, to reduce the contention ?

Take another look at the “col 0” values – which I’ve cut out and listed in isolation below:


col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 85
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 86
col 0; len 10; (10):  00 00 00 01 00 00 09 d6 64 87

col 0; len 10; (10):  00 01 57 b3 32 9b 00 00 00 00

Apart from showing the sequencing of the 3 LOB ids, this layout makes it easy to note that the first three (LOB) entries start with “00 00” while the last (reusable space) entry starts with “00 01”. It’s really these first two bytes that make it easy for Oracle to separate the current LOBs section of the index from the reusable space section. The two bytes are the freepool identifier – it’s the first (and only, in my example) freepool – but Oracle is counting from zero, doubling the counter for the current LOBs, and doubling and adding one for the reusable space.
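
In other words, given the leading two bytes of a “col 0” you can work backwards to the freepool number and the section type – a trivial decode using the “00 01” prefix of the reusable space entry above:

select
        to_number('0001','xxxx')                        prefix_value,
        trunc(to_number('0001','xxxx') / 2)             freepool#,
        case mod(to_number('0001','xxxx'), 2)
                when 0 then 'current LOBs section'
                else        'reusable space section'
        end                                             section_type
from
        dual
;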

Here are some results when I drop and recreate the table with freepools 4 and repeat the experiment. (I’ve removed the “data(32)” content to make the output a little cleaner, and then extracted the “col 0” values).
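
The recreated table isn’t shown here but, assuming everything else stays the same, the only change to the original definition is the addition of a freepools clause:

create table t1(
        id      number constraint t1_pk primary key,
        c1      clob
)
lob (c1)
store as basicfile text_lob(
        disable storage in row
        chunk 8k
        retention
        freepools 4
        nocache
        tablespace test_8k_assm
)
;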


row#0[7982] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 55
col 1; len 4; (4):  00 00 00 00

row#1[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#2[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#3[7832] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 55
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

It just happened that with 4 freepools available my session picked freepool 3, so its LOBINDEX entries are prefixed with 00 06 (2 * 3), and its reusable space index entries are prefixed with 00 07 (2 * 3 + 1). At present I think the freepool chosen by a session (counting from zero) is derived from the session’s process id (pid) by a simple mod(pid, freepools).
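
If you want to test that hypothesis on your own system, a query along the following lines (my guess at the arithmetic, not something Oracle documents) predicts the freepool – and hence the two-byte prefixes – that the current session would use with freepools 4:

select
        s.sid,
        p.pid,
        mod(p.pid, 4)                   predicted_freepool,
        2 * mod(p.pid, 4)               lob_section_prefix,
        2 * mod(p.pid, 4) + 1           reusable_section_prefix
from
        v$session       s,
        v$process       p
where
        p.addr = s.paddr
and     s.sid  = sys_context('userenv','sid')
;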

So what happens if I start a second session and adjust my little PL/SQL procedure to insert rows 4, 5, and 6 ?

I expect to see two things. First, the “ordinary” B-tree event – the index entry that’s marked for deletion will be cleared out of the index; secondly I should see four new index entries (one marked as deleted) which, with a little luck (one chance in four), will show that they are associated with a different freepool.

Here’s the dump (again with the “data(32)” deleted, and the “col 0” extracted at the end):


row#0[7782] flag: ---D--, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 87
col 1; len 4; (4):  00 00 00 00

row#1[7732] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 88
col 1; len 4; (4):  00 00 00 00

row#2[7682] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 00 00 01 00 00 09 da 36 89
col 1; len 4; (4):  00 00 00 00

row#3[7632] flag: ------, lock: 2, len=50, data:(32):
col 0; len 10; (10):  00 01 57 b3 3b ad 00 00 00 00
col 1; len 4; (4):  01 80 01 a4

row#4[7932] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 56
col 1; len 4; (4):  00 00 00 00

row#5[7882] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 06 00 01 00 00 09 da 36 57
col 1; len 4; (4):  00 00 00 00

row#6[7832] flag: ------, lock: 0, len=50, data:(32):
col 0; len 10; (10):  00 07 57 b3 3b a5 00 00 00 00
col 1; len 4; (4):  01 80 01 df

===

col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 87
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 88
col 0; len 10; (10): 00 00 00 01 00 00 09 da 36 89

col 0; len 10; (10): 00 01 57 b3 3b ad 00 00 00 00

col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 56
col 0; len 10; (10): 00 06 00 01 00 00 09 da 36 57

col 0; len 10; (10): 00 07 57 b3 3b a5 00 00 00 00

The index entry previously marked as deleted has disappeared (it was LOB id “09 da 36 55”).

We have four new index entries – the first 4 in the list above – and we can see that our second session has been allocated to freepool 0, the LOBINDEX entries are preceded by “00 00”, and the reusable space index entry is preceded by “00 01”.

So by declaring freepools N, we effectively break the index up into 2N nearly discrete sections. Half the sections get inserts at their high end as we insert new LOBs (with ever increasing LOB ids) and the other half (apart, sometimes, from the very top section) get inserts at the high end as time passes and we make LOB space available for reuse by deleting existing LOBs. (Note – if two LOBs of more than 8 chunks each are deleted in the same hundredth of a second then their index entries may end up interleaving as the full key is (timestamp, first chunk address) and the chunks may be scattered widely across the tablespace). Freepools allow Oracle to remove the two killer hot spots in the index.

There are side effects, of course: apart from the section for reusable space in the top freepool each section of the index will generally be subject to 50/50 block splits so you can expect the index to be roughly twice the optimum size – and even bigger than that due to other side effects of how it’s used if you’re constantly deleting and inserting LOBs. But size isn’t really the big problem; I’ll be examining further side effects of the LOBINDEX, and the mechanism that Oracle has for using the index, and the performance threats this introduces, in the next installment.


August 5, 2016

Basicfile LOBs 1

Filed under: Infrastructure,LOBs,Oracle — Jonathan Lewis @ 4:12 pm BST Aug 5,2016

I got a call to look at a performance problem involving LOBs a little while ago. The problem was with an overnight batch that had about 40 sessions inserting small LOBs (12KB to 22KB) concurrently, for a total of anything between 100,000 and 1,000,000 LOBs per night. You can appreciate that this would eventually become a very large LOB segment – so before the batch started all LOBs older than one month were deleted.

The LOB column had the following (camouflaged) declaration:

 LOB (little_lob) STORE AS BASICFILE (
        TABLESPACE lob_ts
        ENABLE STORAGE IN ROW
        RETENTION
        NOCACHE
        LOGGING
)

The database was 11gR2, the tablespace was defined with ASSM with uniform 1MB extents and a blocksize of 8KB (so the LOBs were all 2 or 3 chunks) and the undo retention time was 900 seconds. The effect of the “enable storage in row” is that the LOBINDEX didn’t have to hold any details of current LOB chunks (for in-row, the first 12 chunks are listed in the LOB Locator in the base table).

So, examining an AWR report covering the critical interval, reviewing the captured ASH data, and checking the database, a few questions came to mind:

  • With 200 GB of current LOB data in the segment, why was the segment roughly 800 GB ?
  • With no need for current LOBs to be indexed, how had the LOB index reached 500,000 blocks in size ?
  • There had been 500,000 inserts that night – so why had Oracle done 6 billion (cached) buffer gets on the (nocache) LOB segment ?
  • Given that the LOB segment had not changed size during the night, why had there been millions of HW enqueue waits on the inserts ?

Knowing the stuff that I did know about basicfile LOBs it seemed likely that the most significant problem was that the segment hadn’t been created with multiple freepools which, according to the very sparse (and not entirely self-consistent) documentation, exist to allow improved concurrency. So I thought I’d search the Internet for any useful information about freepools, how they worked, what impact they might have on this problem, why their absence might produce the symptoms I’d seen, and what the best course of action would be to address the problem.

Of course the “correct” solution according to MoS would be to convert from basicfile to securefile – with a strange insistence on using online redefinition, but no explanation of why a simple CTAS or alter table move is undesirable or dangerous. Unfortunately there are a couple of notes on MoS describing performance issues with “high” levels of concurrent inserts that need to be addressed by setting hidden parameters so I’m not (yet) keen on rebuilding 700GB of a production system to produce a change that might still not work quickly enough; especially since I couldn’t find anything on MoS that could quantify the time needed to do the conversion.

To my surprise I couldn’t find a single useful piece of information about the problem. The only articles I could find seemed to be little bits of cut-n-paste from the Oracle manual pages about using multiple freepools, and the best of those actually demonstrated rebuilding or changing the freepools settings on a LOB of a few megabytes. The most significant MoS note did say that the process “could be slow” and would lock the table. But surely someone, somewhere, must have tried it on a big system and had some idea of “how slow”.
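
For reference, the command those notes describe takes the following form (the table name here is hypothetical, to match the camouflaged column); presumably it has to reorganise the whole LOBINDEX, which is why it locks the table and “could be slow” on a large segment:

alter table big_table modify lob (little_lob) (freepools 20);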

In the end I had to start building some simple models and doing a few experiments to find out what happens and where the time goes and what causes the strange results and – most importantly – how freepools might help. Fortunately, following a call to the Oak Table for any ideas or links to useful documents, I got a pointer to the original Oracle patents which were enormously helpful in showing why freepools could help and why, in the wrong circumstances, you could still end up with a (slightly smaller) disaster on your hands.

 

To be continued …

Footnote

If you’re interested, the patent numbers are: 5,999,943 and 6,061,678.  Now I just need someone to tell me the numbers for the securefile LOBs patents.

 

July 12, 2016

Union All MV

Filed under: Infrastructure,Materialized view,Oracle — Jonathan Lewis @ 10:10 am BST Jul 12,2016

In an article I wrote last week about Bloom filters disappearing as you changed a SELECT to a (conventional) INSERT/SELECT I suggested using the subquery_pruning() hint to make the optimizer fall back to an older strategy of partition pruning. My example showed this working with a range partitioned table, but one of the readers reported a problem when trying to apply the strategy to a composite range/hash partitioned table and followed this up with an execution plan of a select statement with a Bloom filter where the subquery_pruning() hint didn’t introduce subquery pruning when the select was used for an insert.

A couple of standard ways to work around this problem are to embed the select statement in a pipelined function so that we can “insert into table select from table(pipelined_function)”, or to write a pl/sql block that opens a cursor to do a select with bulk collect and loops through an array insert. The overhead in both cases is likely to be relatively small (especially when compared with the overhead of failing to filter). In this case, however, the reader suggested that maybe the problem appeared because the driving table (i.e. the one that would have been queried to derive the pruning values) was actually an inline view with a union all.

After modifying my working model to try a couple of different tests I was inclined to agree. Since the two tables in the view looked as if they were likely to be relatively tiny and static I suggested that it would be safe to create a materialized view defined to “refresh on commit” and then use the materialized view explicitly in the query. This, finally, brings me to the point of today’s article – how do you create such a materialized view ?

I’m going to start by creating a couple of small base tables from a familiar object:


create table tt as select * from all_objects where object_type = 'TABLE';
create table tv as select * from all_objects where object_type = 'VIEW';

alter table tt add constraint tt_pk primary key (object_id);
alter table tv add constraint tv_pk primary key (object_id);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TT',
                method_opt       => 'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'TV',
                method_opt       => 'for all columns size 1'
        );
end;
/

Assume, now, that I need an inline view that is interested in the things you will recognise from the above as the tables owned by OUTLN (which will appear in tt) and the views owned by SYSTEM (which will appear in tv) – in the 11.2.0.4 system I’m playing on at the moment that’s three rows from each of the two tables. Here’s the SQL I’d put into the inline view:


select
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

Since this view won’t give me partition pruning I have to replace it with a table and because I want to ensure that the table is always up to date I have to generate it as the container for a materialized view with refresh on commit. First I need some materialized view logs so that I can do a fast refresh:


create materialized view log on tt
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

create materialized view log on tv
with
        rowid, primary key
        (object_type, object_name, owner)
including new values
;

I’ve included the primary key in the definition because I happen to want the object_id column in the log – but I could just have included it as a column in the filter list. I’ve included the rowid in the definition because Oracle needs the rowid if it’s going to be able to do a fast refresh. I can now create a materialized view:


create materialized view mv_t
        build immediate
        refresh fast on commit
as
select
        'T' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tt
where   owner = 'OUTLN'
union all
select
        'V' mv_marker,
        rowid rid,
        object_id, object_type, object_name
from    tv
where   owner = 'SYSTEM'
;

I’ve taken the option to “build immediate” and specified – most importantly for my needs – “refresh on commit”. You’ll notice I haven’t chosen to “enable query rewrite”; for the purposes of this demo I don’t need that particular feature.

There are two key features to the materialized view that are a little special – first I’ve included the rowid of each source table as a named column in the materialized view; as I mentioned above Oracle will not allow the view to be fast refreshable without the rowid. The second feature is that I’ve introduced a literal value into the view which I’ve named mv_marker; this makes it easy to see which table a row comes from when you query the materialized view … and Oracle needs to see this.

That’s the job done. Just to demonstrate that my materialized view is working as required here’s a little more SQL (followed by the output):


select * from mv_t;

delete from tt where object_name = 'OL$';
update tv set object_name = 'PRODUCT_PRIVILEGES' where object_name = 'PRODUCT_PRIVS';

commit;

select * from mv_t;

=======================================

M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAI        471 TABLE               OL$
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVS

6 rows selected.

2 rows deleted.


1 row updated.


Commit complete.


M RID                 OBJECT_ID OBJECT_TYPE         OBJECT_NAME
- ------------------ ---------- ------------------- --------------------------------
T AAA6tXAAFAAAAEBAAJ        474 TABLE               OL$HINTS
T AAA6tXAAFAAAAEBAAK        478 TABLE               OL$NODES
V AAA6tWAAFAAAACgABI       8260 VIEW                SCHEDULER_PROGRAM_ARGS
V AAA6tWAAFAAAACgABJ       8261 VIEW                SCHEDULER_JOB_ARGS
V AAA6tWAAFAAAACuAA7      14233 VIEW                PRODUCT_PRIVILEGES

5 rows selected.

If you’re wondering why you see “2 rows deleted” but a reduction by just one row in the final output, remember that we’re deleting from table tt but the materialized view holds information about just the subset of tables owned by OUTLN – I happen to have a row in tt that says SYSTEM also owns a table called OL$.

Assistance

If you have trouble working out why your attempts to create a particular materialized view aren’t working the dbms_mview package has a procedure called explain_mview that may give you enough ideas to work out what you’re doing wrong. For example, here’s how I could find out that I needed a literal column to tag the two parts of my union all view:


@$ORACLE_HOME/rdbms/admin/utlxmv.sql

begin
        dbms_mview.explain_mview (
                q'{
                create materialized view mv_t
                        build immediate
                        refresh fast
                        enable query rewrite
                as
                select  -- 'T' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tt
                union all
                select  -- 'V' mv_marker,
                        rowid rid,
                        object_id, object_type, object_name from tv
                }'
        );
end;
/

column cap_class noprint
column related_text format a7
column short_msg format a72
break on cap_class skip 1

select
        substr(capability_name,1,3) cap_class,
        capability_name, possible, related_text, substr(msgtxt,1,70) short_msg
from
        mv_capabilities_table
where
        mvname = 'MV_T'
order by
        substr(capability_name,1,3), related_num, seq
;

The first line calls a supplied script to create a table called mv_capabilities_table in the current schema. The call to dbms_mview.explain_mview passes the text of a “create materialized view” statement to the procedure (there are a couple of variations possible) then, after a couple of SQL*Plus formatting commands I’ve queried the table to see Oracle’s analysis for the statement. (You can tag each call to this procedure using a second parameter that I haven’t bothered to use.)
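
As an aside, if you do use that second parameter it simply tags the rows written to mv_capabilities_table so that you can keep the results of several experiments apart – a sketch with a made-up tag and a cut-down MV definition:

begin
        dbms_mview.explain_mview(
                mv      => 'create materialized view mv_t refresh fast on commit as select rowid rid, object_id from tt',
                stmt_id => 'MV_T_TEST_1'
        );
end;
/

select  capability_name, possible, substr(msgtxt,1,70) short_msg
from    mv_capabilities_table
where   statement_id = 'MV_T_TEST_1'
order by
        seq
;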

Here’s the output for the failed attempt above, which has commented out the literals that tag the two parts of the UNION ALL:

CAPABILITY_NAME                POS RELATED SHORT_MSG
------------------------------ --- ------- ------------------------------------------------------------------------
PCT_TABLE                      N   TT      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TT      relation is not a partitioned table
PCT_TABLE                      N   TV      relation is not a partitioned table
PCT_TABLE_REWRITE              N   TV      relation is not a partitioned table
PCT                            N

REFRESH_COMPLETE               Y
REFRESH_FAST                   N
REFRESH_FAST_AFTER_INSERT      N           the materialized view does not have a UNION ALL marker column
REFRESH_FAST_AFTER_INSERT      N           set operator in a context not supported for fast refresh
REFRESH_FAST_AFTER_ONETAB_DML  N           see the reason why REFRESH_FAST_AFTER_INSERT is disabled
REFRESH_FAST_AFTER_ANY_DML     N           see the reason why REFRESH_FAST_AFTER_ONETAB_DML is disabled
REFRESH_FAST_PCT               N           PCT FAST REFRESH is not possible if query has set operand query blocks

REWRITE                        Y
REWRITE_FULL_TEXT_MATCH        Y
REWRITE_PARTIAL_TEXT_MATCH     N           set operator encountered in mv
REWRITE_GENERAL                N           set operator encountered in mv
REWRITE_PCT                    N           general rewrite is not possible or PCT is not possible on any of the d


17 rows selected.

The query manages to split the output into three sections (but that depends on a side-effect in a way that I would normally call bad design): elements relating to “Partition Change Tracking”, elements relating to “Materialized View Refresh” and elements relating to “Query Rewrite”. You’ll notice that the rewrite section tells me that (even though I haven’t chosen to enable it) my view could be enabled to do query rewrite.

Critically, though, this version of the materialized view can’t be fast refreshed, and we see the key reason in the first “Refresh fast after insert” line: “the materialized view does not have a UNION ALL marker column”. That’s how I know I have to include a literal column that has a different value in each of the two parts of the UNION ALL.

June 24, 2016

Never …

Filed under: Infrastructure,Oracle,Philosophy — Jonathan Lewis @ 1:15 pm BST Jun 24,2016

From time to time a question comes up on OTN that results in someone responding with the mantra: “Never do in PL/SQL that which can be done in plain SQL”. It’s a theme I’ve mentioned a couple of times before on this blog, most recently with regard to Bryn Llewellyn’s presentation on transforming one table into another and Stew Ashton’s use of analytic functions to solve a problem that I got stuck with.

Here’s a different question that challenges that mantra. What’s the obvious reason why someone might decide to produce the following code rather than writing a simple “insert into t1 select * from t2;”:


declare

        cursor c1 is
        select * from t2
        ;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

begin

        open c1;
        loop
                fetch c1
                bulk collect into m_tab limit 100;

                begin
                        forall i in 1..m_tab.count
                                insert into t1 values m_tab(i);
                exception
                        when others
                                then begin
                                        --  proper exception handling should go here
                                        dbms_output.put_line(m_tab(1).id);
                                        dbms_output.put_line(sqlerrm);
                                end;
                end;

                exit when c1%notfound;

        end loop;
        close c1;
end;
/

There is a very good argument for this approach.

Follow-up (Saturday 25th)

As Andras Gabor pointed out in one of the comments, there are documented scenarios where the execution plan for a simple select statement is not legal for the select part of an “insert into .. select …” statement. Specifically, if you have a distributed query the most efficient execution plan may require the remote site to be the driving site, but the plan for a CTAS or insert/select is required to use the local site as the driving site.

There are workarounds – if you’re allowed to use them – such as creating a view at the remote site and selecting from the view, or you could create a pipelined function locally and select from the pipelined function (but that’s going to be writing PL/SQL anyway, and you’d have to create one or two object types in the database to implement it).

Another example of plan limitations, one that I had not seen before (but have now found documented as “not a bug” in MoS note 20112932), showed up in a comment from Louis: a select statement may run efficiently because the plan uses a Bloom filter, but the filter disappears when the statement is used in an insert/select.

These limitations, however, were not the point I had in mind. The “obvious” reason for taking the pl/sql approach is error handling. What happens if one of the rows in your insert statement raises an Oracle exception ? The entire statement has to roll back. If you adopt the PL/SQL array processing approach then you can trap each error as it occurs and decide what to do about it – and there’s a really important detail behind that statement: the PL/SQL can operate at virtually the same speed as the simple SQL statement once you’ve set the arraysize to a value which allows each insert to populate a couple of blocks.

Let me emphasise the critical point of the last sentence:  array inserts in PL/SQL operate at (virtually) the speed of the standard SQL insert / select.

As it stands I don’t think the exception handler in my code above could detect which row in the batch had caused the error – I’ve just printed the ID from the first row in the batch as a little debug detail that’s only useful to me because of my knowledge of the data. Realistically the PL/SQL block to handle the inserts might look more like the following:


-- In program declaration section

        dml_errors      exception;
        pragma exception_init(dml_errors, -24381);

        m_error_pos     number(6,0)     := 0;

-- ------------------------------

                begin
                        forall i in 1..m_tab.count save exceptions
                                insert into t1 values m_tab(i);
                exception
                        when dml_errors then begin

                                for i in 1..sql%bulk_exceptions.count loop

                                        dbms_output.put_line(
                                                'Array element: ' ||
                                                        sql%bulk_exceptions(i).error_index || ' ' ||
                                                        sqlerrm(-sql%bulk_exceptions(i).error_code)
                                        );

                                        m_error_pos := sql%bulk_exceptions(i).error_index;
                                        dbms_output.put_line(
                                                'Content: ' || m_tab(m_error_pos).id || ' ' || m_tab(m_error_pos).n1
                                        );

                                end loop;
                        end;

                        when others then raise;
                end;


You’ll notice that I’ve added the SAVE EXCEPTIONS clause to the FORALL statement. This allows Oracle to trap any errors that occur in the array processing step and record details of the guilty array element as it goes along, storing those details in an array called SQL%BULK_EXCEPTIONS. My exception handler then handles the array processing exception by walking through that array.

I’ve also introduced an m_error_pos variable (which I could have declared inside the specific exception handler) to remove a little of the clutter from the line that shows I can identify exactly which row in the source data caused the problem. With a minimum of wasted resources this code now inserts all the valid rows and reports the invalid rows (and, if necessary, could take appropriate action on each invalid row as it appears).

If you’ve got a data loading requirement where almost all the data is expected to be correct but errors occasionally happen, this type of coding strategy is likely to be the most efficient thing you could do to get your data into the database. It may be slightly slower when there are no errors, but that’s a good insurance premium when compared with the crash and complete rollback that occurs if you take the simple approach – and there are bound to be cases where a pre-emptive check of all the data (that would, probably, make the insert safe) would add far more overhead than the little bit of PL/SQL processing shown here.

Results

It’s obviously a little difficult to produce any time-based rates that demonstrate the similarity in performance of the SQL and PL/SQL approaches – the major time component in a little demo I built was the I/O rather than the CPU (which, in itself, rather validates the claim anyway). But if you want to do some testing here’s my data model, with some results in the following section:


rem
rem     Script: plsql_loop_insert.sql
rem     Author: Jonathan Lewis
rem

execute dbms_random.seed(0)

create table t1
nologging
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        cast(rownum as number(8,0))                     id,
        2 * trunc(dbms_random.value(1e10,1e12))         n1,
        cast(lpad('x',100,'x') as varchar2(100))        padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
create table t2
nologging
noparallel
as
select
        /*+ no_parallel(t1) */
        id + 1e6        id,
        n1 - 1          n1,
        rpad('x',100,'x') padding
from t1 
;

-- update t2 set n1 = n1 + 1 where id = 2e6;
-- update t2 set n1 = n1 + 1 where id = 2e6 - 10;
-- update t2 set n1 = n1 + 1 where id = 2e6 - 20;
-- update t2 set n1 = n1 + 1 where id = 1750200;
-- update t2 set n1 = n1 + 1 where id = 1500003;
-- update t2 set n1 = n1 + 1 where id = 1500001;
commit;

alter system checkpoint;
alter system switch logfile;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T2',
                method_opt       => 'for all columns size 1'
        );
end;
/

create unique index t1_i1 on t1(n1) nologging;
create unique index t1_pk on t1(id) nologging;
alter table t1 add constraint t1_pk primary key(id);


I’ve generated 1 million rows with an id column and a random integer – picking the range of the random numbers to give me a very good chance (which worked) of getting a unique set of values. I’ve doubled the random values I use for t1 so that I can subtract 1 and still guarantee uniqueness when I generate the t2 values (I’ve also added 1 million to the id value for t2 for the same uniqueness reason).

The optional update to add 1 to a scattering of rows in t2 ensures that those values go back to their original t1 values so that they can cause “duplicate key” errors. The SQL insert was a simple insert into t1 select * from t2 (ensuring that parallel query didn’t come into play), and the PL/SQL detail I used was as follows:

declare

        cursor c1 is
        select /*+ no_parallel(t2) */ * from t2
        ;

        type c1_array is table of c1%rowtype index by binary_integer;
        m_tab c1_array;

        dml_errors      exception;
        pragma exception_init(dml_errors, -24381);

        m_error_pos     number(6,0)     := 0;

begin

        open c1;
        loop
                fetch c1
                bulk collect
                into m_tab limit 100;

                begin
                        forall i in 1..m_tab.count save exceptions
                                insert into t1 values m_tab(i);

                exception
                        when dml_errors then begin

                                for i in 1..sql%bulk_exceptions.count loop

                                        dbms_output.put_line(
                                                'Array element: ' ||
                                                        sql%bulk_exceptions(i).error_index || ' ' ||
                                                        sqlerrm(-sql%bulk_exceptions(i).error_code)
                                        );

                                        m_error_pos := sql%bulk_exceptions(i).error_index;
                                        dbms_output.put_line(
                                                'Content: ' || m_tab(m_error_pos).id || ' ' || m_tab(m_error_pos).n1
                                        );

                                end loop;
                        end;

                        when others then raise;

                end;

                exit when c1%notfound;  -- when fetch < limit

        end loop;
        close c1;
end;
/

The PL/SQL output with one bad row (2e6 – 20) looked like this:

Array element: 80 ORA-00001: unique constraint (.) violated
Content: 1999980 562332925640

Here are some critical session statistics for different tests in 11g:


No bad data, insert select
--------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             944
CPU used by this session                               944
DB time                                              1,712
redo entries                                     1,160,421
redo size                                      476,759,324
undo change vector size                        135,184,996

No bad data, PL/SQL loop
------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             990
CPU used by this session                               990
DB time                                              1,660
redo entries                                     1,168,022
redo size                                      478,337,320
undo change vector size                        135,709,056


Duplicate Key (2e6-20), insert select (with huge rollback)
----------------------------------------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                           1,441
CPU used by this session                             1,440
DB time                                              2,427
redo entries                                     2,227,412
redo size                                      638,505,684
undo change vector size                        134,958,012
rollback changes - undo records applied          1,049,559

Duplicate Key (2e6-20), PL/SQL loop - bad row reported
------------------------------------------------------
Name                                                 Value
----                                                 -----
CPU used when call started                             936
CPU used by this session                               936
DB time                                              1,570
redo entries                                     1,168,345
redo size                                      478,359,528
undo change vector size                        135,502,488
rollback changes - undo records applied                 74

Most of the difference between CPU time and DB time in all the tests was file I/O time (in my case largely checkpoint wait time, since I had small log files, but in larger systems it’s quite common to see a lot of time spent on db file sequential reads as index blocks are read for update). You can see that there’s some “unexpected” variation in CPU time – I wasn’t expecting the PL/SQL loop that failed after nearly 1M inserts to use less CPU than anything else – but the CPU numbers fluctuated by a few hundredths of a second across tests; this just happened to be particularly noticeable with the first one I did, so to some extent it was probably affected by background activity relating to space management, job queue processing and all the other virtual machines on the system.

Critically I think it’s fair to say that the differences in CPU timing are not hugely significant across a reasonably sized data set, and most importantly the redo and undo hardly vary at all between the successful SQL and both PL/SQL tests. The bulk processing PL/SQL approach doesn’t add a dramatic overhead – but it clearly does bypass the threat of a massive rollback.

Footnote:

You might want to argue the case for using basic SQL with the log errors clause. The code method is simple and it gives you a table of rows which have caused exceptions as the insert executed – and that may be sufficient for your purposes; but there’s a problem until you upgrade to 12c.

Here’s how I had to modify my test case to demonstrate the method:


begin
        dbms_errlog.create_error_log('t1');
end;
/

insert into t1 select * from t2
log errors
reject limit unlimited
;

The procedure call creates a table to hold the bad rows; by default its name will be err$_t1, and it will be a clone of the t1 table with changes to column types (which might be interesting if you’ve enabled 32K columns in 12c — to be tested) and a few extra columns:


SQL> desc err$_t1
 Name                          Null?    Type
 ----------------------------- -------- --------------------
 ORA_ERR_NUMBER$                        NUMBER
 ORA_ERR_MESG$                          VARCHAR2(2000)
 ORA_ERR_ROWID$                         ROWID
 ORA_ERR_OPTYP$                         VARCHAR2(2)
 ORA_ERR_TAG$                           VARCHAR2(2000)
 ID                                     VARCHAR2(4000)
 N1                                     VARCHAR2(4000)
 PADDING                                VARCHAR2(4000)

SQL> execute print_table('select * from err$_t1')
ORA_ERR_NUMBER$               : 1
ORA_ERR_MESG$                 : ORA-00001: unique constraint (TEST_USER.T1_I1) violated

ORA_ERR_ROWID$                :
ORA_ERR_OPTYP$                : I
ORA_ERR_TAG$                  :
ID                            : 1999980
N1                            : 562332925640
PADDING                       : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
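
As an aside, you don’t have to accept the default table name – dbms_errlog.create_error_log() lets you supply your own, and the log errors clause lets you name the target table and tag each statement. A sketch with made-up names and tag:

begin
        dbms_errlog.create_error_log(
                dml_table_name     => 't1',
                err_log_table_name => 't1_load_errors'
        );
end;
/

insert into t1 select * from t2
log errors into t1_load_errors ('nightly batch')
reject limit unlimited
;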

So what’s the problem with logging errors ? Here are the sets of session stats corresponding to the ones that I reported above for the SQL and PL/SQL options. The first set comes from running this test on 11.2.0.4, the second from 12.1.0.2:


11g results
===========
Name                                                 Value
----                                                 -----
CPU used when call started                           1,534
CPU used by this session                             1,534
DB time                                              2,816
redo entries                                     3,113,105
redo size                                      902,311,860
undo change vector size                        269,307,108

12c results
===========
Name                                                 Value
----                                                 -----
CPU used when call started                             801
CPU used by this session                               801
DB time                                              3,061  -- very long checkpoint waits !!
redo entries                                     1,143,342
redo size                                      492,615,336
undo change vector size                        135,087,044


The 12c stats are very similar to the stats from the perfect SQL run and the two PL/SQL runs – but if you look at the 11g stats you’ll see that they’re completely different from all the other stats. The number of redo entries (if nothing else) tells you that Oracle has dropped back from array processing to single row processing in order to be able to handle the error logging (1 million rows, with one redo entry each for the row, its PK index entry, and the unique key index entry).

Until 12c error logging is just row by row processing.

Footnote:

As far as I can tell, I first pointed out this “single row processing” aspect of the log errors option some time around December 2005.

Late Entry:

While looking for a posting about efficient updates I came across another of my postings that compares SQL with PL/SQL for updates – it’s worth a read.

 

May 23, 2016

Virtual Partitions

Filed under: 12c,CBO,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 1:16 pm BST May 23,2016

Here’s a story of (my) failure prompted by a recent OTN posting.

The OP wants to use composite partitioning based on two different date columns – the table should be partitioned by range on the first date and subpartitioned by month on the second date. Here’s the (slightly modified) table creation script he supplied:


rem
rem     Script: virtual_partition.sql
rem     Dated:  May 2016
rem

CREATE TABLE M_DTX
(
        R_ID    NUMBER(3),
        R_AMT   NUMBER(5),
        DATE1   DATE,
        DATE2   DATE,
        VC GENERATED ALWAYS AS (EXTRACT(MONTH FROM DATE2))
)
PARTITION BY RANGE (DATE1) interval (numtoyminterval(1,'MONTH'))
SUBPARTITION BY LIST (VC)
        SUBPARTITION TEMPLATE (
                SUBPARTITION M1 VALUES (1),
                SUBPARTITION M2 VALUES (2),
                SUBPARTITION M3 VALUES (3),
                SUBPARTITION M4 VALUES (4),
                SUBPARTITION M5 VALUES (5),
                SUBPARTITION M6 VALUES (6),
                SUBPARTITION M7 VALUES (7),
                SUBPARTITION M8 VALUES (8),
                SUBPARTITION M9 VALUES (9),
                SUBPARTITION M10 VALUES (10),
                SUBPARTITION M11 VALUES (11),
                SUBPARTITION M12 VALUES (12)
        )
        (
        PARTITION M_DTX_2015060100 VALUES LESS THAN (TO_DATE('2015-06-01 00:00:01', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
        )
;

There’s nothing particularly exciting about this – until you get to the query requirement – the user wants to query on date1 and date2, and doesn’t know about the virtual month column, e.g. (and, I know that there should be a to_date() or ANSI equivalent here):

SELECT * FROM m_dtx WHERE date1 = trunc(sysdate) AND date2 = '01-Jun-2016';

Now, as a general rule, you don’t expect partition elimination to occur unless the partitioning column appears with a predicate that makes elimination possible, so your first response to this query is that it could eliminate on date1, but can’t possibly eliminate on vc because vc isn’t in the where clause. However it’s possible that the partitioning code might be coded to recognise that the subpartition is on a virtual column that is derived from date2, so perhaps it could generate a new predicate before optimising, for example:

date2 = '01-Jun-2016'  => vc = 6

Unfortunately, your first response is correct – the optimizer doesn’t get this clever, and doesn’t do the sub-partition elimination. Here’s the execution plan from 12.1.0.2 for the sample query, followed by the execution plan when I explicitly add the predicate vc = 6.


SQL_ID  8vk1a05uv16mb, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy')

Plan hash value: 3104206240

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST ALL   |       |     1 |    57 |    15   (7)| 00:00:01 |     1 |    12 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))



SQL_ID  33q012bdhjrpn, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy') and vc
= 6

Plan hash value: 938710559

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |     6 |     6 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))


Note how the predicate vc = 6 doesn’t show up in the predicate section in either case, but the execution plan shows PARTITION LIST ALL at operation 2 when we omit the predicate and PARTITION LIST SINGLE when we include it (with suitable values also appearing for Pstart and Pstop). (The cost, by the way, is the cost of scanning a whole (range) partition whether or not the optimizer expects to restrict that scan to just one sub-partition.)

So the optimizer isn’t quite clever enough (yet). BUT … the optimizer can be very clever with constraints, combining constraints with predicates and applying transitive closure to produce new predicates – so maybe we could get the optimizer to do this if we helped it a little bit. Given the table definition supplied I’m going to assume that the date2 column is supposed to be non-null, so let’s add some truthful constraints/declarations to the table definition:


alter table m_dtx modify date2 not null;
alter table m_dtx modify vc  not null;
alter table m_dtx add constraint md_ck_vc check (vc = extract(month from date2));

Alas, this didn’t make any difference to the execution plan. But it did do something surprising to my attempts to load data into the table:


insert into m_dtx (r_id, r_amt, date1, date2)
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum, 1000),
        rownum,
        trunc(sysdate,'yyyy') + dbms_random.value(0,365),
        trunc(sysdate,'yyyy') + dbms_random.value(0,365)
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4
;

insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-01400: cannot insert NULL into (???)

So the array insert with the virtual column doesn’t like the NOT NULL constraint on the virtual column because vc is, presumably, still null when the constraint is checked (though there’s no problem with single row inserts with the values() clause – I wonder what happens with the PL/SQL “FORALL” clause) – so let’s remove the not null constraint on vc and see what happens.


insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-02290: check constraint (TEST_USER.MD_CK_VC) violated

Unsurprisingly, given the fact that Oracle didn’t like the not null constraint, the critical check constraint also fails. This, by the way, is odd because a check constraint should accept a row when the constraint doesn’t evaluate to FALSE, so (a) vc can’t have been evaluated at this point or the constraint would evaluate to TRUE – which is not FALSE, and (b) vc at this point can no longer be null or the constraint would evaluate to NULL – which is not FALSE: so what “value” has vc got that makes the constraint check return FALSE ?
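
(As a footnote to the earlier aside about FORALL – here’s a minimal sketch of the sort of test you could run, using scalar collections to avoid any version-dependent restrictions on bulk-binding records; I haven’t reported its results here.)

declare
        type t_num  is table of number index by pls_integer;
        type t_date is table of date   index by pls_integer;
        m_id    t_num;
        m_amt   t_num;
        m_d1    t_date;
        m_d2    t_date;
begin
        for i in 1..100 loop
                m_id(i)  := mod(i,1000);
                m_amt(i) := i;
                m_d1(i)  := trunc(sysdate,'yyyy') + dbms_random.value(0,365);
                m_d2(i)  := trunc(sysdate,'yyyy') + dbms_random.value(0,365);
        end loop;

        forall i in 1..m_id.count
                insert into m_dtx (r_id, r_amt, date1, date2)
                values (m_id(i), m_amt(i), m_d1(i), m_d2(i));

        commit;
end;
/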

Bottom line:

I can see some scope for an optimizer enhancement that tries to find eliminating predicates from virtual columns; and I think there’s a need for ensuring that we can safely add constraints to virtual columns – after all we might want to create an index on a virtual column and sometimes we need a NOT NULL declaration to ensure that an index-only execution path can be found. Unfortunately I have to end this blog without finding an immediate solution for the OP.

Despite this failure, though, there are cases (as I showed a couple of years ago) where the optimizer in 12c can get clever enough to recognize the connection between a queried date column and the virtual partitioning column based on that date column.

May 9, 2016

RI Locks

Filed under: deadlocks,Indexing,IOT,Locks,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 12:24 pm BST May 9,2016

RI = Referential Integrity: also known informally as parent/child integrity, and primary (or unique) key/foreign key checking.

I’m on a bit of a roll with things that I must have explained dozens or even hundreds of times in different environments without ever formally explaining them on my blog. Here’s a blog item I could have done with in response to a question that came up on the OTN database forum over the weekend.

What happens in the following scenario:


-- session 1

create table parent (
        id        number(8,0),
        constraint par_pk primary key(id)
);

create table child  (
        id_p      number(8,0) not null references parent,
        id_c      number(8,0) not null,
        constraint child_pk primary key(id_p, id_c)
)
;

insert into parent values(1);

-- session 2
insert into child values(1,1);

Since the parent row corresponding to the child row doesn’t (yet) seem to exist as far as session 2 is concerned you might expect session 2 to respond immediately with an error message like:

ERROR at line 1:
ORA-02291: integrity constraint (TEST_USER.SYS_C0017926) violated - parent key not found

In fact, although the end-user is not allowed to see the uncommitted parent row, the user’s process can see the uncommitted row and will wait until session 1 commits or rolls back – so if you examine v$lock for the current locks for the two sessions you’d see something like this:

  1  select  sid, type, id1, id2, lmode, request, ctime, block
  2  from    V$lock
  3  where   sid in (select sid from V$session where username = 'TEST_USER')
  4  and     type != 'AE'
  5  order by
  6*         sid, type desc
  7  /

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         3 TX     327709      12584          6          0        283          1
           TM     143734          0          2          0        283          0
           TM     143732          0          3          0        283          0

       250 TX     589829      12877          6          0        240          0
           TX     327709      12584          0          4        240          0
           TM     143734          0          3          0        240          0
           TM     143732          0          3          0        240          0


7 rows selected.

In the above, SID 250 is session 2: it’s holding a transaction lock (TX) in mode 6 because it has acquired an undo segment and has generated some undo; it’s also waiting for a transaction lock in mode 4 (share) and – checking id1 and id2 – we can see that the transaction table entry it’s waiting for is held by session 3 in mode 6 (and we also note that the lock held by session 3 is marked as a blocker).
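
(As a side note, the id1/id2 of a TX lock encode the undo segment number, transaction table slot and sequence – the xidusn, xidslot and xidsqn you would see in v$transaction – so you can decode the entry being waited for like this:)

select
        trunc(327709 / 65536)   xidusn,         -- undo segment number from id1
        mod(327709, 65536)      xidslot,        -- transaction table slot from id1
        12584                   xidsqn          -- id2 is the sequence (wrap) number
from
        dual
;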

If session 3 commits (thus releasing the transaction lock) session 250 will continue processing the insert; if session 3 rolls back session 250 will raise error ORA-02291 and roll back its insert statement. (Note: if this were a multi-statement transaction it would only be the insert into child that would be rolled back; that’s another one of those details that is important but often isn’t stated explicitly, leaving people believing that the entire transaction would be rolled back.)

Updates and deletes can produce the same effects. Imagine that we have just created the two tables, and then run the following:


-- session 1
insert into parent values(1);
commit;
delete from parent where id = 1;

-- session 2
insert into child values(1,1);

Again session 2 will wait for session 1 to commit or roll back. In this case if session 1 commits session 2 will raise Oracle error ORA-02291, if session 1 rolls back session 2 will continue with the insert.

Deadlocks

Whenever you can demonstrate a way of producing a wait chain you can also manage to produce a deadlock. Consider the following (starting, again, from empty tables);


-- (1) session 1
insert into parent values(1);

-- (2) session 2
insert into parent values(2);

-- (3) session 1
insert into child values(2,2);

-- (4) session 2
insert into child values(1,1);

Session 1 will start waiting for session 2 to commit (or roll back) at step 3, then session 2 will start to wait for session 1 at step 4 – with the result that session 1 will recognise the deadlock after about three seconds and roll back its last statement, raising exception ORA-00060 and dumping a trace file. (Note: session 1 will not, as many people think, roll back the entire transaction, it will only roll back the statement that allowed the deadlock to develop). Session 2 will still be waiting for session 1 to commit or roll back its insert into parent. Contrary to the popular claim, Oracle will not “resolve” the deadlock, it will simply break the deadlock, leaving one session waiting for the other session to respond appropriately to the deadlock error.

For reference, here’s the deadlock graph (from a 12c trace file) produced by session 1 (SID = 3) for this demo:


Deadlock graph:
                                          ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name                             process session holds waits  process session holds waits
TX-00010017-000026C7-00000000-00000000          6       3     X             33     250           S
TX-000A000D-000026F8-00000000-00000000         33     250     X              6       3           S

session 3: DID 0001-0006-00000004       session 250: DID 0001-0021-00000041
session 250: DID 0001-0021-00000041     session 3: DID 0001-0006-00000004

Rows waited on:
  Session 3: no row
  Session 250: no row

When you see a deadlock graph with TX waits of type S (share, mode 4) it’s a very good bet that the wait has something to do with indexes – which may mean referential integrity as discussed here, but may mean collisions on primary keys, and may mean something to do with simple collisions on index-organized tables. You’ll notice that the “Rows waited on:” section shows no row – unfortunately in earlier versions of Oracle you may find a spurious row entry here because the wait information from some other (block) wait has been left in the relevant columns in v$session.
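
For a live system, a minimal sketch that picks out this pattern (sessions waiting on a TX enqueue in mode 4, matched to the holder of that transaction) is a simple self-join on v$lock:

select  w.sid   waiting_sid,
        h.sid   holding_sid,
        w.id1, w.id2
from    v$lock  w,
        v$lock  h
where   w.type    = 'TX'
and     w.request = 4
and     h.type    = 'TX'
and     h.lmode   > 0
and     h.id1     = w.id1
and     h.id2     = w.id2
;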

May 3, 2016

Debugging

Filed under: CBO,compression,Execution plans,Infrastructure,Oracle,Uncategorized — Jonathan Lewis @ 8:11 am BST May 3,2016

The OTN database forum supplied a little puzzle a few days ago – starting with the old, old, question: “Why is the plan with the higher cost taking less time to run?”

The standard (usually correct) answer to this question is that the optimizer doesn’t know all it needs to know to predict what’s going to happen, and even if it had perfect information about your data the model used isn’t perfect anyway. This was the correct answer in this case, but with a little twist in the tail that made it a little more entertaining. Here’s the query, with the two execution plans and the execution statistics from autotrace:


SELECT  /* INDEX(D XPKCLIENT_ACCOUNT) */ 
        E.ECID,A.acct_nb
FROM    
        client_account d, 
        client         e, 
        account        a
where
        A.acct_nb ='00000000000000722616216'
AND     D.CLNT_ID = E.CLNT_ID
AND     D.ACCT_ID=A.ACCT_ID;

Plan (A) with a full tablescan of client_account – cost 808, runtime 1.38 seconds, buffer gets 17,955


-------------------------------------------------------------------------------------------------
| Id | Operation                      | Name           | Rows  | Bytes  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                |     1 |    59  |   808 (14) | 00:00:10 |
|  1 |  NESTED LOOPS                  |                |     1 |    59  |   808 (14) | 00:00:10 |
|  2 |   NESTED LOOPS                 |                |     1 |    59  |   808 (14) | 00:00:10 |
|* 3 |    HASH JOIN                   |                |     1 |    42  |   806 (14) | 00:00:10 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT        |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT    |     1 |        |     4  (0) | 00:00:01 |
|  6 |     TABLE ACCESS FULL          | CLIENT_ACCOUNT |  9479K|   108M |   763 (10) | 00:00:09 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT      |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT         |     1 |    17  |     2  (0) | 00:00:01 |
-------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     0  recursive calls
     0  db block gets
 17955  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Plan (B) with an index fast full scan on a client_account index – cost 1,190, runtime 0.86 seconds, buffer gets 28,696


----------------------------------------------------------------------------------------------------
| Id | Operation                      | Name              | Rows  | Bytes  | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  1 |  NESTED LOOPS                  |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  2 |   NESTED LOOPS                 |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|* 3 |    HASH JOIN                   |                   |     1 |    42  |  1188  (8) | 00:00:14 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT           |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT       |     1 |        |     4  (0) | 00:00:01 |
|  6 |     INDEX FAST FULL SCAN       | XPKCLIENT_ACCOUNT | 9479K |   108M |  1145  (5) | 00:00:13 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT         |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT            |     1 |    17  |     2  (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     0  recursive calls
     0  db block gets
 28696  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Note, particularly, that the two plans are the same apart from operation 6 where a full tablescan changes to an index fast full scan, predicting the same number of rows but with an increase of 50% in the cost; the increase in cost is matched by an increase in the reported workload – a 60% increase in the number of consistent reads and no disk reads or recursive SQL in either case. Yet the execution time (on multiple repeated executions) dropped by nearly 40%.

So what’s interesting and informative about the plan ?

The cost of a tablescan or an index fast full scan is easy to calculate; broadly speaking it’s “size of object” / “multiblock read count” * k, where k is some constant relating to the hardware capability. The costs in these plans and the autotrace statistics seem to be telling us that the index is bigger than the table, while the actual run times seem to be telling us that the index has to be smaller than the table.

It’s easy for an index to be bigger than its underlying table, of course; for example, if this table consisted of nothing but two short columns the index could easily be bigger (even after a rebuild) because it would be two short columns plus a rowid. If that were the case here, though, we would expect the time to fast full scan the index to be higher than the time to scan the table.

So two thoughts crossed my mind as I looked at operation 6:

  • Mixing block sizes in a database really messes up the optimizer costing, particularly for tablescans and index fast full scans. Maybe the table had been built in a tablespace using 32KB  blocks while the index had been built in a tablespace using the more common 8KB blocksize – I didn’t want to start working out the arithmetic but that might be just enough to produce the contradiction.
  • Maybe the table was both bigger AND smaller than the index – bigger because it held more data, smaller because it had been compressed. If so then the difference in run-time would be the overhead of decompressing the rows before projecting and comparing the data.

Conveniently the OP has included an extract from the 10053 trace:


Table Stats::
  Table: CLIENT_ACCOUNT  Alias:  D
    #Rows: 9479811  #Blks:  18110  AvgRowLen:  71.00  ChainCnt:  0.00
  Column (#1): CLNT_ID(
    AvgLen: 6 NDV: 1261035 Nulls: 0 Density: 0.000001 Min: 0 Max: 4244786
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 239
  Column (#2): ACCT_ID(
    AvgLen: 6 NDV: 9479811 Nulls: 0 Density: 0.000000 Min: 1 Max: 22028568
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255

Index Stats::
  Index: XPKCLIENT_ACCOUNT  Col#: 1 2
    LVLS: 2  #LB: 28543  #DK: 9479811  LB/K: 1.00  DB/K: 1.00  CLUF: 1809449.00

Note that the index is called xpkclient_account – which suggests “primary key” – and the number of distinct keys in the index (#DK) matches the number of rows in the table (#Rows). The index and table stats seem to be consistent so we’re not looking at a problem of bad statistics.

Now for some simple (ballpark) arithmetic: for the table we can check whether “rows * average row length / 8K = blocks”. We can read the numbers directly from the trace file:  9,500,000 * 71 / 8,000 = 84,000.  It’s wrong by a factor of about 4 (so maybe it’s a 32K block, and maybe I could rule out that possibility by including more detail in the arithmetic – like allowing properly for the block header, row overheads, pctfree etc).

For the index – we believe it’s the primary key, so we know the number of rows in the index – it’s the same as the number of distinct keys. As for the length of an index entry, we have the index definition (col#: 1 2) and we happen to have the column stats about those columns so we know their average length. Allowing for the rowid and length bytes we can say that the average index entry is (6 +1) + (6 + 1) + 6 = 20 bytes.  So the number of leaf blocks should be roughly 9,500,000 * 20 / 8,000 = 23,750. That’s close enough given the reported 28,543 and the fact that I haven’t bothered to worry about row overheads, block overheads and pctfree.
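
If you wanted to script that sanity check rather than do it by hand, here’s a minimal sketch (assuming an 8KB block size, using the 20-byte index entry estimate from above, and ignoring block overheads, row overheads and pctfree – the object names are the ones reported in the OP’s trace):

select  table_name, num_rows, avg_row_len, blocks,
        round(num_rows * avg_row_len / 8192)    est_blocks
from    user_tables
where   table_name = 'CLIENT_ACCOUNT'
;

select  index_name, num_rows, leaf_blocks,
        round(num_rows * 20 / 8192)             est_leaf_blocks
from    user_indexes
where   index_name = 'XPKCLIENT_ACCOUNT'
;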

The arithmetic provides an obvious guess – which turned out to be correct: the table is compressed, the index isn’t. The optimizer hasn’t allowed for the CPU cost of decompressing the compressed rows, so the time required to decompress 9.5M rows doesn’t appear in the execution plan.
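
A quick way to confirm (or refute) that guess, if you can get at the system, is to check the dictionary – a minimal sketch:

select  table_name, compression, compress_for
from    user_tables
where   table_name = 'CLIENT_ACCOUNT'
;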

Footnote.

Looking at the column stats, it looks like there are roughly 8 acct_ids for each clnt_id, so it would probably be sensible to compress the primary key index (clnt_id, acct_id) on the first column as this would probably reduce the size of the index by about 20%.
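
Here’s a sketch of how you might check and apply that compression (note that “validate structure” takes a brief lock, so pick a quiet moment; the index name is the one from the trace):

analyze index xpkclient_account validate structure;

select  opt_cmpr_count, opt_cmpr_pctsave
from    index_stats
;

alter index xpkclient_account rebuild compress 1;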

Better still – the client_account table has very short rows – it looks like a typical intersection table with a little extra data carried. Perhaps this is a table that should be an index-organized table with no overflow. It looks like there should also be an index (acct_id, clnt_id) on this table to optimise the path from account to client, and this would become a secondary index – interestingly being one of those rare cases where the secondary index on an IOT might actually be a tiny bit smaller than the equivalent index on a heap table because (in recent versions of Oracle) primary key columns that are included in the secondary key are not repeated in the index structure. (It’s a little strange that this index doesn’t seem to exist already – you might have expected it to be there given the OP’s query, and given that it’s an “obvious” requirement as an index to protect the foreign key.)
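
For illustration only – the full column list isn’t visible in the post, so the “extra data” column here is hypothetical – the IOT version and its secondary index might look something like this:

create table client_account_iot (
        clnt_id         number          not null,
        acct_id         number          not null,
        extra_data      varchar2(30),                   -- hypothetical "little extra data"
        constraint ca_iot_pk primary key (clnt_id, acct_id)
)
organization index
;

create index ca_iot_acct_idx on client_account_iot(acct_id, clnt_id);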

The only argument against the IOT strategy is that the table clearly compresses very well as a heap table, so a compressed heap table plus two B-tree indexes might be more cost-effective than an IOT with a single secondary index.

 

April 27, 2016

Stats History

Filed under: Histograms,Infrastructure,Oracle,Statistics — Jonathan Lewis @ 1:09 pm BST Apr 27,2016

From time to time we see a complaint on OTN about the stats history tables being the largest objects in the SYSAUX tablespace and growing very quickly, with requests about how to work around the (perceived) threat. The quick answer is: if you need to save space then stop holding on to the history for so long, and then clean up the mess left by the history that you have captured. On top of that you could stop gathering so many histograms because you probably don’t need them, they often introduce instability to your execution plans, and they are often the largest single component of the history (unless you are using incremental stats on partitioned objects***).

For many databases it’s the histogram history – using the default Oracle automatic stats collection job – that takes the most space, here’s a sample query that the sys user can run to get some idea of how significant this history can be:


SQL> select table_name , blocks from user_tables where table_name like 'WRI$_OPTSTAT%HISTORY' order by blocks;

TABLE_NAME                           BLOCKS
-------------------------------- ----------
WRI$_OPTSTAT_AUX_HISTORY                 80
WRI$_OPTSTAT_TAB_HISTORY                244
WRI$_OPTSTAT_IND_HISTORY                622
WRI$_OPTSTAT_HISTHEAD_HISTORY          1378
WRI$_OPTSTAT_HISTGRM_HISTORY           2764

5 rows selected.

As you can see the “histhead” and “histgrm” tables (histogram header and histogram detail) are the largest stats history tables in this (admittedly very small) database.

Oracle gives us a couple of calls in the dbms_stats package to check and change the history setting, demonstrated as follows:


SQL> select dbms_stats.get_stats_history_retention from dual;

GET_STATS_HISTORY_RETENTION
---------------------------
                         31

1 row selected.

SQL> execute dbms_stats.alter_stats_history_retention(7)

PL/SQL procedure successfully completed.

SQL> select dbms_stats.get_stats_history_retention from dual;

GET_STATS_HISTORY_RETENTION
---------------------------
                          7

1 row selected.

Changing the retention period doesn’t reclaim any space, of course – it simply tells Oracle how much of the existing history to eliminate in the next “clean-up” cycle. This clean-up is controlled by a “savtime” column in each table:

SQL> select table_name from user_tab_columns where column_name = 'SAVTIME' and table_name like 'WRI$_OPTSTAT%HISTORY';

TABLE_NAME
--------------------------------
WRI$_OPTSTAT_AUX_HISTORY
WRI$_OPTSTAT_HISTGRM_HISTORY
WRI$_OPTSTAT_HISTHEAD_HISTORY
WRI$_OPTSTAT_IND_HISTORY
WRI$_OPTSTAT_TAB_HISTORY

5 rows selected.

If all you wanted to do was stop the tables from growing further you’ve probably done all you need to do. From this point onwards the automatic Oracle job will start deleting the oldest saved stats and re-using space in the existing table. But you may want to be a little more aggressive about tidying things up, and Oracle gives you a procedure to do this – and it might be sensible to use this procedure anyway at a time of your own choosing:


SQL> execute dbms_stats.purge_stats(sysdate - 7);

Basically this issues a series of delete statements (including a delete on the “stats operation log (wri$_optstat_opr)” table that I haven’t previously mentioned) – here’s an extract from an 11g trace file of a call to this procedure (output from a simple grep command):


delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_tab_history          where savtime < :1 and rownum <= NVL(:2, rownum)
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_ind_history h        where savtime < :1 and rownum <= NVL(:2, rownum)
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_aux_history          where savtime < :1 and rownum <= NVL(:2, rownum)
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_opr                  where start_time < :1 and rownum <= NVL(:2, rownum)
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histhead_history     where savtime < :1 and rownum <= NVL(:2, rownum)
delete /*+ dynamic_sampling(4) */ from sys.wri$_optstat_histgrm_history      where savtime < :1 and rownum <= NVL(:2, rownum)

Two points to consider here: although the appearance of the rownum clause suggests that there’s a damage limitation strategy built into the code, I only saw one commit after the entire delete cycle, and I never saw a limiting bind value being supplied. If you’ve got a large database with very large history tables you might want to delete one day (or even just a few hours) at a time, as sketched below. The potential for a very long, slow, delete is also why you might want to do a manual purge at a time of your choosing rather than letting Oracle do the whole thing on auto-pilot during some overnight operation.
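
Here’s a minimal sketch of that idea, assuming you’re currently holding 31 days of history and aiming at a 7-day retention – it works from the oldest end, one day per call, so each call is a relatively small delete cycle:

begin
        for i in reverse 8 .. 31 loop
                dbms_stats.purge_stats(sysdate - i);
        end loop;
end;
/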

Secondly, even though you may have deleted a lot of data from these tables you still haven’t reclaimed the space – so if you’re trying to find space in the sysaux tablespace you’re going to have to rebuild the tables and their indexes. Unfortunately a quick check of v$sysaux_occupants tells us that there is no official “move” procedure:


SQL> execute print_table('select occupant_desc, move_procedure, move_procedure_desc from v$sysaux_occupants where occupant_name = ''SM/OPTSTAT''')

OCCUPANT_DESC                 : Server Manageability - Optimizer Statistics History
MOVE_PROCEDURE                :
MOVE_PROCEDURE_DESC           : *** MOVE PROCEDURE NOT APPLICABLE ***

So we have to run a series of explicit calls to alter table move and alter index rebuild. (Preferably not when anyone is trying to gather stats on an object). Coding that up is left as an exercise to the reader, but it may be best to move the tables in the order of smallest table first, rebuilding indexes as you go.
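
As a starting point for that exercise, here’s a sketch that only generates the statements (these are SYS-owned tables, so review the output, test it somewhere safe, and check with Oracle support before running it in production):

select  'alter table sys.' || table_name || ' move tablespace sysaux;'  ddl
from    dba_tables
where   owner = 'SYS'
and     table_name like 'WRI$_OPTSTAT%HISTORY'
union all
select  'alter index sys.' || index_name || ' rebuild tablespace sysaux;'
from    dba_indexes
where   owner = 'SYS'
and     table_name like 'WRI$_OPTSTAT%HISTORY'
;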

Footnote:

*** Incremental stats on partitioned objects: I tend to assume that sites which use partitioning are creating very large databases and have probably paid a lot more attention to the details of how to use statistics effectively and successfully; that’s why this note is aimed at sites which don’t use partitioning and therefore think that the space taken up by the stats history is significant.

March 11, 2016

Wrong Results ?

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 9:18 am BST Mar 11,2016

I gather that journalistic style dictates that if the headline is a question then the answer is no. So, following on from a discussion of possible side effects of partition exchange, let’s look at an example which doesn’t involve partitions.  I’ve got a schema that holds nothing but two small, simple heap tables, parent and child (with declared primary keys and the obvious referential integrity constraint), and I run a couple of very similar queries that produce remarkably different results:


select
        par.id      parent_id,
        chi.id      child_id,
        chi.name    child_name
from
        parent  par,
        child   chi
where
        chi.id_p = par.id
order by
        par.id, chi.id
;

 PARENT_ID   CHILD_ID CHILD_NAME
---------- ---------- ----------
         1          1 Simon
         1          2 Sally
         2          1 Janet
         2          2 John
         3          1 Orphan

5 rows selected.

Having got this far with my first query I’ve decided to add the parent name to the report:


select
        par.id      parent_id,
        par.name    parent_name,
        chi.id      child_id,
        chi.name    child_name
from
        parent  par,
        child   chi
where
        chi.id_p = par.id
order by
        par.id, chi.id
;

 PARENT_ID PARENT_NAM   CHILD_ID CHILD_NAME
---------- ---------- ---------- ----------
         1 Smith2              1 Simon
         1 Smith               1 Simon
         1 Smith2              2 Sally
         1 Smith               2 Sally
         2 Jones               1 Janet
         2 Jones               2 John

6 rows selected.

How could adding a column to the select list result in one child row disappearing and two child rows being duplicated; and is this a bug ?

To avoid any confusion, here’s the complete script I used for creating the schema owner, in 11.2.0.4, with no extra privileges granted to PUBLIC:


create user u1
        identified by u1
        default tablespace test_8k
        quota unlimited on test_8k
;

grant
        create session,
        create table
to
        u1
;


Update

It didn’t take long for a couple of people to suggest that the oddity was the consequence of constraints that had not been enabled and validated 100% of the time, but the suggestions offered were a little more convoluted than necessary. Here’s the code I ran from my brand new account before running the two select statements:


create table parent (
        id      number(4),
        name    varchar2(10),
        constraint par_pk primary key (id)
        rely disable novalidate
)
;

create table child(
        id_p    number(4)
                constraint chi_fk_par
                references parent
                on delete cascade
                rely disable novalidate,
        id      number(4),
        name    varchar2(10),
        constraint chi_pk primary key (id_p, id)
                rely disable novalidate
)
;

insert into parent values (1,'Smith');
insert into parent values (1,'Smith2');
insert into parent values (2,'Jones');

insert into child values(1,1,'Simon');
insert into child values(1,2,'Sally');

insert into child values(2,1,'Janet');
insert into child values(2,2,'John');

insert into child values(3,1,'Orphan');

commit;

begin
        dbms_stats.gather_table_stats(user,'child');
        dbms_stats.gather_table_stats(user,'parent');
end;
/


In a typical data warehouse frame of mind I’ve added plenty of constraints, but left them all disabled and novalidated while telling Oracle to rely on them for optimisation strategies. This means all sorts of incorrect data could get into the tables, with all sorts of unexpected side effects on reporting. The example above shows duplicates on primary keys (and if you checked the table definition you’d find that the primary key columns were nullable as well) and child rows with no parent key.

In fact 11g and 12c behave differently – the appearance of the Orphan row in the first sample query is due, as Chris_cc pointed out in the first comment, to the optimizer deciding that it could use join elimination because it was joining to a single-column primary key without selecting any other columns from the referenced table. In 12c the optimizer doesn’t use join elimination for this query, so both queries have the same (duplicated) output.
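
If you want to see the join elimination for yourself, a minimal check is to run the first query and then pull its plan from memory – in 11g you should find that the parent table doesn’t appear in the plan at all (while 12c, as noted above, still visits it):

set serveroutput off

select
        par.id      parent_id,
        chi.id      child_id,
        chi.name    child_name
from
        parent  par,
        child   chi
where
        chi.id_p = par.id
order by
        par.id, chi.id
;

select * from table(dbms_xplan.display_cursor(format => 'basic'));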

Update:

Make sure you read the articles linked to by Dani Schneider’s comment below, and note especially the impact of the query_rewrite_integrity parameter.

February 26, 2016

Partition Limit

Filed under: Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 2:19 pm BST Feb 26,2016

A tweet from Connor McDonald earlier on today reminded me of a problem I managed to pre-empt a couple of years ago.

Partitioning is wonderful if done properly but it’s easy to get a little carried away and really foul things up. So company “X” decided they were going to use range/hash composite partitioning and, to minimise contention and (possibly) reduce the indexing overheads, they decided that they would create daily partitions with 1,024 subpartitions.

This, in testing, worked very well, and the idea of daily/1024 didn’t seem too extreme given the huge volume of data they were expecting to handle. There was, however, something they forgot to test; and I can demonstrate this on 12c with an interval/hash partitioned table:


rem     Script:         pt_interval_hash_2.sql
rem     Author:         Jonathan Lewis

column start_date new_value m_start_date
select to_char(trunc(sysdate),'dd-mon-yyyy') start_date from dual;

create table interval_hash (
        d1      date            not null,
        n1      number(10,0)    not null,
        v1      varchar2(100)
)
tablespace test_8k_assm_2
partition by range (d1) interval (numtodsinterval(1,'DAY'))
subpartition by hash (n1) subpartitions 1024
        (
                partition p0 values less than (to_date('&m_start_date','dd-mon-yyyy'))
        )
;

insert into interval_hash values (trunc(sysdate) + 1023, 1, rpad('x',100));
insert into interval_hash values (trunc(sysdate) + 1022, 1, rpad('x',100));
insert into interval_hash values (trunc(sysdate) + 1021, 1, rpad('x',100));
commit;

I’ve started the partition range on trunc(sysdate), so it’s easy to pick a few very important future dates for insertion.

So what do you think is going to happen as I try to insert those three rows ?


insert into interval_hash values (trunc(sysdate) + 1023, 1, rpad('x',100))
            *
ERROR at line 1:
ORA-14300: partitioning key maps to a partition outside maximum permitted number of partitions


insert into interval_hash values (trunc(sysdate) + 1022, 1, rpad('x',100))
*
ERROR at line 1:
ORA-14299: total number of partitions/subpartitions exceeds the maximum limit


1 row created.

One of them succeeds – the other two fail with DIFFERENT error messages.

The limit on the number of segments allowed for a single table (or index) is 2^20 – 1 (i.e. 1,048,575); and with 1,024 hash subpartitions I’m only allowed 1,023 partitions. I think that we get two different error messages because Oracle can work out that the first subpartition that it would create for sysdate + 1023 will exceed the segment limit, but it doesn’t “realise” that it’s going to go beyond the limit with sysdate + 1022 until it’s too late and it’s started working on creating the segments.

In fact, in its attempt to create the sysdate + 1022 partition Oracle actually temporarily allocated 1GB of space (I had a 1MB uniform extent size for the tablespace) and updated various bits of the data dictionary generating a total of 13,000 redo entries and 2.3MB of redo – for a single row insert!

So – if you want 1,024 hash subpartitions per partition you’re only allowed 1,023 partitions, which is a little under three years of daily partitions. If you want a longer history you need fewer hash subpartitions (and don’t forget to stick to a power of 2) or longer durations for each partition. If three years is sufficient, though, make sure you test how long it takes to drop old partitions from the start of a table that is getting close to the limit.
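
If you want a rough check of how close your own composite partitioned tables are getting to the limit, counting subpartitions per table (each subpartition is a potential segment) is a reasonable sketch:

select  table_owner, table_name, count(*)       subpartition_count
from    dba_tab_subpartitions
group by
        table_owner, table_name
order by
        count(*) desc
;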

 

February 11, 2016

Quiz Night

Filed under: 12c,Infrastructure,LOBs,Oracle,Statistics — Jonathan Lewis @ 3:28 pm BST Feb 11,2016

I was setting up a few tests on a copy of 12.1.0.2 recently when I made a mistake creating the table – I forgot to put in a couple of CAST() calls in the select list, so I just patched things up with a couple of “modify column” commands. Since I was planning to smash the table in all sorts of ways and it had taken me several minutes to create the data set (10 million rows) I decided to create a clean copy of the data so that I could just drop the original table and copy back the clean version – and after I’d done this I noticed something a little odd.

Here’s the code (cut down to just 10,000 rows), with a little output:

--
--      Script:  12c_vc32767_b.sql
--

create table t1 (
        id      not null,
        date_open,
        date_closed,
        deal_type,
        client_ref,
        small_vc,
        padding
)
nologging
as
select
        rownum                                          id,
        trunc(
                add_months(sysdate, -120) +
                        (rownum)* 3652 / 1e7
        )                                               date_open,
        trunc(
                add_months(
                        add_months(sysdate, -120) +
                                (rownum) * 3652 / 1e7,
                        12 * trunc(dbms_random.value(1,6))
                )
        )                                               date_closed,
        dbms_random.string('U',1)                       deal_type,
        dbms_random.string('U',4)                       client_ref,
        lpad(rownum,10)                                 vc_small,
        rpad('x',100,'x')                               vc_padding
from
        dual
connect by
        level <= 10000
;
alter table t1 modify deal_type varchar2(1);
alter table t1 modify client_ref varchar2(4);

create table t2 nologging as select * from t1;

begin
        dbms_stats.gather_table_stats(
                user,
                't1',
                method_opt=>'for all columns size 1'
        );
        dbms_stats.gather_table_stats(
                user,
                't2',
                method_opt=>'for all columns size 1'
        );
end;
/

select table_name, num_rows, blocks, avg_row_len from user_tables;

TABLE_NAME             NUM_ROWS     BLOCKS AVG_ROW_LEN
-------------------- ---------- ---------- -----------
T1                        10000        304         139
T2                        10000        218         139

There’s no hidden code – this is the code I ran, and the tables both went into the same tablespace – yet one table has used nearly 40% more blocks than the other on a simple “create as select”, even though the row lengths and row counts are the same!

When I first did this test it just so happened that the only place in the database I had to make the t2 copy was in a tablespace using freelist management, and I’d created the original table in a tablespace using ASSM, so I wasted a little time examining the ASSM space management (bitmap) blocks before I realised what had happened.

Suggestions about what has happened, and how to confirm your hypothesis, invited in the comments.

Update (following morning)

The explanation is in the comments: I had enabled extended (32,767 byte) varchar2().  My calls to dbms_random.string() had returned character values, and the return type defaults to the maximum character length, which meant they were declared as varchar2(32767) and those are implemented as (unhackable) CLOBs in 12.1.0.2.

When I modified the table to change the varchar2(32767) to shorter lengths Oracle let me get away with it because none of the actual stored values at that time was longer than my target declaration – but it didn’t physically change the stored column values to “short” varchar2(), it left them as CLOBs, and a “short” CLOB includes a LOB Locator which starts out at about 30 bytes.

When I created t2 from t1 Oracle first copied the (short) column definitions, and then copied the data, reading and converting the CLOBs to normal varchar2() storage, and that explains why t2 needed fewer blocks.  The avg_row_len for the two tables matched because the code to gather stats simply applies the sys_op_opnsize() function to the varchar2() defined in t1, and doesn’t give you any indication about the LOB locator.

(If you use the extremely deprecated analyze command to gather stats on the two tables you’ll find that the avg_row_len of table t1 then allows for the LOB locator size.)

“Workaround:”

I should have used cast(dbms_random.string('U',4) as varchar2(4)) and the column definition would have been a proper varchar2(4) from the start.  (As noted in a comment from Ivica Arsov below, substr(…, 1, 4) would also have worked.)  Of course I should have used cast() for the numerics as well so that I could declare them as things like number(8,0) rather than the slightly dangerous “number”.

Note:  I’ve put “workaround” in quotes because it’s not really a workaround to a problem – it’s just an example of doing it right.
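
For reference, here’s a fragment of the select list done the right way – the numeric cast is just an example of the same principle applied to the id column:

select
        cast(rownum as number(8,0))                             id,
        cast(dbms_random.string('U',1) as varchar2(1))          deal_type,
        cast(dbms_random.string('U',4) as varchar2(4))          client_ref
from
        dual
connect by
        level <= 10
;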

Footnote:

Here’s a symbolic dump of a single character (‘T’) being stored as a varchar2(32767):


LOB
Locator:
  Length:        84(31)
  Version:        1
  Byte Length:    1
  LobID: 00.00.00.01.00.00.01.6b.ba.d7
  Flags[ 0x01 0x0c 0x00 0x80 ]:
    Type: BLOB
    Storage: SecureFile
    Characterset Format: IMPLICIT
    Partitioned Table: No
    Options: ReadWrite
  SecureFile Header:
    Length:   11
    Old Flag: 0x48 [ DataInRow SecureFile ]
    Flag 0:   0x90 [ INODE Valid ]
    Layers:
      Lengths Array: INODE:5
      INODE:
        00 00 01 01 54

Note the 84(31) at line 3: this is a reminder of how big a LOB locator could get for an out of line LOB, compared to the current size of the LOB locator. Rows that hold out of line LOBs can actually be much longer than avg_row_len tells you.

Addendum

Another side effect of enabling extended varchar2() types is the surprise you get when you use error logging in SQL (i.e. syntax like: “insert into t1 select … log errors”). When you create the “clone” table for the log it changes every column from the original into a varchar2(32767) e.g.:

desc target
 Name                                                                     Null?    Type
 ------------------------------------------------------------------------ -------- -------------------------------------------------
 ID                                                                                NUMBER
 N1                                                                                NUMBER
 N2                                                                                NUMBER
 SMALL_VC                                                                          VARCHAR2(10)
 PADDING                                                                           VARCHAR2(100)


begin
        dbms_errlog.create_error_log('TARGET');
end;
/

desc err$_TARGET
 Name                                                                     Null?    Type
 ------------------------------------------------------------------------ -------- -------------------------------------------------
 ORA_ERR_NUMBER$                                                                   NUMBER
 ORA_ERR_MESG$                                                                     VARCHAR2(2000)
 ORA_ERR_ROWID$                                                                    ROWID
 ORA_ERR_OPTYP$                                                                    VARCHAR2(2)
 ORA_ERR_TAG$                                                                      VARCHAR2(2000)
 ID                                                                                VARCHAR2(32767)
 N1                                                                                VARCHAR2(32767)
 N2                                                                                VARCHAR2(32767)
 SMALL_VC                                                                          VARCHAR2(32767)
 PADDING                                                                           VARCHAR2(32767)

You probably hope that you’re not going to see many rows inserted into the error log table, so the extra space taken up in the table by error rows probably won’t be a problem – but the fact that you get two segments (the LOB and the LOBINDEX) created for every column in the original table might cause a problem – even if they are completely empty.

February 2, 2016

Partitioned Bitmap Join

Filed under: bitmaps,Bugs,Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 8:32 am BST Feb 2,2016

If you don’t want to read the story, the summary for this article is:

If you create bitmap join indexes on a partitioned table and you use partition exchanges to load data into the table then make sure you create the bitmap join indexes on the loading tables in exactly the same order as you created them on the partitioned table or the exchange will fail with the (truthful but not quite complete) error: ORA-14098: index mismatch for tables in ALTER TABLE EXCHANGE PARTITION.

My story starts with this OTN posting from John Hall where he found after a year of successful batch loading one of his partition exchanges was raising error 14098. After an exchange of ideas, user rp0428 came up with a query against sys.jijoin$ (one of the tables behind bitmap join indexes) that allowed John Hall to see that the indexes on the exchange table had been created in a different order from that of the partitioned table. I did a quick test to see if this might be relevant (it shouldn’t be, it isn’t with “normal” indexes or function-based indexes, or virtual columns) and didn’t manage to reproduce the problem with two dimension tables and two bitmap join indexes.

Fortunately John didn’t take my word for it and tested the idea on a clone of the production system – and found that the order of creation did matter. His system, however, had 9 dimension tables and 33 bitmap join indexes – which shouldn’t have made any difference in principle, but maybe it was something to do with having several indexes on the same table, maybe it was something to do with having far more tables or far more indexes than I had. So I built a larger test case with 6 dimension tables and six indexes per table – and reproduced the problem.

Then I started cutting back to see where the problem appeared, and found that all it took was one dimension with two indexes, or two dimensions with one index each – whatever I had done in my “quick test” I had clearly done it too quickly and done something wrong. (Unfortunately I had overwritten most of the code from the original quick test while building the larger test, so I couldn’t go back and see where the error was.)

Here, then, is the minimal test case that I finally ran to demonstrate that switching the order of index creation on the exchange table causes the exchange to fail:


drop table pt_range purge;
drop table t1 purge;
drop table dim_1 purge;
drop table dim_2 purge;

prompt  =================
prompt  Partitioned table
prompt  =================

create table pt_range (
        id,
        grp1,
        grp2,
        padding
)
nologging
partition by range(id) (
        partition p2001 values less than (2001),
        partition p4001 values less than (4001),
        partition p6001 values less than (6001),
        partition p8001 values less than (8001)
)
as
select
        rownum                          id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 8000
;

prompt  ================================================
prompt  Exchange table - loaded to match partition p8001
prompt  ================================================

alter table pt_range 
add constraint pt_pk primary key (id) using index local;

create table t1 (
        id,
        grp1,
        grp2,
        padding
)
as 
select
        rownum + 6000                   id,
        trunc(rownum/100)               grp1,
        trunc(rownum/100)               grp2,
        rpad('x',100)                   padding
from
        all_objects
where 
        rownum <= 2000
;

alter table t1
add constraint t1_pk primary key (id);

execute dbms_stats.gather_table_stats(user,'pt_range')
execute dbms_stats.gather_table_stats(user,'t1')

prompt  ================
prompt  dimension tables
prompt  ================

create table dim_1 
as 
select distinct 
        grp1, 
        cast('A'||grp1 as varchar2(3)) agrp1,
        cast('B'||grp1 as varchar2(3)) bgrp1
from
        t1
;

create table dim_2 as select * from dim_1;

prompt  ===============================
prompt  Primary keys required for BMJIs
prompt  ===============================

alter table dim_1 add constraint d1_pk primary key (grp1);
alter table dim_2 add constraint d2_pk primary key (grp1);

execute dbms_stats.gather_table_stats(user,'dim_1')
execute dbms_stats.gather_table_stats(user,'dim_2')

prompt  ============================
prompt  Creating bitmap join indexes
prompt  ============================

create bitmap index pt_1a on pt_range(d1.agrp1) from pt_range pt, dim_1 d1 where d1.grp1 = pt.grp1 local ;
create bitmap index pt_2a on pt_range(d2.agrp1) from pt_range pt, dim_2 d2 where d2.grp1 = pt.grp2 local ;

prompt  ====================================================
prompt  Pick your index creation order on the exchange table
prompt  ====================================================

create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;
create bitmap index t1_2a on t1(d2.agrp1) from t1, dim_2 d2 where d2.grp1 = t1.grp2 ;
-- create bitmap index t1_1a on t1(d1.agrp1) from t1, dim_1 d1 where d1.grp1 = t1.grp1 ;

prompt  ==================
prompt  Exchanging (maybe)
prompt  ==================

alter table pt_range
        exchange partition p8001 with table t1
        including indexes
        without validation
;

I’ve got the same create statement twice for one of the bitmap join indexes – as it stands the indexes will be created in the right order and the exchange will work; if you comment out the first t1_1a create and uncomment the second the exchange will fail. (If you comment out the ‘including indexes’ then the exchange will succeed irrespective of the order of index creation, but that rather defeats the point of being able to exchange partitions.)
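
If you want to check the creation order of the bitmap join indexes on your own tables without resorting to jijoin$, a rough proxy (a sketch, not the query from the OTN thread) is to compare object_id values, which increase with creation time:

select  i.table_name, i.index_name, o.object_id, o.created
from    user_indexes i, user_objects o
where   o.object_name = i.index_name
and     o.object_type = 'INDEX'
and     i.index_type  = 'BITMAP'
and     i.table_name  in ('PT_RANGE', 'T1')
order by
        i.table_name, o.object_id
;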

I’ve reproduced the problem in 12.1.0.2, 11.2.0.4 and 10.2.0.5

Footnote: running an extended trace didn’t help me work out how Oracle is detecting the mismatch, presumably it’s something that gets into the dictionary cache in a general “load the index definition” step; but it did show me that (in the “without validation” case) the code seems to check the correctness of the exchange table’s primary key data BEFORE checking whether the indexes match properly.

January 28, 2016

Bitmap Efficiency

Filed under: bitmaps,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 1:02 pm BST Jan 28,2016

An interesting observation came up on the Oracle-L list server a few days ago that demonstrated how clever the Oracle software is at minimising run-time work, and how easy it is to think you know what an execution plan means when you haven’t actually thought through the details – and the details might make a difference to performance.

The original question was about a very large table with several bitmap indexes, and an anomaly that appeared as a query changed its execution plan.  Here are the critical sections from the plans (extracted from memory with rowsource execution statistics enabled):

--------------------------------------------------------------------------------------------------------
|  Id |Operation                        | Name       | Starts | E-Rows | A-Rows |     A-Time | Buffers |
--------------------------------------------------------------------------------------------------------
|   6 |    TABLE ACCESS BY INDEX ROWID  |       FACT |      1 |      1 |     24 |00:00:00.01 |      31 |
|   7 |     BITMAP CONVERSION TO ROWIDS |            |      1 |        |     24 |00:00:00.01 |       7 |
|   8 |      BITMAP AND                 |            |      1 |        |      1 |00:00:00.01 |       7 |
|*  9 |       BITMAP INDEX SINGLE VALUE |     FACT_0 |      1 |        |      1 |00:00:00.01 |       3 |
|* 10 |       BITMAP INDEX SINGLE VALUE |  FACT_DIM1 |      1 |        |      4 |00:00:00.01 |       4 |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
     9 - access("FACT"."C0"=243001)
    10 - access("FACT"."C1"="DIMENSION1"."ID")


-------------------------------------------------------------------------------------------------------
|  Id | Operation                      | Name       | Starts | E-Rows | A-Rows |     A-Time | Buffers |
-------------------------------------------------------------------------------------------------------
|   7 |    BITMAP CONVERSION TO ROWIDS |            |      5 |        |      8 |00:00:00.01 |     119 |
|   8 |     BITMAP AND                 |            |      5 |        |      1 |00:00:00.01 |     119 |
|*  9 |      BITMAP INDEX SINGLE VALUE |  FACT_DIM1 |      5 |        |     20 |00:00:00.01 |      28 |
|* 10 |      BITMAP INDEX SINGLE VALUE |  FACT_DIM2 |      5 |        |    140 |00:00:00.01 |      78 |
|* 11 |      BITMAP INDEX SINGLE VALUE |     FACT_0 |      5 |        |      5 |00:00:00.01 |      13 |
|  12 |   TABLE ACCESS BY INDEX ROWID  |       FACT |      8 |      1 |      8 |00:00:00.01 |       8 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
     9 - access("FACT"."C1"="DIMENSION1"."ID")
    10 - access("FACT"."C2"="DIMENSION2"."ID")
    11 - access("FACT"."C0"=243001)

The first plan shows the steps leading to a single access (Starts = 1) to the FACT table after combining two bitmap indexes; the second shows the second child of a nested loop join where Oracle has combined three bitmap indexes to access the FACT table – operation 7 (and its descendants) execute 5 times in this case. I’ve included the related parts of the predicate section so that you can see that the predicates at operations 9 and 10 of the first plan are the same as the predicates at operations 9 and 11 of the second plan.

So here’s the question – if one access to fact_dim1 requires 4 buffer visits, why does it take 28 buffer visits to do the same thing 5 times (and it is with the same value every time); conversely, if one access to fact_0 requires 3 buffer visits, why do 5 visits to do the same thing take only 13 buffer visits? (Note: the arithmetic is made a little more obscure by the way in which index branch blocks may be pinned during nested loop joins.)

Then there’s a further question – not visible in the plan – the A-Rows in the “BITMAP INDEX SINGLE VALUE” operation is the number of bitmap sections in the rowsource, and we can see that the key values for index fact_dim2 have a significant number of bitmap chunks for a single key (5 executions returned 140 bitmap chunks). This scale, though, is true of all three indexes – in fact a follow-up email pointed out that a typical key value in EVERY ONE of the three indexes consisted of about 100 bitmap chunks, so why can’t we see those hundreds in the execution plan ?

So this is where we’re at: we have an execution plan where we haven’t visited all the bitmap chunks for a bitmap key, and the order in which the bitmap indexes are used in the plan seems to have some effect on the choice of leaf-blocks you visit when accessing the chunks. So (a) could a change in the order of indexes make a significant difference to the number of bitmap chunks you visit and the resulting performance, and (b) is there a way to control the order in which you visit the indexes. That’s where the note starts to get a bit technical – if you don’t want to read any more the answers are: (a) yes but probably not significantly and (b) yes.

Demo

To investigate what goes on inside a “BITMAP AND” I created a table with two bitmap indexes and used a very large setting for pctfree for the indexes so that they had to be stored with a large number of bitmap chunks per key. Here’s the code that I used, with some results from an instance of 12.1.0.2:


create table people
nologging
as
with generator as (
        select  --+ materialize 
                rownum id 
        from dual
        connect by
                level <= 1e4
)
select
        rownum                  id,
        mod(rownum-1, 1e2)      id_town_home,
        trunc((rownum-1)/1e4)   id_town_work,
        rpad('x',10,'x')        small_vc,
        rpad('x',100,'x')       padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6
;
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'PEOPLE',
                method_opt       => 'for all columns size 1'
        );
end;
/

create bitmap index pe_home on people(id_town_home) nologging pctfree 95;
create bitmap index pe_work on people(id_town_work) nologging pctfree 95;

select
        index_name, distinct_keys, num_rows, leaf_blocks, avg_leaf_blocks_per_key
from
        user_indexes
where
        table_name = 'PEOPLE'
order by
        index_name
;


INDEX_NAME           DISTINCT_KEYS   NUM_ROWS LEAF_BLOCKS AVG_LEAF_BLOCKS_PER_KEY
-------------------- ------------- ---------- ----------- -----------------------
PE_HOME                        100      30399       15200                     152
PE_WORK                        100       1800         907                       9

As you can see I’ve generated two columns (id_town_home, id_town_work) with 100 distinct values and 10,000 rows each, but with very different data distributions – the rows for any given value for id_town_home are uniformly spread across the entire table, every hundredth row; while the rows for any given value of id_town_work are very tightly clustered as a group of 10,000 consecutive rows. As a consequence the index entry (bitmap string) for a typical key value for id_town_home is enormous and has to be broken into 304 chunks spread across 152 leaf blocks (2 index entries per leaf block), while the index entry for a typical key value for id_town_work is much shorter, but still requires 18 chunks spread across 9 leaf blocks.

So what will I see if I run the following query, and force it to use a BITMAP AND of the two indexes, in the two different orders:

select
        /*+ index_combine(pe) */
        max(small_vc)
from
        people pe
where
        id_town_home = 50
and     id_town_work = 50
;

Based on a very simple interpretation of the typical execution plan and using the index stats shown above we might expect to see roughly A-Rows = 18 with 9 buffer gets (plus a few more for segment headers and branch blocks) on the id_town_work index and A-Rows = 304 with 152 buffer gets on the id_town_home index to allow Oracle to generate and compare the two bit strings – but here are the two plans with their execution stats, generated in 12.1.0.2, and each run after flushing the buffer cache:

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |         |      1 |        |      1 |00:00:00.01 |     118 |    117 |
|   1 |  SORT AGGREGATE                      |         |      1 |      1 |      1 |00:00:00.01 |     118 |    117 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| PEOPLE  |      1 |    100 |    100 |00:00:00.01 |     118 |    117 |
|   3 |    BITMAP CONVERSION TO ROWIDS       |         |      1 |        |    100 |00:00:00.01 |      18 |     17 |
|   4 |     BITMAP AND                       |         |      1 |        |      1 |00:00:00.01 |      18 |     17 |
|*  5 |      BITMAP INDEX SINGLE VALUE       | PE_WORK |      1 |        |     18 |00:00:00.01 |      14 |     13 |
|*  6 |      BITMAP INDEX SINGLE VALUE       | PE_HOME |      1 |        |      4 |00:00:00.01 |       4 |      4 |
-------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |         |      1 |        |      1 |00:00:00.01 |     122 |    120 |
|   1 |  SORT AGGREGATE                      |         |      1 |      1 |      1 |00:00:00.01 |     122 |    120 |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| PEOPLE  |      1 |    100 |    100 |00:00:00.01 |     122 |    120 |
|   3 |    BITMAP CONVERSION TO ROWIDS       |         |      1 |        |    100 |00:00:00.01 |      22 |     20 |
|   4 |     BITMAP AND                       |         |      1 |        |      1 |00:00:00.01 |      22 |     20 |
|*  5 |      BITMAP INDEX SINGLE VALUE       | PE_HOME |      1 |        |      5 |00:00:00.01 |       8 |      7 |
|*  6 |      BITMAP INDEX SINGLE VALUE       | PE_WORK |      1 |        |     18 |00:00:00.01 |      14 |     13 |
-------------------------------------------------------------------------------------------------------------------

We have NOT touched anything like the entire bit-string for the id_town_home index – a bit-string that spans 152 leaf blocks! Clearly Oracle is doing something clever to minimise the work, and it’s so clever that switching the order of these two extremely different indexes in the plan has made virtually no difference to the work done. Obviously I can’t tell you exactly what the code is doing, but I think I can produce a reasonable guess about what’s going on.

The pe_work index has the smaller number of leaf blocks per key, which makes it the better starting choice for the AND in this case, so the optimizer’s default starting action was to pick the first couple of chunks of that index key value; and Oracle immediately sees that the first rowid that it could possibly need in its result set is roughly in the middle of the table – remember that the “key” columns of a bitmap index are (real_key, first_rowid_of_chunk, last_rowid_of_chunk, compressed_bitstring).

Since it now knows the lowest possible rowid that it could need Oracle can now probe the pe_home index by (id_town_home=50, {target_rowid}) – which will let it go to a bitmap index chunk that’s roughly in the middle of the full range of 152. Then Oracle can expand the bitstrings from the chunks it has, reading new chunks as needed from each of the indexes until the 18 chunks / 9 leaf blocks from the pe_work index have been used up (and that range would have aligned with just two or three chunks from the pe_home index) at which point Oracle can see there are no more rows in the table that could match both predicates and it doesn’t need to read the next 75 chunks of the pe_home index.

Conversely, when I forced Oracle to use the (inappropriate) pe_home index first, it read the first couple of chunks, then read the first couple of chunks of the pe_work index, at which point it discovered that it didn’t need any of the pe_home index prior to (roughly) chunk 75, so it jumped straight to the right chunk to align with pe_work and carried on from there. That’s why the forced, less efficient, plan that visited pe_home first visited just a couple more leaf blocks than the plan the optimizer selected for itself.

Bottom line on performance (tl;dr) – Oracle is sufficiently smart about checking the start and end ranges on bitmap indexes (rather than arbitrarily expanding the entire bitmap for each key) that even for very large bitmap index entries it will probably only access a couple of “redundant” leaf blocks per index even if it picks the worst possible order for using the indexes. You’re far more likely to notice Oracle picking the wrong indexes (because you know the data better) than you are to spot it using the right indexes in the wrong order – and given that bitmap indexes tend to be relatively small and well buffered (compared to the tables), and given the relatively large number of rows we pick by random I/O from fact tables, a little extra work in the bitmap indexes is unlikely to make a significant difference to the performance of most queries.

Closing fact: in the unlikely circumstances that you do spot the special case where it will make a difference (and it will probably be a difference in CPU usage) then you can dictate the order of the indexes with the undocumented bitmap_tree() hint.  I may get round to writing up the variations one day but, for this simple case, the index_combine() hint that I used to force the BITMAP AND turned into the following bitmap_tree() hint in the outline:

bitmap_tree(@sel$1 pe@sel$1 and((people.id_town_work) (people.id_town_home)))

bitmap_tree( @query_block     table_name@query_block     and( ({first index definition}) ({second index definition}) ) )

Obviously not suitable to throw into production code casually – check with Oracle support if you think it’s really necessary – but if you wanted to reverse the order of index usage in this case you could just swap the order of the index definitions. If you thought there was a third index that should be used you could include its definition (note that it’s table_name.column_name – the index definition – in the brackets).
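
So, based on the outline above, reversing the order of index usage in this demo would presumably need nothing more than swapping the two index definitions inside the hint:

select
        /*+ bitmap_tree(@sel$1 pe@sel$1 and((people.id_town_home) (people.id_town_work))) */
        max(small_vc)
from
        people pe
where
        id_town_home = 50
and     id_town_work = 50
;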

My reference: bitmap_control_02.sql
