Oracle Scratchpad

August 24, 2016

Basicfile LOBs 4

Filed under: Infrastructure,LOBs,Oracle,Uncategorized — Jonathan Lewis @ 7:02 pm BST Aug 24,2016

At the end of the previous installment we saw that a single big batch delete would (apparently) attach all the “reusable” chunks into a single freepool, and asked the questions:

  • Why would the Oracle developer think that this use of one freepool is a good idea ?
  • Why might it be a bad idea ?
  • What happens when we start inserting more data ?

(Okay, I’ll admit it, the third question is a clue about the answer to the second question.)

I find that this process of asking “what’s good, what’s bad, what could possibly go wrong” is an excellent way to prompt thoughts about why Oracle Corp. might have chosen a particular strategy and what that means in terms of the best (or expected) use of the feature and worst threats from misuse of the feature. So lets’s see what thoughts we can come up with.

  • Good idea: The only alternative to using a single freepool when you make chunks reusable is to spread the chunks uniformly across all the freepools – either putting the chunks onto the same free pool that the LOB was previously attached to or doing some sort of round-robin. If you go for either of these fair-share strategies you increase the amount of contention on LOB deletes if many users are deleting at the same time – which sounds like someething you might want to avoid, but LOBs are supposed to be fairly static (somewhere on MoS there’s a note that says the expected behaviour is pretty much: “we thought you’d write once, read many, and not update”) so surely a small amount of contention shouldn’t be a big problem
  • Bad idea: As mentioned in a previous post, it looks like the freepool picked by a process is dependent on the process id – so if you happen to have just a couple of processes doing large deletes they might, coincidentally, pick the same freepool and end up constantly contending with each other rather than drifting in and out of collisions. If, as often happens with archive-like processes, you use one or two processes to delete a large fraction of the data you end up with one or two freepools holding lots of reusable space and all the other freepools holding no freespace – which brings us to the third question.
  • What happens next: Let’s say 3% of your LOB (one day out of a month) is currently “reusable chunks” and the chunks are all attached to the same freepool; your process connects to insert some new LOBs and its process id identifies the wrong freepool. There are no free blocks below the highwater mark and the retention limit is long gone. Does your process (a) add an extent to create some more free space (this is the type of thing that used to happen with manual segment space management, freelist groups and freelists for non-LOB tables and indexes) or (b) start stealing from another freepool that has reusable chunks. In either case what’s going to happen in the longer term ?
  • What happens even later: Imagine you have 28 days of data and use a single process to delete data on the 29th day. For reasons of concurrency you have been running with freepools 20. If option (a) applies then (assuming everything works perfectly) at steady state you will end up with roughly 20 days worth of reusable chunks spread across your 20 freepools before the system stabilises and stops adding unnecessary extents; if option (b) applies then (assuming everything works perfectly) every night you put a load of reusable chunks on one freepool and all through the day your 20 processes are fighting (at the oldest end of the index) to reuse those chunks. I said in an earlier installment that multiple freepools got rid of “the two hot spots” – this single thread deletion strategy has just brought one of them back.

So what really happens ? By the end of the last installment I had deleted the oldest 3,000 LOBs and found them attached as reusable chunks in freepool 2 with several consecutive “empty”  (nrows=81, rrows=0) leaf blocks at the low end of all the other pools.  After running my 4 concurrent processes to insert 750 rows each (i.e. insert the replacements for the 3,000 rows I’ve deleted) this is what the index treedump looks like (with a little editing to show the main breaks between freepools):

----- begin tree dump
branch: 0x1800204 25166340 (0: nrow: 60, level: 1)
   leaf: 0x180020e 25166350 (-1: nrow: 22 rrow: 22)
   leaf: 0x1800212 25166354 (0: nrow: 76 rrow: 76)
   leaf: 0x1800216 25166358 (1: nrow: 81 rrow: 81)
   leaf: 0x180021a 25166362 (2: nrow: 74 rrow: 74)
   leaf: 0x1800239 25166393 (3: nrow: 81 rrow: 81)
   leaf: 0x180023d 25166397 (4: nrow: 81 rrow: 81)
   leaf: 0x1800206 25166342 (5: nrow: 81 rrow: 81)
   leaf: 0x180020a 25166346 (6: nrow: 81 rrow: 81)
   leaf: 0x180021e 25166366 (7: nrow: 81 rrow: 81)
   leaf: 0x1800222 25166370 (8: nrow: 81 rrow: 81)
   leaf: 0x180022a 25166378 (9: nrow: 81 rrow: 81)
   leaf: 0x180022e 25166382 (10: nrow: 78 rrow: 78)
   leaf: 0x1800232 25166386 (11: nrow: 151 rrow: 151)
   leaf: 0x1800226 25166374 (12: nrow: 0 rrow: 0)
   leaf: 0x180020f 25166351 (13: nrow: 64 rrow: 64)
   leaf: 0x1800213 25166355 (14: nrow: 77 rrow: 77)
   leaf: 0x1800217 25166359 (15: nrow: 81 rrow: 81)
   leaf: 0x1800261 25166433 (16: nrow: 81 rrow: 81)
   leaf: 0x1800265 25166437 (17: nrow: 81 rrow: 81)
   leaf: 0x1800269 25166441 (18: nrow: 81 rrow: 81)
   leaf: 0x180026d 25166445 (19: nrow: 81 rrow: 81)
   leaf: 0x1800271 25166449 (20: nrow: 81 rrow: 81)
   leaf: 0x1800275 25166453 (21: nrow: 81 rrow: 81)
   leaf: 0x1800279 25166457 (22: nrow: 81 rrow: 81)
   leaf: 0x180027d 25166461 (23: nrow: 81 rrow: 81)
   leaf: 0x1800242 25166402 (24: nrow: 122 rrow: 122)
   leaf: 0x1800229 25166377 (25: nrow: 0 rrow: 0)
   leaf: 0x1800214 25166356 (26: nrow: 36 rrow: 36)
   leaf: 0x1800230 25166384 (27: nrow: 81 rrow: 81)
   leaf: 0x1800238 25166392 (28: nrow: 81 rrow: 81)
   leaf: 0x180023c 25166396 (29: nrow: 81 rrow: 81)
   leaf: 0x1800225 25166373 (30: nrow: 81 rrow: 81)
   leaf: 0x180022d 25166381 (31: nrow: 75 rrow: 75)
   leaf: 0x1800231 25166385 (32: nrow: 81 rrow: 81)
   leaf: 0x1800235 25166389 (33: nrow: 81 rrow: 81)
   leaf: 0x180022b 25166379 (34: nrow: 81 rrow: 81)
   leaf: 0x180022f 25166383 (35: nrow: 81 rrow: 81)
   leaf: 0x1800233 25166387 (36: nrow: 81 rrow: 81)
   leaf: 0x1800237 25166391 (37: nrow: 134 rrow: 134)
   leaf: 0x1800215 25166357 (38: nrow: 1 rrow: 0)
   leaf: 0x180026e 25166446 (39: nrow: 4 rrow: 0)
   leaf: 0x180021b 25166363 (40: nrow: 1 rrow: 0)
   leaf: 0x180024b 25166411 (41: nrow: 2 rrow: 0)
   leaf: 0x1800276 25166454 (42: nrow: 2 rrow: 0)
   leaf: 0x180024f 25166415 (43: nrow: 0 rrow: 0)
   leaf: 0x180027e 25166462 (44: nrow: 4 rrow: 0)
   leaf: 0x1800221 25166369 (45: nrow: 0 rrow: 0)
   leaf: 0x180027a 25166458 (46: nrow: 0 rrow: 0)
   leaf: 0x1800218 25166360 (47: nrow: 0 rrow: 0)
   leaf: 0x180021c 25166364 (48: nrow: 152 rrow: 0)
   leaf: 0x1800220 25166368 (49: nrow: 152 rrow: 0)
   leaf: 0x1800224 25166372 (50: nrow: 152 rrow: 0)
   leaf: 0x1800228 25166376 (51: nrow: 152 rrow: 72)
   leaf: 0x180022c 25166380 (52: nrow: 152 rrow: 152)
   leaf: 0x1800234 25166388 (53: nrow: 152 rrow: 152)
   leaf: 0x1800253 25166419 (54: nrow: 152 rrow: 152)
   leaf: 0x1800257 25166423 (55: nrow: 152 rrow: 152)
   leaf: 0x180025b 25166427 (56: nrow: 152 rrow: 152)
   leaf: 0x180025f 25166431 (57: nrow: 152 rrow: 152)
   leaf: 0x1800263 25166435 (58: nrow: 1 rrow: 1)
----- end tree dump


The number of leaf blocks has dropped from 72 to 60 – I didn’t think that this could happen without an index coalesce or rebuild, but maybe it’s a special feature of LOBINDEXes or maybe it’s a new feature of B-trees in general that I hadn’t noticed. Some of the “known empty” leaf blocks seem to have been taken out of the structure.

We still see the half full / full split between the leaf blocks for the first 3 freepools when compared to the top freepool.

There are still some empty leaf blocks (rrow = 0), but apart from the top freepool no more than one per freepool for the other sections that are indexing LOBs.

The section of index that is the freepool 2 section for “reusable” chunks shows an interesting anomaly. There are some leafblocks that are now empty (rrow=0) but were only holding a few index entries (nrow=1-4 rather than the 75 – 140 entries that we saw in the previous installment) at the moment they were last updated; this suggests a certain level of contention with problems of read-consistency, cleanout, and locking between processes trying to reclaim reusable blocks.

It’s just slightly surprising the the top freepool shows several empty leaf blocks – is this just a temporary coincidence, or a boundary case that means the blocks will never be cleaned and re-used; if it’s a fluke will a similar fluke also reappear (eventually) on the other freepools. Is it something to do with the fact that freepool 2 happened to be the freepool that got the first lot of reusable chunks ? Clearly we need to run a few more cycles of deletes and inserts to see what happens.

We have one important conclusion to make but before we make it let’s look at the partial key “col 0” values in the row directory of the root block just to confirm that the breaks I’ve listed above do correspond to each of the separate freepool sections:

 0:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 09 8f
 1:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0b
 2:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0b bc
 3:     col 0; len ..; (..):  00 00 00 01 00 00 09 db 0d
 4:     col 0; len 10; (10):  00 00 00 01 00 00 09 db 0d 51
 5:     col 0; len 10; (10):  00 00 00 01 00 00 09 db bf f4
 6:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c0 77
 7:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c1 90
 8:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 77
 9:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c2 fa
10:     col 0; len 10; (10):  00 00 00 01 00 00 09 db c4 45
11:     col 0; len ..; (..):  00 00 00 01 00 00 09 db c5

12:     col 0; len 10; (10):  00 02 00 01 00 00 09 da fb 74
13:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 08 d9
14:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 09 c0
15:     col 0; len ..; (..):  00 02 00 01 00 00 09 db 0b
16:     col 0; len 10; (10):  00 02 00 01 00 00 09 db 0b ee
17:     col 0; len 10; (10):  00 02 00 01 00 00 09 db bf 8b
18:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c0 a4
19:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c2 21
20:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c3 6c
21:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c4 21
22:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c5 9e
23:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 53
24:     col 0; len 10; (10):  00 02 00 01 00 00 09 db c6 d6

25:     col 0; len 10; (10):  00 04 00 01 00 00 09 da fd fb
26:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 08 38
27:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0a 19
28:     col 0; len ..; (..):  00 04 00 01 00 00 09 db 0b
29:     col 0; len 10; (10):  00 04 00 01 00 00 09 db 0c 7d
30:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc 64
31:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bc b5
32:     col 0; len ..; (..):  00 04 00 01 00 00 09 db bd
33:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd 51
34:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd a2
35:     col 0; len 10; (10):  00 04 00 01 00 00 09 db bd f3
36:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 44
37:     col 0; len 10; (10):  00 04 00 01 00 00 09 db be 95

38:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
39:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
40:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
41:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
42:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
43:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
44:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
45:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00
46:     col 0; len 10; (10):  00 05 57 b4 d3 7d 00 00 00 00

47:     col 0; len 10; (10):  00 06 00 01 00 00 09 da fe d4
48:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 00 ca
49:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 03 24
50:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 05 4c
51:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 07 a6
52:     col 0; len ..; (..):  00 06 00 01 00 00 09 db 0a
53:     col 0; len 10; (10):  00 06 00 01 00 00 09 db 0c 5a
54:     col 0; len 10; (10):  00 06 00 01 00 00 09 db bf da
55:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c1 6c
56:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c2 cc
57:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c4 90
58:     col 0; len 10; (10):  00 06 00 01 00 00 09 db c6 22

I’ve broken the list and numbered the entries to match the treedump above, so it’s each to check that leaf blocks 38 to 46 are the now empty blocks for the reusable chunks. We started the reload with 3,001 entries for reusable chunks all in one freepool; we’ve ended it with none. Something has “stolen” the reusable chunks from freepool 2 so that they could be used for creating new LOBs that were distributed across all the freepools.

Oracle has been very efficient about re-using the index space, with a little bit of wastage creeping in, perhaps caused by coincidences in timing, perhaps by some code that avoids waiting too long when trying to lock index entries. We have a contention point because of the single threaded delete – but it doesn’t appear to be a disaster for space utilisation. Of course we need to look at the level of contention, and repeat the cycle a few times, changing the freepool used for deletion fairly randomly to see if we just got lucky or if the first few deletes are special cases. We can also ask questions about how the “stealing” takes place – does a process steal one index entry at a time, or does it take several consecutive index entries from the same block while it’s got the leaf block locked – but perhaps we don’t really need to know the fine details, the amount of time spent in contention (typically TX waits of some sort) could tell use whether or not we had a significant problem.

Contention and Resources

For each of the processes running the inserts I took a couple of snapshots – session stats and wait events – to see if anything interesting showed up. Naturally, the closer you look the more strange things you find. Here are a few sets of numbers from v$session_event and v$sesstat (in my snapshot format – with the four sessions always reported in the same order);

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: HW - contention                                985           0          93.15        .095           1
enq: HW - contention                                 10           0           5.46        .546           1
enq: HW - contention                              1,001           0         102.27        .102           1
enq: HW - contention                              1,010           0         106.12        .105           1

db file sequential read                           1,038           0          40.75        .039           2
db file sequential read                              39           0           3.21        .082           1
db file sequential read                           1,038           0          28.33        .027           1
db file sequential read                           1,046           0          34.89        .033           1

Name                                                                     Value
----                                                                     -----
physical reads                                                           1,038
physical reads direct                                                      979

physical reads                                                              39
physical reads direct                                                       19

physical reads                                                           1,038
physical reads direct                                                      998

physical reads                                                           1,046
physical reads direct                                                    1,005

session logical reads                                                  114,060
session logical reads                                                   22,950
session logical reads                                                  104,555
session logical reads                                                   93,173

data blocks consistent reads - undo records applied                      2,165
data blocks consistent reads - undo records applied                        119
data blocks consistent reads - undo records applied                      1,222
data blocks consistent reads - undo records applied                        193

My first thought when looking at the wait events was to get an idea of where most of the time went, and I had expected the HW enqueue to be the most likely contender: this enqueue is held not only when the high water mark for a segment is moved, it’s also held when a process is doing any space management for inserting a LOB. So my first suprise was that one session was hardly waiting at all compared to the other sessions.

Then I noticed that this one session was also suffering a very small number of “db file sequential read” waits compared to every other session – but why were ANY sessions doing lots of db file sequential reads: the LOB was declared as nocache so any reads ought to be direct path reads and although Oracle doesn’t always have to wait for EVERY direct path read we should have read (and rewritten) 1,500 “reusable” LOB chunks by direct path reads in each session – I refuse to believe we never waited for ANY of them. So take a look at the session stats: which show us the that the “db file sequential read” waits match exactly with the “physical reads” count but most of the “physical reads” are recorded “physical reads direct” – Oracle is recording the wrong wait event while reading the “reusable” chunks.

Okay, so our direct path read waits are being recorded incorrectly: but one session does hardly any physical reads anyway – so what does that mean ? It means the process ISN’T reusing the chunks – you can’t be reusing chunks if you haven’t t read them. But the dumps from the index tell us that all the reusable chunks have been reused – so how do we resolve that contradiction ? Something is reading the index to identify some reusable chunks, wiping the reference from the index, then not using the chunks so (a) we’ve got some reusable chunks “going missing” and (b) we must be allocating some new chunks from somewhere – maybe bumping the high water mark of the segment, maybe allocating new extents.

Fortunately I had used the dbms_space package to check what the lob segment looked like after I had loaded it. It was 8192 blocks long, with 66 blocks shown as unused and 8,000 (that’s exactly 2 blocks/chunks per LOB) marked as full. After the delete/insert cycle is was 8,576 blocks long, with 8,000 blocks marked as full and 444 marked as unused. We had added three extents of 1MB each that we didn’t need, and one session seems to have avoided some contention by using the new extents for (most of) its LOBs rather than competing for the reusable space with the other LOBs.

Was this a one-off, or a repeatable event. How bad could it get ?



Is there a way of discovering from SQL (perhaps with a low-cost PL/SQL function) the freepool for a LOB when it’s defined as Basicfile. You can get the LOBid for a Securefiles LOB using the dbms_lobutil package and the LOBid includes the critical first two bytes – but the package is only relevant to Securefiles. I rather fancy the idea of a process knowing which freepool it is associated with and only deleting LOBs that come out of that freepool.


A thought about knowing the freepool – you could always add the freepool number as a column to the base table and have a “before row insert” trigger (or some other coding strategy) that populated the column with the freepool id. This would require the application to know a little too much about the implementation and about the internals, and would require a code change somewhere if anyone decided to change freepools. (Possibly VPD/RLS/FGAC could be used to hide the details from the outside world). The code to delete LOBs could then delete only those rows that matched its current freepool. WARNING – this is an idle thought, probably not relevant to (most) people, and there may be better strategies to make the effort irrelevant.

Following a little accident with another test I noticed a pattern that suggested that you would only reuse space on your own freepool and might, at the same time, at the same time moving some of your freepool into segment freespace. After a couple of experiments though it looks as if you use the reusable space in your freepool; if you have no reusable space left in your freepool you start using the segment freespace (ignoring the reusable space in other freepools); if there is no free space in the segment you raid EVERY OTHER FREEPOOL and transfer all their reusable space to segment freespace – which is bad news if there are 200,000 reusable chunks in their freepools. Unfortunately, at the same time, you add an extent to the segment and insert your new LOB into that extent.  There’s still plenty of scope for refining details here; I’ve only tried pushing one session into raiding the other freepools, so I don’t know what happens if there are multiple sessions trying to create LOBs while the raid is going on, who blocks whom, for how long, and what happens about new extents etc. Clearly, though, there’s plenty of scope for nasty things to happen.



May 3, 2016


Filed under: CBO,compression,Execution plans,Infrastructure,Oracle,Uncategorized — Jonathan Lewis @ 8:11 am BST May 3,2016

The OTN database forum supplied a little puzzle a few days ago – starting with the old, old, question: “Why is the plan with the higher cost taking less time to run?”

The standard (usually correct) answer to this question is that the optimizer doesn’t know all it needs to know to predict what’s going to happen, and even if it had perfect information about your data the model used isn’t perfect anyway. This was the correct answer in this case, but with a little twist in the tail that made it a little more entertaining. Here’s the query, with the two execution plans and the execution statistics from autotrace:

        client_account d, 
        client         e, 
        account        a
        A.acct_nb ='00000000000000722616216'</li>


Plan (A) with a full tablescan of client_account – cost 808, runtime 1.38 seconds, buffer gets 17,955

| Id | Operation                      | Name           | Rows  | Bytes  | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT               |                |     1 |    59  |   808 (14) | 00:00:10 |
|  1 |  NESTED LOOPS                  |                |     1 |    59  |   808 (14) | 00:00:10 |
|  2 |   NESTED LOOPS                 |                |     1 |    59  |   808 (14) | 00:00:10 |
|* 3 |    HASH JOIN                   |                |     1 |    42  |   806 (14) | 00:00:10 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT        |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT    |     1 |        |     4  (0) | 00:00:01 |
|  6 |     TABLE ACCESS FULL          | CLIENT_ACCOUNT |  9479K|   108M |   763 (10) | 00:00:09 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT      |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT         |     1 |    17  |     2  (0) | 00:00:01 |

     0  recursive calls
     0  db block gets
 17955  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Plan (B) with an index fast full scan on a client_account index – cost 1,190, runtime 0.86 seconds, buffer gets 28696

| Id | Operation                      | Name              | Rows  | Bytes  | Cost (%CPU)| Time     |
|  0 | SELECT STATEMENT               |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  1 |  NESTED LOOPS                  |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|  2 |   NESTED LOOPS                 |                   |     1 |    59  |  1190  (8) | 00:00:14 |
|* 3 |    HASH JOIN                   |                   |     1 |    42  |  1188  (8) | 00:00:14 |
|  4 |     TABLE ACCESS BY INDEX ROWID| ACCOUNT           |     1 |    30  |     5  (0) | 00:00:01 |
|* 5 |      INDEX RANGE SCAN          | XAK1ACCOUNT       |     1 |        |     4  (0) | 00:00:01 |
|  6 |     INDEX FAST FULL SCAN       | XPKCLIENT_ACCOUNT | 9479K |   108M |  1145  (5) | 00:00:13 |
|* 7 |    INDEX UNIQUE SCAN           | XPKCLIENT         |     1 |        |     1  (0) | 00:00:01 |
|  8 |   TABLE ACCESS BY INDEX ROWID  | CLIENT            |     1 |    17  |     2  (0) | 00:00:01 |

     0  recursive calls
     0  db block gets
 28696  consistent gets
     0  physical reads
     0  redo size
   623  bytes sent via SQL*Net to client
   524  bytes received via SQL*Net from client
     2  SQL*Net roundtrips to/from client
     0  sorts (memory)
     0  sorts (disk)
     1  rows processed

Note, particularly, that the two plans are the same apart from operation 6 where a full tablescan changes to an index fast full scan, predicting the same number of rows but with an increase of 50% in the cost; the increase in cost is matched by an increase in the reported workload – a 60% increase in the number of consistent reads and no disk reads or recursive SQL in either case. Yet the execution time (on multiple repeated executions) dropped by nearly 40%.

So what’s interesting and informative about the plan ?

The cost of a tablescan or an index fast full scan is easy to calculate; broadly speaking it’s “size of object” / “multiblock read count” * k, where k is some constant relating to the hardware capability. The costs in these plans and the autotrace statistics seem to be telling us that the index is bigger than the table, while the actual run times seem to be telling us that the index has to be smaller than the table.

It’s easy for an index to be bigger than its underlying table, of course; for example, if this table consisted of nothing but two short columns the index could easily be bigger (even after a rebuild) because it would be two short columns plus a rowid. If that were the case here, though, we would expect the time to fast full scan the index to be higher than the time to scan the table.

So two thoughts crossed my mind as I looked at operation 6:

  • Mixing block sizes in a database really messes up the optimizer costing, particularly for tablescans and index fast full scans. Maybe the table had been built in a tablespace using 32KB  blocks while the index had been built in a tablespace using the more common 8KB blocksize – I didn’t want to start working out the arithmetic but that might be just enough to produce the contradiction.
  • Maybe the table was both bigger AND smaller than the index – bigger because it held more data, smaller because it had been compressed. If so then the difference in run-time would be the overhead of decompressing the rows before projecting and comparing the data.

Conveniently the OP has included an extract from the 10053 trace:

Table Stats::
  Table: CLIENT_ACCOUNT  Alias:  D
    #Rows: 9479811  #Blks:  18110  AvgRowLen:  71.00  ChainCnt:  0.00
  Column (#1): CLNT_ID(
    AvgLen: 6 NDV: 1261035 Nulls: 0 Density: 0.000001 Min: 0 Max: 4244786
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 239
  Column (#2): ACCT_ID(
    AvgLen: 6 NDV: 9479811 Nulls: 0 Density: 0.000000 Min: 1 Max: 22028568
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 255

Index Stats::
  Index: XPKCLIENT_ACCOUNT  Col#: 1 2
    LVLS: 2  #LB: 28543  #DK: 9479811  LB/K: 1.00  DB/K: 1.00  CLUF: 1809449.00

Note that the index is called xpclient_account – which suggests “primary key” –  and the number of distinct keys in the index (#DK) matches the number of rows in the table(#Rows). The index and table stats seem to be consistent so we’re not looking at a problem of bad statistics.

Now to do some simple (ballpark) arithmetic: for the table can we check if  “rows * average row length / 8K =  blocks”. We can read the numbers directly from the trace file:  9,500,000 * 71 / 8,000 = 84,000.  It’s wrong by a factor of about 4 (so maybe it’s a 32K block, and maybe I could rule out that possibility by including more detail in the arithmetic – like allowing properly for the block header, row overheads, pctfree etc).

For the index – we believe it’s the primary key, so we know the number of rows in the index – it’s the same as the number of distinct keys. As for the length of an index entry, we have the index definition (col#: 1 2) and we happen to have the column stats about those columns so we know their average length. Allowing for the rowid and length bytes we can say that the average index entry is (6 +1) + (6 + 1) + 6 = 20 bytes.  So the number of leaf blocks should be roughy 9,500,000 * 20 / 8,000 = 23,750. That’s close enough given the reported 28,543 and the fact that I haven’t bothered to worry about row overheads, block overheads and pctfree.

The aritmetic provides an obvious guess – which turned out to be correct: the table is compressed, the index isn’t. The optimizer hasn’t allowed for the CPU cost of decompressing the compressed rows, so the time required to decompress 9.5M rows doesn’t appear in the execution plan.


Looking at the column stats, it looks like there are roughly 8 acct_ids for each clnt_id, so it would probably be sensible to compress the primary key index (clnt_id, acct_id) on the first column as this would probably reduce the size of the index by about 20%.

Better still – the client_account table has very short rows – it looks like a typical intersection table with a little extra data carried. Perhaps this is a table that should be an index-organized table with no overflow. It looks like there should also be an index (acct_id, clnt_id) on this table to optimse the path from account to client and this would become a secondary index – interestingly being one of those rare cases where the secondary index on an IOT might actually be a tiny bit smaller than the equivalent index on a heap table because (in recent versions of Oracle) primary key columns that are included in the secondary key are not repeated in the index structure. (It’s a little strange that this index doesn’t seem to exist already – you might have expected it to be there given the OP’s query, and given that it’s an “obvious” requirement as an index to protect the foreign key.)

The only argument against the IOT strategy is that the table clearly compresses very well as a heap table, so a compressed heap table plus two B-tree indexes might be more cost-effective than an IOT with a single secondary index.


December 15, 2015


Filed under: Uncategorized — Jonathan Lewis @ 11:22 am BST Dec 15,2015

A recent question on the OTN database forum asked:

I have a table with a,b,c,d,e,f,g,h,i,j,k. columns and I have an index on (a,b) columns. There is a sql statement now with “where a= ?” and we are wondering if it could also be good to add a single index on just (a).

Does it help at all? Does it help in some cases?

This is one of those questions where the answer for a perfectly designed and managed system could easily contradict the pragmatic answer for a live system in its current state. That may mean you have to do the wrong thing in the short term while working (possibly very slowly) towards the right thing.  I gave the following (slightly edited) answer on the forum:

The basic answer is that you do NOT need the single column index if you have the two-column index.

The complex answer is that you may have to spend some time and effort ensuring that the two-column index is used in all cases where it would have been appropriate to use the single column index. This may simple mean ensuring the clustering_factor of the index is adjusted suitably so that the optimizer “likes” the index enough it may mean you (also) have to modify some code to include the cluster_by_rowid hint (when you’re at 12c) so that you don’t suffer a performance impact at run-time.

Key factors to consider: the two-column index will be physically larger than the single column index – this will increase the (optimizer’s estimated) cost of using it; the clustering_factor of the two-column index will almost certainly be larger than the clustering_factor of the single column index – this will also increase the (optimizer’s estimated) cost of using it.

These two points are echoed at run-time: the two column index will be bigger so you will have to do more work (though probably not very much more work) to read the relevant rowids and, if you walk the two-column index in order for a given value of the first column, you will visit the table blocks in a different order compared to the order of visits from the single column index – this may result in the query actually doing noticeably more work at run-time.

The change in the index leaf_block count is often insignificant (especially if, as per your example, the number of rows required – hence blocks visited in the table – is large); the impact of the clustering_factor can make a dramatic difference to the cost calculations; but you can often work around this. In, particularly, you can use the dbms_stats.set_table_prefs() call to set the ‘table_cached_blocks’ parameter for a table so that all its indexes look more desirable to the optimizer.

Bottom line: you don’t need the single column index but if you currently have it and want to drop it the human effort required to ensure that it can be dropped without side effects may make you decide to keep it anyway, especially if it doesn’t seem to be causing any concurrency or other performance overheads.  If you don’t have it yet, then you shouldn’t need to create it – though you might have to do some work to make sure that the optimizer takes full advantage of the two-column index.

Since I’m on the topic, I’ll add that the same arguments apply to a pair of indexes like (a, b, c) and (a, b); if you’ve got the longer index you shouldn’t need the shorter one; however, because the shorter index is a multi-column index, you might find that it’s beneficial to create a column group on that column combination so that the optimizer doesn’t lose information about the number of distinct values for the combination when you drop the index.

December 3, 2015

Five Hints

Filed under: Hints,Oracle,Uncategorized — Jonathan Lewis @ 7:40 am BST Dec 3,2015

This is the content of a “whitepaper” I wrote for my presentation “Five Hints for Optimising SQL” at the recent DOAG conference.


Adding hints to production code is a practice to be avoided if possible, though it’s easy to make the case for emergency patching, and hinting is also useful as the basis of a method of generating SQL Plan Baselines. However, notwithstanding (and sometimes because of) the continuing enhancements to the optimizer, there are cases where the only sensible option for dealing with a problem statement is to constrain the broad brush strategy that the optimizer can take in a way that allows it to find a reasonable execution plan in a reasonable time.

This note describes in some detail the use and effects of five of the “classic” hints that I believe are reasonable strategic options to redirect the optimizer when it doesn’t choose a path that you consider to be the most appropriate choice.

The Big Five

At the time of writing, a query against the view v$sql_hint on Oracle reports 332 hints – but there are very few which we should really consider as safe for production code, and it’s best to view even those as nothing more than a medium-term tool to stabilise performance until the optimizer is able to do a better job with our SQL.

The handful of hints that I tend to rely on for solving problems is basically a set of what I call “structural” queries though in recent years it has become appropriate to label them as “query block” hints. These are hints that give the optimizer some idea of the shape of the best plan without trying to enforce every detail of how it should finalize the plan. The hints (with their negatives where appropriate) are:

  • Unnest / no_unnest — Whether or not to unnest subqueries
  • Push_subq / no_push_subq — When to handle a subquery that has not been unnested
  • Merge / no_merge — Whether to use complex view merging
  • Push_pred / no_push_pred — What to do with join predicates to non-merged views
  • Driving_site — Where to execute a distributed query

Inevitably there are a few other hints that can be very helpful, but a key point I want to stress is that for production code I avoid what I call “micro-management” hints (such as use_nl(), index_rs_asc()) – attempts to control the optimizer’s behaviour to the last little detail; it is very easy to produce massive instability in performance once you start down the path of micro-managing your execution plans, so it’s better not to try.

The rest of this document will be devoted to describing and give examples of these hints.

The Optimizer’s Strategy

You can think of the optimizer as working on a “unit of optimization” which consists of nothing more than a simple statement of the form:

select  list of columns
from    list of tables
where   list of simple predicates

To deal with a more complex query the optimizer stitches together a small number (reduced, if it had its way, to just one) of such simple blocks. So one of the first steps taken by the optimizer aims to transform your initial query into a this simple form. Consider this example:

                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                from    t5
                where = t2.n1
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1

We have an inline view consisting of a two-table join with a subquery correlated to the first table, and from our perspective we have a “simple join” of three objects – t1, v1, and t4. Before it does anything else the optimizer will try to transform this into a straight-line five-table join so that it can join all the tables in order one after the other. As part of that process it will generally attempt to eliminate subqueries in a processing known as unnesting.

Looking at the query as it has been presented author of the code may have been thinking (symbolically) of the underlying problem as:

  • ( ( t1, ( ( t2, subquery t5 ), t3 ) ), t4 )

Take t1, join to it the result of applying the subquery to t2 and joining t3, then join t4.

The optimizer may decide to transform to produce the following:

  • ( ( ( ( t1, t2 ), t3 ), {unnested t5} ), t4 )

Join t2 to t1, join t3 to the result, join the transformed t5 to the result, then join t4 to the result.

If I decide that the original layout demonstrates the appropriate mechanism, my target is to supply the optimizer with just enough hints to lock it into the order and strategy shown, without trying to dictate every little detail of the plan. My hints would look like this:

            qb_name(main) push_pred(v1@main)
            no_unnest(@subq1) push_subq(@subq1)
        select  /*+ qb_name(inline) */
                t2.n1, t3.n2, count(*)
        from    t2, t3
        where exists (
                select  /*+ qb_name(subq1) */
                from    t5
                where = t2.n1
        and     t3.n1 = t2.n2
        group by t2.n1, t3.n2
        )       v1,
        v1.n1 = t1.n1
and     t4.n1(+) = v1.n1

I’ve labelled the three separate select clauses with a query block name (qb_name() hint), told the optimizer that the query block named “inline” should be considered as a separately optimized block (no_merge(@inline)), and the subquery inside that block called “subq1” should be treated as a filter subquery (no_unnest(@subq1)) and applied as early as possible (push_subq(@subq1)).

In some circumstances I might use one more hint to tell the optimizer to consider a single join order: t1, v1, t4 using the hint /*+ leading(t1 v1 t4) */; but in this case I’ve told the optimizer to push the join predicate v1.n1 = t1.n1 inside the view (push_pred(@inline)) – which will make the optimizer do a nested loop from table t1 to view v1, resolving the view for each row it selects from t1.

Having captured 4 of the “big 5” hints in one sample statement, I’ll now comments on each of them (and the final driving_site() hint separately).

Merge / No_merge

This pair of hints apply particularly to “complex view merging”, but can be used to “isolate” sections of a query, forcing the optimizer to break one large query into a number of smaller (hence easier) sections. I see two main uses for the hints (and particularly the no_merge option) – one is to help the optimizer get started when handling a query with a large number of table, the other is simply to block a strategy that the optimizer sometimes chooses when it is a bad move.

Consider, in the first case, a query involving 20 tables, with several subqueries. With such a long list it is very easy for the optimizer to pick a very bad starting join order and never reach a good join order; moreover, because of the multiplicative way in which the optimizer estimates selectivity it’s very easy for the optimizer to decide after a few tables that the cardinality of the join so far is so small that it doesn’t really matter which table to access next. In cases like this we might start by writing a simpler query joining the first four of five tables that we know to be the key to the whole query – once we have got the core of the query working efficiently we can “wrap” it into an inline view with a no_merge hint, and then join the rest of the tables to it, with some confidence that the optimizer will start well and that it can’t go far wrong with the remainder of the tables so, for example

select  ...
from    t1, t2, t3, ..., t20
where   {various predicates}
and     exists {correlated subquery1}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}

Might become

with v1 as (
        select  /*+ no_merge cardinality(2000) */ ...
        from    t1, t2, t3, t4, t5
        where   {various predicates{
        and     exists {correlated subquery1}
select  ...
from    v1, t6, t7, ..., t20
where   {join conditions to v1}
and     {other join conditions}
and     exists {correlated subquery2}
and     column in {non-correlated subquery}

I’ve written the example up using subquery factoring; in earlier versions of Oracle the relevant piece of code would have been written as an inline view, but the “with” clause can help to tidy the SQL up and make it easier to see the logic of what’s being done – provided the practice isn’t taken to such extremes that the final query consists of large number of very small factored subqueries.

I’ve included a cardinality() hint in the factored subquery – it’s not fully documented, and it’s not commonly realised that it can be applied to a query block rather than to a table or list of tables. This query block usage is probably the safest example of using the hint – the table-related usage is badly understood and prone to mis-use.

As an example of blocking a badly selected transformation, consider the following query (where I’ve already included qb_name() hints to name the two separate query blocks):

select  /*+ qb_name(main) */
        t1.vc1, avg_val_t1
from    t1,
        Select  /*+ qb_name(inline) */
                id_parent, avg(val) avg_val_t1 
        from	t2
        group by
        ) v1
        t1.vc2 = 'XYZ'
and     v1.id_parent = t1.id_parent

There are two basic strategies the optimizer could use to optimize this query, and the choice would depend on its estimate of how much data it had to handle . Whichever choice it makes we might, depending on our better understanding of the data, want it to choose the alternative (without rewriting the query beyond hinting it).

One option is for Oracle to execute the inline view to generate the aggregate data v1 then join the result to t1; the other is to join t2 (the view’s underlying table) to t1 and then work out an aggregation of the join that would give the same result.

If I want to “join then aggregate” I would use the merge hint, if I wanted to “aggregate then join” I would use the no_merge hint. There are three different ways in which I could introduce the hint:

  • In the inline view itself I could simply add the hint “merge”
  • In the main query I could reference the view by view name “no_merge(v1)”
  • In the main query I could reference the inline query block name “no_merge(@inline)”

Note particularly the “@” symbol that I use to point a hint at a query block; and note that this was not needed when I reference the view name. (The reference by query block name is the more modern, preferred strategy.)

Push_pred / No_push_pred

Once we start dealing with non-mergeable views and have to join to them there are two strategies that we could use for the join; the first is (nominally) to create the entire data set for the view and then use that in a merge join or hash join based on the join predicate, or we could “push a join predicate” into the view definition – in other words for each join value we could add a simple filter predicate to the view definition and derive the view result based on that predicate. For example, if we create a database view called avg_val_view with a definition matching the inline view we used in the previous example, we might see one of two possible execution plans for the following query:

select  t1.vc1, avg_val_t1
from    t1, avg_val_view
where   t1.vc2 = 'XYZ'
and     avg_val_view.id_parent = t1.id_parent

First – if the view is non-mergeable and we don’t push the predicate, we can see the join predicate appearing at operation 1, as we do a hash join between table t1 and the entire result set from aggregating t2. This may be sensible, but it may be very expensive to create the entire aggregate:

| Id | Operation            | Name         | Rows  | Bytes | Cost  |
|  0 | SELECT STATEMENT     |              |     1 |    95 |    27 |
|* 1 |  HASH JOIN           |              |     1 |    95 |    27 |
|* 2 |   TABLE ACCESS FULL  | T1           |     1 |    69 |     2 |
|  3 |   VIEW               | AVG_VAL_VIEW |    32 |   832 |    24 |
|  4 |    HASH GROUP BY     |              |    32 |   224 |    24 |
|  5 |     TABLE ACCESS FULL| T2           |  1024 |  7168 |     5 |

Predicate Information (identified by operation id):
   1 - access("AVG_VAL_VIEW"."ID_PARENT"="T1"."ID_PARENT")
   2 - filter("T1"."VC2"='XYZ')

So we may decide to add the hint /*+ push_pred(avg_val_view) */ to the query – we have to use the view-name method since we don’t have a query block containing the view; if we were using the inline view from the previous query we could have used the “query block” format /*+ push_pred(@inline) */. The plan from pushing predicates is:

| Id | Operation               | Name        | Rows | Bytes | Cost |
|  0 | SELECT STATEMENT        |             |    1 |    82 |    7 |
|  1 |  NESTED LOOPS           |             |    1 |    82 |    7 |
|* 2 |   TABLE ACCESS FULL     | T1          |    1 |    69 |    2 |
|  3 |   VIEW PUSHED PREDICATE | AVG_VAL_VIEW|    1 |    13 |    5 |
|* 4 |    FILTER               |             |      |       |      |
|  5 |     SORT AGGREGATE      |             |    1 |     7 |      |
|* 6 |      TABLE ACCESS FULL  | T2          |   32 |   224 |    5 |

Predicate Information (identified by operation id):
   2 - filter("T1"."VC2"='XYZ')
   4 - filter(COUNT(*)>0)
   6 - filter("ID_PARENT"="T1"."ID_PARENT")

It would actually be a bad idea in this particular case, but if we could access the rows for a given id_parent in t2 efficiently this query could be much more efficient than the previous plan because it would only aggregate the small number of rows that it was going to need at each point, with the smallest row size.

You might note that Oracle has cleverly introduced a filter as operation 4 to eliminate t1 rows where the aggregate would return a row with a zero when there was no matching data. It’s details like this that typical programmers tend to forget when trying to transform SQL by hand.

Unnest / No_unnest

The optimizer prefers joins to subqueries, and will generally try to transform a query to turn a subquery into a join – which often means a semi-join for existence/in, or an anti-join for not exists/not in). As the optimizer has improved with version many such transformations (or decisions to not transform) changed from being driven by rules to being driven by cost – and sometimes we want to override the optimizer because we know its cost calculation is bad. Most commonly we might want to write a query with a subquery – to show our intentions – but tell the optimizer to unnest the subquery: it’s much safer to take this approach rather than to rewrite the query in unnested form ourselves – I’ve seen people do the rewrite incorrectly too many times to trust a user-created rewrite. For example:

        /*+ qb_name(main) unnest(@subq) */
        emp outer
where   outer.sal > (
                        /*+ qb_name(subq) unnest */
                from    emp inner 
                inner.dept_no = outer.dept_no

I’ve show the unnest hint here, and demonstrated the two possible forms – you can either use it in the main query block hint to point it at a give query block name (@subq), or you can use it without a “parameter” in the query block you want unnested. In effect the unnest hint causes Oracle to rewrite the query as:

                dept_no, avg(sal) av_sal
        from    emp 
        group by 
        )               inner,
        emp             outer
        outer.dept_no = inner.dept_no
and     outer.sal > inner.av_sal

You’ll notice that this gives us an in-line aggregate view, so the optimizer could take (or be pushed) one more step into doing complex view merging as well, joining emp to itself before aggregating on a very messy set of columns.

Here’s the plan if we unnest:

| Id  | Operation            | Name    | Rows  | Bytes | Cost  |
|   0 | SELECT STATEMENT     |         |  1000 | 98000 |   114 |
|*  1 |  HASH JOIN           |         |  1000 | 98000 |   114 |
|   2 |   VIEW               | VW_SQ_1 |     6 |   156 |    77 |
|   3 |    HASH GROUP BY     |         |     6 |    48 |    77 |
|   4 |     TABLE ACCESS FULL| EMP     | 20000 |   156K|    36 |
|   5 |   TABLE ACCESS FULL  | EMP     | 20000 |  1406K|    36 |

Predicate Information (identified by operation id):
   1 - access("ITEM_1"="OUTER"."DEPT_NO")

Notice the appearance at operation 2 of a “view” names VW_SQ_1: there are a number of internal view names that appear in Oracle as it transforms queries – the fact that a view name starts with VW_ is a good clue that it’s an internal one. Note, in this particular case that the main work done in the query is the two tablescans of EMP.

Here’s the plan if we don’t unnest:

| Id  | Operation           | Name | Rows  | Bytes | Cost  |
|   0 | SELECT STATEMENT    |      |   167 | 12024 |   252 |
|*  1 |  FILTER             |      |       |       |       |
|   2 |   TABLE ACCESS FULL | EMP  | 20000 |  1406K|    36 |
|   3 |   SORT AGGREGATE    |      |     1 |     8 |       |
|*  4 |    TABLE ACCESS FULL| EMP  |  3333 | 26664 |    36 |

Predicate Information (identified by operation id):
   1 - filter("OUTER"."SAL"> (SELECT /*+ NO_UNNEST */
              AVG("INNER"."SAL") FROM "EMP" "INNER" 
              WHERE "INNER"."DEPT_NO"=:B1))
   4 - filter("INNER"."DEPT_NO"=:B1)

The FILTER at operation 1 tells us that nominally the optimizer will run the subquery once for every row in the emp table, but the optimizer costing (252) tells us that it thinks that really it will execute the table scan only 7 times in total (7 * 36 = 252): once for the driving scan and six more times because there are only six departments in my emp table.  (This “how many executions” type of estimate appeared in the costing calculations relatively recently.)

Push_subq / No_push_subq

Once we can control whether or not Oracle will unnest a subquery or run it as a filter we can then choose whether the subquery should run early or late. Historically the optimizer would always leave subqueries to the very end of query operation – but recently the choice of timing acquired a costing component. “Pushing” a subquery means pushing it down the execution tree – i.e. running it earlier in the plan. To demonstrate this we need a minimum of a two-table join with subquery:

        /*+ leading(t1 t2) push_subq(@subq) */
from    t1, t3
where   t1.n2 = 15
and     exists (
                select  --+ qb_name(subq) no_unnest push_subq 
                from    t2
                where   t2.n1 = 15
                and =
and     t3.n1 = t1.n1
and     t3.n2 = 15

In this query I have a subquery where I’ve blocked unnesting, so it has to run as a filter subquery (in passing, I’ve use the alternative, less commonly known, format for hinting: the single-line hint/comment that starts with – – for a comment and – – + for a hint).

I’ve shown the push_subq hint (run the subquery early) in two different ways – first at the top of the query referencing the query block that I want pushed, and then in the subquery itself where it doesn’t need a parameter.

As you can see, the subquery is correlated to table t1 and I’ve told Oracle to examine only the join order t1 -> t3. The effect of the push_subq hint, therefore, is to tell Oracle to run the subquery for each row of t1 that it examines and join any survivors to t3. The alternative is for Oracle to join t1 to t3 and then run the subquery for every row in the result. Depending on the data and indexes available either option might be the more efficient.

Here are the two plans – first if I don’t push the subquery (note the FILTER operation):

| Id | Operation                    | Name  | Rows | Bytes | Cost  |
|  0 | SELECT STATEMENT             |       |    1 |    28 |   289 |
|* 1 |  FILTER                      |       |      |       |       |
|* 2 |   HASH JOIN                  |       |  173 |  4844 |   116 |
|* 3 |    TABLE ACCESS FULL         | T1    |  157 |  3140 |    57 |
|* 4 |    TABLE ACCESS FULL         | T3    |  157 |  1256 |    57 |
|* 5 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 6 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |

Predicate Information (identified by operation id):
   1 - filter( EXISTS (SELECT /*+ QB_NAME ("SUBQ2") NO_UNNEST */ 0
              FROM "T2" "T2" WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   2 - access("T3"."N1"="T1"."N1")
   3 - filter("T1"."N2"=15)
   4 - filter("T3"."N2"=15)
   5 - filter("T2"."N1"=15)
   6 - access("T2"."ID"=:B1)

Then if I push the subquery

| Id |Operation                     | Name  | Rows | Bytes | Cost  |
|  0 |SELECT STATEMENT              |       |    9 |   252 |   117 |
|* 1 | HASH JOIN                    |       |    9 |   252 |   115 |
|* 2 |  TABLE ACCESS FULL           | T1    |    8 |   160 |    57 |
|* 3 |   TABLE ACCESS BY INDEX ROWID| T2    |    1 |     8 |     2 |
|* 4 |    INDEX UNIQUE SCAN         | T2_PK |    1 |       |     1 |
|* 5 |  TABLE ACCESS FULL           | T3    |  157 |  1256 |    57 |

Predicate Information (identified by operation id):
   1 - access("T3"."N1"="T1"."N1")
   2 - filter("T1"."N2"=15 AND  EXISTS (SELECT /*+ QB_NAME ("SUBQ2")
              PUSH_SUBQ NO_UNNEST */ 0 FROM "T2" "T2" 
              WHERE "T2"."ID"=:B1 AND "T2"."N1"=15))
   3 - filter("T2"."N1"=15)
   4 - access("T2"."ID"=:B1)
   5 - filter("T3"."N2"=15)

Notice how the access to t2 has squeezed itself between t1 and t3 and is also indented one place as a clue that it is a subordinate action on t1, but the FILTER operation visible in the previous plan has disappeared. This plan is an example of a plan that doesn’t follow the well-known “first child first / recursive descent” guideline – Oracle has hidden the FILTER operation and twisted the plan slightly out of its “tradiational” shape as a consequence.


The final hint is for distributed queries, and has no “negative” version. Sometimes the only way you can “tune” a distributed query is to minimise the time spent on network traffic, and this means dictating WHERE the query executes. The driving_site hint lets you make that choice. (Sometimes, having made that choice you also have to include a leading() hint to tell Oracle about the single join order you want it to consider – it’s possible for the optimizer to do some very strange things with distributed queries, especially if the instances have different NLS settings).

Consider the following query (I’ll fill in the XXXX in the hint shortly):

select  /*+ driving_site (XXXX) */
        dist_home               dh,
        dist_away@remote_db     da
        dh.small_vc like '1%'
and =;

This query extracts a small amount of data from a table called DIST_HOME in the local database, and joins it to some data in a table called DIST_AWAY in a remote database, producing a reasonably large number of medium-sized rows. There are basically two obvious plans:

  • nested loop – for each row in dist_home, query dist_away for matching data
  • hash join – create an in-memory hash table from the dist_home data, and then probe it with data from all the rows in dist_away.

The first plan will produce a large number of network round trips – so that’s not very good; the second plan will pull a very large amount of data from the remote database if the query operates at the local database (it’s only the columns we need, but it will be ALL the rows from the remote database).

Choosing the second plan but executing it at the remote database means we’ll send a small parcel of data to the remote database, do the join there to produce (we hope) a reasonable result set, then send it back to the local database. The network traffic will be minimised without causing an undesirable increase in other resource usage. To make this plan happen all I needed to do in the query was change the XXXX in the driving_site() hint to reference a table alias from a table in the remote database, in this case driving_site(da).

Here’s the execution plan:

| Id | Operation              | Name     | Rows | Bytes | Inst |IN-OUT|
|  0 | SELECT STATEMENT REMOTE|          |  216 | 48600 |      |      |
|* 1 |  HASH JOIN             |          |  216 | 48600 |      |      |
|  2 |   REMOTE               | DIST_HOME|  216 |  4320 |    ! | R->S |
|  3 |   TABLE ACCESS FULL    | DIST_AWAY| 2000 |   400K| TEST |      |

Predicate Information (identified by operation id):
  1 - access("A1"."ID"="A2"."ID")

Remote SQL Information (identified by operation id):
      LIKE '1%' (accessing '!')

Notice how the top line (id 0) includes the keyword REMOTE – this tells you that this is the plan from the viewpoint of the remote database/instance that will be executing it. Remember that from its viewpoint the database that we think is the local database is one that it thinks is remote – hence the REMOTE operation 2 which is addressing (our) local table DIST_HOME.

Other key points to note are the appearance of the Inst (instance) and IN-OUT columns. These tell you where each table is located – when a query executes remotely “our” database is tagged only by the name “!”.

A nice feature of the execution plan for a distributed query is that you can see how the query has been decomposed for execution at the remote site. In this case the other database will be sending our database the query at operation 2 to pull the rows it wants from small_vc so that it can do the join at its site and send the result back to us.

The thing you generally don’t want to see in more complex distributed queries is a separate query being generated for each remote table involved in the join – tables that live remotely should be joined remotely with just the join result being pulled back to the local database.

There is a special warning that goes with this hint – it isn’t valid for the select statements in “create as select” and “insert as select”. There seems to be no good reason for this limitation, but for CTAS and “insert as select” the query has to operate at the site of the table that is receiving the data. This means that you may be able to tune a naked SELECT to perform very well and then find that you can’t get the CTAS to use the same execution plan. A typical workaround to this problem is to wrap the select statement into a pipelined function and do a select from table(pipelined_function).


There are literally hundreds of hints available but, as a general guideline, there are only a few that are particularly useful and strategically sound. In this article I’ve listed the five hints that I’ve long considered to be the ones that are of most help and least risk. I have mentioned a couple of other hints in passing, and know that there are a couple of hints in the newer versions of Oracle that should eventually be added to the list; but the five I’ve mentioned give a sound basis to work from in understanding the benefits of using hints that shape the optimizer’s strategy for a query without trying to micro-manage it.

November 30, 2015


Filed under: Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 6:03 am BST Nov 30,2015

This is the text of the “whitepaper” I submitted to DOAG for my presentation on “Core Strategies for Troubleshooting”.


In an ideal world, everyone who had to handle performance problems would have access to ASH and the AWR through a graphic interface – but even with these tools you still have to pick the right approach, recognise the correct targets, and acquire information at the boundary that tells you why you have a performance problem and the ways in which you should be addressing it.

There are only three ways you can waste resources on an Oracle system, and only three different types of activity that need to be investigated. If you don’t appreciate that this is the case then you can waste a lot of time following the wrong strategy and attempting to solve the wrong problems. Once you have a clear statement of what you are trying to achieve it becomes much easier to achieve those aims.

Three ways to cause problems

There really are only three symptoms you need to look out for in the database

  • You’re doing it the hard way
  • You’re doing it too often
  • You’re queueing / locking

Inevitably you can see that there is going to be some overlap between the three (and it would be easy to argue that the third is simply a side effect of the other two). If you are executing a single statement “the hard way” you’re likely to be doing single block reads or buffer gets too often. If you’re executing a very lightweight statement too often it’s probably a step in a process that is attempting get a job done “the hard way”, and it’s probably causing (and getting stuck in) queues for latches and mutexes. I’ve included queueing as the third option because the simple action of locking data (deliberately, or by accident) doesn’t fall neatly into the other two.

Another way of looking at this list is to reduce it to just two items with a slightly different flavour: when you spend too much time on a task it’s because you’re either doing too much work or you’re not being allowed to work.

Three classes of problems to solve

An important aspect of performance problems is the high-level classification. Labelling the class of problem properly points you to the correct strategy for investigating the problem. Again there are only three possibilities in the list, which I’ll express as typical complaints:

  • My report is taking too long to run / screen is taking to long to refresh
  • The batch job over-ran the SLA last night
  • The “system” is slow

What’s the most significant difference between three classes of complaint ?

  • “My XXX takes too long”: XXX is a repeatable event that can be isolated and watched – just do it again, and again, and again, and again while I watch every step of what’s going on.
  • The batch job over-ran last night: It’s not a repeatable process, so you’ve got to infer what the problem was from historical evidence; it’s (usually) made up of a number of concurrent processes, which may interfere with each other to varying degrees depending on when their relative start and finish times were.
  • The “system” is slow: possibly all the time, possibly intermittently – if there’s no specific complaint then the only option is to keep an eye open for resource-intensive activity to see if you can reduce the resource usage of specific individual tasks (for which read SQL or PL/SQL statements) or reduce the number of times that those tasks are executed.

The common source

Despite the fact that we can classify performance problems in three ways, it’s worth remembering that ALL the information we might use to drive our diagnosis comes from one place – the Oracle database. At any moment we have sessions that are active, operating statements (cursors), and using a resource such as a file, a latch, a buffer, and so on. We could almost represent each moment by a cube with sessions along one side, cursors along another, and resources along the third – the cube changes moment by moment, allowing us to visualise time as the fourth dimension in a hypercube of activity.

Instant by instant the Oracle kernel code knows which session is using what resource to operate which cursor and, although the total volume of all that information is far more than could reasonably be recorded, Oracle has many different ways of slicing, dicing and capturing parts of that hypercube – running totals, rolling aggregates, snapshots by session, by cursor, by resource and so on – that are made visible as the dynamic performance views (v$ objects). Trouble-shooting is largely a case of deciding which dynamic performance views are the most appropriate to use for our three classes of task.

Active Session History

Before reviewing the three classes, it’s worth breaking off for a moment to say a few things about one of the most important and useful views that we have of the wealth of information available; this is the active session history (v$active_session_history / ASH) which Oracle uses to capture a snapshot once every second of what each currently active session is doing; every 10th snapshot is then echoed down into the Automatic Workload Repository (AWR) by a process that runs every hour to copy the dynamic view to a table that can be accessed through the database view dba_hist_active_sess_history.

The capture rates can all be adjusted: I have never seen anyone change from one snapshot per second, or every 10th snapshot in a production system, but I’ve often seen the dump to the AWR taken every 30 minutes, occasionally 20 or even 15 minutes. On occasion I have asked clients to do a CTAS (create table as select – nologging) to capture the whole of the v$active_session_history to a table that can be exported and reviewed at a later date. Oracle tries to keep a minimum of at least an hour’s ASH in memory but, with a large enough SGA, you may find that this will stretch out to 3 or 4 hours.

Unfortunately although (or perhaps because) ASH and its AWR history are extremely helpful, you have to pay extra licence fees to use the information, and the technology can only be licensed with the Enterprise Edition of Oracle. (Which is why there are some ASH emulators on the Internet)

My report is slow

The special feature of someone complaining about a specific task is that it’s likely to be repeatable – so we can run it again and again and watch every single detail to see where the time goes. Our slice through the hypercube could take a single session over a period of time and report every action along that path. This, of course, is the 10046 – a.k.a extended SQL trace event. We can enable it in many ways, perhaps through a logon trigger, perhaps through a call to dbms_monitor:

                session_id => &m_sid, 
                serial_num => &m_serial,
                waits      => true, 
                bind       => true, 
                plan_stat  => 'all_executions'

In this example I’ve request all wait states and bind variable to be dumped into the trace file, I’ve also requested that the execution plan (with rowsource execution stats) be dumped for every single execution of every single statement. Sometimes a problem arises because a particular set of bind variables represents a special case that causes a “reasonable” plan to behave very badly. If we’re going to look closely we may as well get as much detail as possible.

The entire “trace” interface was upgraded dramatically in 11g, and one of the useful variants on this theme is particularly relevant to a commonly used Web-based implementation. If you know that a specific screen task corresponds to a particular PL/SQL package you can enable tracing of a cursor (across the system, if necessary) by SQL_ID. So, for example, you might issue the following two commands, with a couple of minutes gap between the two:

alter system
        set events '

-- wait a few minutes

alter system
        set events 'sql_trace[SQL:1wthpj7as7urp] off'

Every time the statement with SQL_ID =‘1wthpj7as7urp’ is executed, the session executing it will start adding information to the session trace file, and when the statement ends the tracing will end. This is particularly nice if the “statement” is a top-level call to a PL/SQL procedure because all the SQL inside the procedure will be traced as the package executes.

For a highly focused, highly repeatable task, the 10046 trace event is almost always all you need to do.

The batch over-ran

The big difference between this case and the previous one is that “the batch” is not something you can simply repeat and watch. Moreover, “the batch” is likely to be a large number of separate sections of code that are scheduled to run with a fairly fluid timetable that can result in changes from day to day (or, more likely, night to night) in the set of jobs that might be running concurrently. This means that even if you could re-run the batch job (perhaps on the previous night’s backup) you might not see the same problem appear because a small change in timing could result in a large change in contention).

One of the most important steps of dealing with the batch is pre-emptive: instrument your code and make it possible to compare the run on one night with the run on another. At the very least you need to have something capturing the start and end times of each “significant component” of the batch so you can quickly answer questions like: “which jobs took much longer than usual”, “which job was the first job that took longer than usual”, “which jobs were running concurrently with job X last night when they never usually overlap?”

Ideally you should have much more information than this about each job – basically a report from Oracle which says “how much work did I do, how much time did I spend”: for a session this is simply a report of v$mystat or v$sesstat (joined to v$statname) and v$session_event for the session (v$mystat is a less well- known view that is v$sesstat restricted to “my” session) ; if you classify each job as “connect to the database, do something, disconnect” then this critical log is simply a pair of select statements spooled out somewhere convenient, or written to the database; if you want to break a single connection into significant pieces then a simple pl/sql procedure could read the statistics into a pl/sql array as the piece starts, then re-read the stats and calculate the differences as the piece ends.

Knowing where the time went, and knowing how tasks have behaved differently from previous runs is a big step forward to identifying the problem.

If you don’t have the instrumentation you need then the AWR (if you’re licensed) or Statspack (if you’re not licensed) is a step in the right direction. Apart from the typical hourly snapshots and reports you can take a snapshot as the first and last steps of the batch so that you’ve got “the whole batch” in a single AWR/Statspack report. If you’ve got that you can then do comparisons for things like:

  • Which event consumed much more time than usual
  • Which SQL took much more time than usual
  • Which segment(s) saw much more activity than usual
  • Was there some unusual enqueue activity
  • Can we see some unusual outliers in the event histograms
  • Can we see some unusual memory demands in the pga histogram

Although system-wide summaries rarely point us at exact causes, they can often give us strong clues of areas (and times) where problem originated.

In this respect the “Top Activity” screen from Enterprise Manager (Grid Control / Cloud Control) can be very helpful as it produces a nice graphical presentation of “working”. Where, in the picture of last night’s activity, does the graph start to ramp up, and what colour is the bit that’s growing, and how does that picture compare to the same picture the previous night. (Having two windows open with two different nights makes it wonderfully easy to switch between displays and spot the differences.) Since the “historical” top activity screen is created from the dba_hist_active_sess_history, which contains about 100 different details per session of each captured moment, it’s very easy to drill though the spikes to answer questions like: “which object”, “what event”, “which SQL”, “what was the execution plan”, “how much work did that take”, to follow the chain of time back to the cause.

The system is slow

If no-one is going to tell you about specific tasks, and if you don’t have any sort of boundary that allows you to focus on tasks or time-ranges, then the simplest thing to do is look for anything expensive (i.e. time-consuming) and see if you can make it cheaper.

Again, the graphic “Top Activity” screen is very helpful, and I often tell people to arrange to have a system that shows the top activity screens for the most important 2 or 3 databases on a large screen on the wall where any passing DBA might notice a brief spike in workload. There are systems that can be improved by constant monitoring – so long as the monitoring doesn’t take out 100% of an individual’s time but is driven as an informal glance at a picture.

If you’re not licensed to take advantage of the AWR then Statspack can help – but with the smallest time interval (though 15 minutes is as low as I’ve ever gone) between snapshots so that “anomlies” that are short-lived don’t fall out of memory before they can be captured.

An important feature of reading Statspack is that you need to check for missing information – if the headline figure for physical reads is 25M but the “SQL ordered by reads” is 12M then you know that there must be 13M reads that didn’t get captured in the report and that might be the 13M that is causing the problem. Similarly if the “Segments by physical reads” reports 16M reads that’s 4M more than the SQL – but is the 12M a subset of the 16M, or is there only a 3M overlap between the two figures so that between them the 12M and 16M cover the entire 25M. There’s more information in the Statspack report than immediately meets the eye, and a careful drilldown into other areas of the report (typically the Instance Activity) may be needed to clarify the meaning of what you’re seeing as a headline figure.

The other thing you can do with “the slow system” when you don’t have ASH to help is take snapshots (or get some freeware to do the same sort of thing). If the system is slow “right now” you could, for example, take a snapshot of v$sess_io (session I/O), wait 30 seconds then take another snapshot, find the difference and see who is doing most of the I/O work – then chase that session; or take snapshots of v$sesstat limited to (say) statistics like “%redo%” and find out who is generating lots of redo.

Oracle allows you to take this approach back into recent history – there are a number of “metric” views which give you thing like the rolling average, min, and max I/O volumes for the last few intervals of 5 minutes or 1 minute each – telling you, in other words, whether there were any interesting bursts of extreme activity in the recent past. For example, a query against v$sysmetric_summary might give you an output like the following:

------------------------ ------------- ----------- ------------ -----------------------
Physical Reads Per Sec        1,618.95      105.92       358.16 Reads Per Second
Physical Reads Per Txn       97,202.00    5,539.19    20,811.56 Reads Per Txn
Redo Generated Per Sec    6,773,108.94  218,132.86 1,023,458.57 Bytes Per Second
User Calls Per Txn              395.00       43.39        79.85 Calls Per Txn
Total Parse Count Per Sec        31.14        1.88         4.25 Parses Per Second
Host CPU Utilization (%)         64.51        3.93         9.07 % Busy/(Idle+Busy)
Database Time Per Sec            82.96        6.65        15.37 CentiSeconds Per Second
I/O Megabytes per Second         35.58        2.62         5.73 Megabtyes per Second

This summarises the last 12 intervals of 5 minutes. If we look at “Physical Reads per Txn” we can see that there were some extreme swings in activity over that period, so we could drill down into v$sysmetric_history for “Physical Reads per txn”, looking at the 1 minute granularity and see:

METRIC_UNIT                    BEGIN_TIME                VALUE
Physical Reads Per Txn         05-feb 12:45:55          421.00
                               05-feb 12:44:55          477.00
                               05-feb 12:43:55          351.00
                               05-feb 12:42:55          406.84
                               05-feb 12:41:55        1,550.00
                               05-feb 12:40:55       93,984.00
                               05-feb 12:39:55       97,202.00
                               05-feb 12:38:55       93,323.00
                               05-feb 12:37:55          391.00
                               05-feb 12:36:55          504.00
                               05-feb 12:35:55          504.00
                               05-feb 12:34:55          252.00

Yes, a few minutes ago something exploded onto the system doing a huge amount of I/O for about 3 minutes. If we’re lucky we might now drill into the v$sesstat, or v$sess_io, or v$session_event to see if we can find a session that is responsible for a large amount of I/O; and then check v$open_cursor to see if it still has some open cursors that might (if we check v$sql) show us what caused the I/O.

When there are no specific complaints, we just keep an eye open for spikes in activity and try track them down as quickly and cheaply as possible to see if they’re worth addressing.


Oracle gives you a huge amount of information about the work that’s going on and the time that’s being used in the database. Unfortunately the most useful repository of that information is in a dynamic performance view that can only be viewed in the Enterprise Edition after purchasing additional licences. However, the information is summarised, in many different ways in literally hundreds of other dynamic performance views, and it’s easy to pick out helpful information from those views in a variety of ways.

Key to making the best use of those views, though, is recognising that different classes of performance problems require different strategies – and there are only three different classes of problems to worry about.

November 25, 2015

Tech 15

Filed under: Uncategorized — Jonathan Lewis @ 1:00 pm BST Nov 25,2015

Updated 27th Nov 2015:

Thanks for all the questions so far – there are plenty of useful topics coming out. At this rate I may have to pass some of these on to the round-table that Tony Hasler.

I’ve made a couple of comments in response, but generally I plan to avoid making any comments until after the panel.

Monday 7th Dec: 11:20 – 12:05

I’ve arranged a panel session on the Cost Based Optimizer for UKOUG Tech 2015, with Maria Colgan, Nigel Bayliss, and Chris Antognini joining me to answer (and maybe argue about) questions from the audience.

To keep things moving along, we aim to have a few questions available before the day and collect a few questions at the door as well as accepting questions from the floor. Martin Widlake and Neil Chandler will be acting as MCs, wielding the microphones, and collecting questions at the door.

So far we have a handful of questions – if you have a question you’d like to raise, add it to the comments below. Even if you can’t attend we may get to your question, and I may be able to write up a summary of questions and answers after the event.

Questions so far:

  • Why do queries that have been working fine for months suddenly change to a very slow plan ?
  • Are smaller, simpler queries less likely to have a bad plan or change from a good plan to a bad one ?
  • What is the down-side of using SQL Profiles to lock in plans ?
  • What’s the best way to collect stats on a partitioned table where you add a daily / weekly / monthly partition ?
  • Is there a good way to make sure you’ve collected stats for interval partitions when you don’t know they’ve been added ?
  • Why do the manuals talk about “prefixed” local indexes (on partitioned tables) as if they have some special effect ?
  • Why does a query with literals run faster than one with bind variables? Stats are up to date on all tables concerned.
  • For a single table an SQL Plan directive will create extended statistics, but how does the Optimizer resolve SPD for the group by or join queries ?
  • I’ve heard that for the group by and join queries, SPD always forces dynamic sampling, is this true ?
  • Will SPD with dynamic sampling take precedence over e.g. columns with histograms?
  • What is the order of precedence the optimizer uses when resolving table cardinality?
  • Will 12.2 allow us to combine column groups and expression as a single extended statistic?

May 8, 2015


Filed under: Uncategorized — Jonathan Lewis @ 8:21 am BST May 8,2015

I’ve just been checking “Cost Based Oracle – Fundamentals” (Apress 2005) to see what I said on a particular topic, and I couldn’t resist quoting the following from the opening page of Chapter 1:

One of the commonest questions about the CBO on the Internet is: “What does the cost represent?” This is usually followed by comments like: “According to explain plan the cost of doing a hash join for this query is seven million and the cost of a nested loop is forty-two – but the hash join completes in three seconds and the nested loop takes 14 hours.”

The answer is simple: the cost represents (and has always represented) the optimizer’s best estimate of the time it will take to execute the statement. But how can this be true when people can see oddities like the hash join / nested loop join example above? The answer can usually be found in that good old acronym GIGO: Garbage In, Garbage Out.

The CBO makes errors for six main reasons:

  • There are some inappropriate assumptions built into the cost model.
  • The relevant statistics about the data distribution are available, but misleading
  • The relevant statistics about the data distribution are not available
  • The performance characteristics of the hardware are not known
  • The current workload is not known
  • There are bugs in the code

Still true – although there are more features and cunning bits where inappropriate assumptions and bugs can appear.



January 12, 2015

FBI Bug reprise

Filed under: Execution plans,Oracle,Troubleshooting,Uncategorized — Jonathan Lewis @ 12:52 pm BST Jan 12,2015

I’ve just had cause to resurrect a blog note I wrote three years ago. The note says that an anomaly I discovered in wasfixed in – and this is true for the simple example in the posting; but a recent question on the OTN database forum has shown that the bug still appears in more complex cases.  Here’s some code to create a table and two indexes:

create table t1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
	rownum			id,
	cast(dbms_random.string('U',2) as char(2))	c1,
	cast(dbms_random.string('U',2) as char(2))	c2,
	cast(dbms_random.string('U',2) as char(2))	c3,
	cast(dbms_random.string('U',2) as char(2))	c4,
	rpad('x',100)		padding
	generator	v1,
	generator	v2
	rownum <= 1e5

		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'

create index t1_iasc  on t1(c1, c2,      c3, c4) nologging;
create index t1_idesc on t1(c1, c2 desc, c3, c4) nologging;

I’ve designed the table to model the way a problem was presented on OTN, it’s possible that the anomaly would appear in simpler circumstance; note that I have two indexes on the same four columns, but the second column of one of the indexes is declared as descending. To identify the indexes easily in execution plans the latter index has the text “desc” in its name. So here’s a query, with execution plan, that should use one of these indexes:

from t1
        (C1 = 'DE' and C2 >  'AB')
or      (C1 = 'DE' and C2 >= 'AB' and C3 > 'AA' )
or      (C1 = 'DE' and C2 >= 'AB' and C3 >= 'AA' and C4 >= 'BB')
order by
        C1, C2, C3, C4

| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT            |         |       |       |     4 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |    21 |  2478 |     4  (25)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |    21 |       |     3  (34)| 00:00:01 |

Predicate Information (identified by operation id):
   2 - access("C1"='DE')
       filter(((SYS_OP_DESCEND("C2")<SYS_OP_DESCEND('AB') AND "C2">'AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C3">'AA' AND "C2">='AB') OR
              (SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C4">='BB' AND "C2">='AB' AND

Sure enough – the query has used the t1_iasc index – but why has the optimizer introduced all those predicates with the sys_op_descend() function calls in them when we’re not using an index with a descending column ? Somewhere in the code path the optimizer has picked up the other index, and decided to use some information from it that is totally redundant. One side effect of this is that the cardinality prediction is 21 – if I drop the index t1_idesc the sys_op_descend() calls disappear and the cardinality changes to 148.

Oracle 12c behaves differently – it uses concatenation to turn the query into three separate index probes unless I add the hint /*+ first_rows */ (which I shouldn’t be using, but it’s the hint supplied by the OP on OTN). With the hint in place we get an example of the actual execution plan differing from the prediction made through explain plan:

12c execution plan unhinted (concatenation, and lots of sys_op_descend()):

| Id  | Operation                             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT                      |          |       |       |    12 (100)|          |
|   1 |  SORT ORDER BY                        |          |   149 | 17582 |    12   (9)| 00:00:01 |
|   2 |   CONCATENATION                       |          |       |       |            |          |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |   142 | 16756 |     4   (0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN                  | T1_IDESC |     1 |       |     2   (0)| 00:00:01 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     6 |   708 |     4   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN                  | T1_IDESC |     1 |       |     2   (0)| 00:00:01 |
|   7 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1       |     1 |   118 |     3   (0)| 00:00:01 |
|*  8 |     INDEX RANGE SCAN                  | T1_IASC  |     1 |       |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):
   4 - access("C1"='DE' AND "C3">='AA' AND "C4">='BB' AND
       filter("C4">='BB' AND SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB' AND "C3">='AA')
   6 - access("C1"='DE' AND "C3">'AA' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB'))
       filter("C3">'AA' AND SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB' AND
              (LNNVL("C4">='BB') OR LNNVL(SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB')) OR
              LNNVL(SYS_OP_UNDESCEND(SYS_OP_DESCEND("C2"))>='AB') OR LNNVL("C3">='AA')))
   8 - access("C1"='DE' AND "C2">'AB' AND "C2" IS NOT NULL)
              LNNVL("C2">='AB')) AND (LNNVL("C4">='BB') OR

12c Execution plan with first_rows hint (and the sys_op_descend have gone)
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT            |         |       |       |   150 (100)|          |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   148 | 17464 |   150   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |   148 |       |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):
   2 - access("C1"='DE')
       filter(("C2">'AB' OR ("C3">'AA' AND "C2">='AB') OR ("C4">='BB' AND
              "C2">='AB' AND "C3">='AA')))

12c Execution plan with first_rows according to Explain Plan (and they're back again)
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT            |         |   148 | 17464 |   150   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1      |   148 | 17464 |   150   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | T1_IASC |   148 |       |     2   (0)| 00:00:01 |

Predicate Information (identified by operation id):
   2 - access("C1"='DE')
       filter("C2">'AB' AND SYS_OP_DESCEND("C2")<SYS_OP_DESCEND('AB') OR
              "C3">'AA' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C2">='AB' OR
              "C4">='BB' AND SYS_OP_DESCEND("C2")<=SYS_OP_DESCEND('AB') AND "C2">='AB' AND

On the plus side the upgrade to 12c has removed some of the sys_op_descend() appearances and seems to have fixed the cardinality anomaly when the sys_op_descend() calls do appear – but things are still going wrong in odd ways. (And this looks like another case where you could drop an index that you’re not using and seeing an execution plan change.)


When playing around with 12c, I had to be very careful to avoid confusing the issue as SQL Plan Directives were appearing from time to time, and giving me unexpected results as I repeated some of the experiments.

Footnote 2:

I haven’t actually addressed the question originally posed on OTN, but that can wait for another day.

Blog at


Get every new post delivered to your Inbox.

Join 6,669 other followers