Oracle Scratchpad

October 30, 2018

Index Splits – 2

Filed under: Indexing,Infrastructure,Oracle — Jonathan Lewis @ 1:29 pm GMT Oct 30,2018

In yesterday’s article I described the mechanism that Oracle uses for an index leaf block split when you try to insert a new entry into a leaf block that is already full, and I demonstrated that the “50-50” split and the “90-10” split work in the same way, namely:

  • save the old block into the undo
  • prepare a new leaf block
  • “share” the data between the old and new leaf blocks
  • sort out pointers
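In outline the steps can be modelled like this – a toy Python sketch of the sharing logic only, not Oracle’s actual code: block images, undo records and pointer maintenance are all reduced to simple list operations.

```python
# Toy model of an index leaf block split: the 50-50 and 90-10 cases differ
# only in where the split point falls.  (Illustration only -- not Oracle code.)

def split_leaf(entries, new_entry, split_90_10):
    """Return (old_block, new_block) after new_entry forces a split."""
    undo_image = list(entries)        # step 1: save the old block "into the undo"
    if split_90_10:
        # 90-10: the existing entries all stay put; only the new high-value
        # entry goes into the freshly prepared leaf block (step 2).
        left, right = list(entries), [new_entry]
    else:
        # 50-50: "share" the data roughly evenly between old and new (step 3).
        combined = sorted(entries + [new_entry])
        mid = len(combined) // 2
        left, right = combined[:mid], combined[mid:]
    # step 4 ("sort out pointers" -- branch entries, sibling links) is omitted.
    return left, right

# A 90-10 split: 145 existing entries stay, the new high value goes right.
old, new = split_leaf(list(range(1, 146)), 146, split_90_10=True)
```

The point the model makes is that in the 90-10 case the “sharing” leaves the old block’s content apparently untouched – which is exactly what makes the full save-and-rewrite described below look surprising.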

The obvious question to ask about this process is: “Why does Oracle save and rewrite the whole of the old leaf block during a 90-10 split when the data in the block doesn’t appear to change?” The “sharing” in the 90-10 split is most uneven: it appears that Oracle simply attaches a new leaf block to the index structure and writes the new index entry into it, leaving the existing index entries unchanged in the current leaf block.

The answer to that question can be found by doing block dumps – except you won’t see the answer if you use my original test data. So here’s a follow-on script to the previous test (written 11 years after the previous script):


rem
rem     Script:         index_splits3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2018
rem     Purpose:
rem

drop table t2 purge;

create table t2 as select * from t1 where id <= 148;
alter table t2 add constraint t2_pk primary key(id, idx_pad) using index pctfree 0;

column object_id new_value m_index_id
select object_id from user_objects where object_name = 'T2_PK' and object_type = 'INDEX';

begin
        for r in (select * from t1 where id between 149 and 292 order by dbms_random.value) loop
                insert into t2 values(r.id, r.idx_pad, r.padding);
        end loop;
        commit;
end;
/

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;

prompt  check the trace file for the last block of the index
prompt  then do a block dump for it.
prompt  then press return

pause

insert into t2 values(293, rpad('x',40,'x'), rpad('y',50,'y'));
commit;

alter session set events 'immediate trace name treedump level &m_index_id';
alter system flush buffer_cache;


This test depends on the number of rows used for the previous test – and I have four hard-coded values (148, 149, 292, 293) in it that matter. If you’ve had to use a different number of rows in your version of the first test you will need to adjust these values to match.

I’ve created a clone of the t1 table copying only the first 148 rows – this is just enough rows that when I create a unique (PK) index on the table the index will have two leaf blocks, the first holding 147 entries and the second holding one entry. I’ve then inserted the next 144 rows from t1 into t2 in random order, so that I end up with two full leaf blocks.

Once the data is ready the code issues a treedump command (so that we can check the index is as I’ve described it) and flushes the buffer_cache, then prompts you with some instructions and waits for you to press return. At this point you need some manual intervention from another session – you can examine the treedump to work out the file and block addresses of the two leaf blocks and dump the second leaf block (‘alter system dump datafile N block M;’).

After you’ve done the block dump press return and my code resumes and inserts a new row that will cause a 90-10 split to happen, then it does another treedump (to let you check the block addresses and see that the split was 90-10), and flushes the buffer cache again. This is where you can check the block address of the second leaf block (in case it has managed to change – which it shouldn’t) and dump the block again.

Here, with a huge chunk removed from the middle, are the results I got from searching for the expression “row#” in the two block dumps that I generated in my test.


Before 90/10 block split:
-------------------------
row#0[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[1885] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2b
row#2[1938] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 2a
row#3[5595] flag: -------, lock: 2, len=53, data:(6):  01 40 01 f9 00 2c
row#4[3581] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[1408] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 34
row#143[2150] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 26
row#144[878] flag: -------, lock: 2, len=53, data:(6):  01 40 01 fd 00 3e


After 90/10 block split
-----------------------
row#0[348] flag: -------, lock: 0, len=53, data:(6):  01 40 01 84 00 03
row#1[401] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2b
row#2[454] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 2a
row#3[507] flag: -------, lock: 0, len=53, data:(6):  01 40 01 f9 00 2c
row#4[560] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 0b
...
row#142[7873] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 34
row#143[7926] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 26
row#144[7979] flag: -------, lock: 0, len=53, data:(6):  01 40 01 fd 00 3e

The “row#” is in ascending order – these lines in an index leaf block dump show Oracle walking through the block’s “row directory”; the number in square brackets following the row number is the offset into the block where the corresponding index entry will be found. When Oracle inserts an index entry into a leaf block it adjusts the row directory to make a gap in the right place so that walking the directory in row# order allows Oracle to jump around the block and find the index entries in key order.
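A toy model (hypothetical, not Oracle’s real block format) makes the effect visible: entries are heaped into the block in arrival order while the directory is kept in key order, so the directory’s offsets jump around just like the “before” dump above.

```python
# Toy model of a leaf block's row directory (not Oracle's real layout):
# index entries land in the block in arrival order, while the row
# directory is maintained in key order.
import bisect

class LeafBlock:
    def __init__(self):
        self.heap = []        # entries in physical (arrival) order
        self.directory = []   # (key, offset) pairs, kept sorted by key

    def insert(self, key):
        offset = len(self.heap)            # next free slot in the block
        self.heap.append(key)
        bisect.insort(self.directory, (key, offset))

blk = LeafBlock()
for key in [148, 42, 99, 7, 120]:          # keys arriving in random order
    blk.insert(key)

keys    = [k for k, _ in blk.directory]    # ascending: 7, 42, 99, 120, 148
offsets = [o for _, o in blk.directory]    # scattered: 3, 1, 2, 4, 0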

When Oracle rewrites the block it first sorts the index entries into key order, so the actual index entries are written into the block in key order and a range scan that moves a pointer smoothly through the row directory will also be moving a second pointer smoothly down the block rather than jumping all over the place. Presumably this has (or maybe had) a benefit as far as the CPU cache and cache lines are concerned.

So there is a method in the madness of “copy the whole block even when the content doesn’t change”. The content doesn’t change but the order does, and paying the cost of sorting once may return a benefit in efficiency many times in the future.
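As a toy sketch (hypothetical, not Oracle code): starting from a directory whose offsets have been scattered by random-order inserts, the rewrite copies the entries out in key order and renumbers the offsets, after which walking the directory walks the block sequentially – just as the “after” dump shows.

```python
# Toy sketch of the block rewrite: once the entries are copied out in key
# order, directory order and physical order coincide.
def rewrite_block(directory, heap):
    """Rebuild the heap so that physical order matches key order."""
    new_heap = [heap[off] for _, off in directory]         # entries in key order
    new_directory = [(key, i) for i, (key, _) in enumerate(directory)]
    return new_directory, new_heap

# State as left by random-order inserts: key order != physical order.
directory = [(7, 3), (42, 1), (99, 2), (120, 4), (148, 0)]
heap      = [148, 42, 99, 7, 120]

new_directory, new_heap = rewrite_block(directory, heap)
offsets = [off for _, off in new_directory]   # now monotonic: 0, 1, 2, 3, 4
```

One pass of sorting at split time buys sequential access for every subsequent range scan through the block.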

 

15 Comments »

  1. Hi,

    Makes sense. So an index block is like a tiny heap table with an index of its own and Oracle uses the opportunity of the split to improve the clustering factor ;).

    Although memcpy is fairly cheap (and would only need to be done on the smaller part of the directory anyway), Oracle could keep the block entries sorted with each insert too.

    Also : I always assumed that block reorganizing is done at the drop of a hat and doesn’t need to be protected with undo since it doesn’t change the content? So Oracle could conceivably rewrite the block (10.8) without saving the unsorted version (10.9). But only for the 90:10 (100:0) version I guess so there’s that.

    I would still like to know what the comments for

    10.9 (kdxair) apply XAT do to ITL 1 — related to leaf block split, Apply Itl Redo – KDICAIR, KTB-R, sets a flag (-B–) in the ITL of a current leaf block)

    mean or why the undo (on Julians page) has two ITL-entries.

    Your first dump above shows locks on the rows. Is that evidence of delayed block cleanout (not necessarily related to these phenomena)? If so it would be interesting how it managed to survive with such tiny DMLs. Maybe the old “visit up to 10% of the SGA for immediate cleanup” has been downgraded to 0% a long while ago? It couldn’t have been meaningful since SGAs reached double digit GBs and didn’t seem to be adjustable.

    The first row is not locked as that was from the first 147+1 insert. That is 148 (smaller than all the random inserts) yet it has a high offset? I vaguely remember that free inserts fill the block from top to bottom. After the reorg it’s bottom to top.
    Also maybe by doing this with the 50:50 split you can find a hint for why it isn’t even (78 / 68)? Maybe if 90:10 is really 100:0 then 50:50 is really 55:45 (but why should it be)?

    regards,

    Comment by Racer I. — November 2, 2018 @ 7:30 am GMT Nov 2,2018 | Reply

    • Racer I.

      Bear in mind that all I’m going on is what happens when I do various experiments – I don’t have any sources of information that tell me that I’m right (or wrong).
      The thing about memcpy and keeping the index entries in order is that you couldn’t just memcpy the index entries; you would have to update every row directory entry below the insertion point to make the directory consistent with the memcpy’d entries.

      I’m strongly inclined to agree with you about there being no need to copy the whole block for the 90-10 split – after all, when Oracle has to tidy out deleted entries from an index leaf to make room for a new insertion it sorts the remaining rows without writing the unsorted block to redo; although there may be some reason that appears if the sys-recursive transaction to do the split has to rollback. (Maybe it’s simply to minimise risk by re-using a single code path). I know I convinced myself at some point that it was necessary, but right now I can’t think why.

      Here’s the whole 10.9 change vector from 11.2.0.4:

      
      CHANGE #3 TYP:0 CLS:1 AFN:5 DBA:0x0140008f OBJ:350329 SCN:0x0b86.0cffcb89 SEQ:2 OP:10.9 ENC:0 RBL:0
      index redo (kdxair): apply xat do to itl 1 (count=2)
      KTB Redo
      op: 0x05  ver: 0x01
      compat bit: 4 (post-11) padding: 1
      op: R  itc: 1
       Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
      0x01   0x001c.002.00009cf6  0x00c1675a.1d23.01  -B--    1  fsc 0x0000.00000000
      
      

      There is only one ITL changed, and it’s the index block’s “service ITL” – i.e. the one that isn’t about user data and exists only for handling sys-recursive transactions. Possibly the (count=2) is simply telling us how many ITL slots currently exist in the block.

      Your question about the “lock: 2” in the dump has brought to my attention something I hadn’t noticed. I’ve got a commit inside the PL/SQL loop and it’s being ignored until the anonymous block ends. If I raise an application error partway through the loop none of the rows previously inserted and “commit”ed in the loop are actually visible. This means all the rows inserted in the loop are inserted under the same transaction ID. (The session stats for the loop show one “user commits” and only a handful of “enqueue requests”.) My error – what this should have told me was that I’d taken the results from a test where the commit was outside the loop.

      The fact that the locks are all there in the pre-dump is a mark of Oracle being constructively lazy. In this example Oracle has done a couple of “commit cleanouts”, which means the session has updated the ITL entry but hasn’t bothered to set all the row lock bytes back to zero – some other session will do that if they need to.

      The basic pattern (though direct path / CTAS etc. are different) is to heap the data from the bottom of the block upwards while growing the directory downwards until they meet in the middle. So if we insert the data in order one row at a time the offsets would be in descending order.

      I’m sure that with enough time spent on carefully crafted experiments it would be possible to find out what the 50/50 algorithm was – but I don’t want to spend that time. One detail that Steve Adams noted many years ago was that Oracle had some leeway in picking the split point to minimise the size of the entry in the branch pointing at the leaf block – which would add to the complication of trying to work out what 50/50 really meant.

      Comment by Jonathan Lewis — November 2, 2018 @ 9:41 am GMT Nov 2,2018 | Reply

  2. Hi,

    Such a rabbit hole :)

    Fair point about keeping the row directory pointers updated.

    New knowledge for me : sys-internal / service ITLs.

    With the two ITL-entries I meant the undo (not redo) vector from Julian :

    CHANGE #2 TYP:1 CLS:34 AFN:2 DBA:0x008002fd OBJ:4294967295 SCN:0x0000.00181068 SEQ:  1 OP:5.1
    ktudb redo: siz: 8148 spc: 8040 flg: 0x000a seq: 0x0193 rec: 0x01
                xid:  0x0009.020.000001e1
    ktubu redo: slt: 32 rci: 0 opc: 10.21 objn: 53071 objd: 53071 tsn: 4
    Undo type:  Regular undo       Undo type:  Last buffer split:  No
    Tablespace Undo:  No
                 0x008002fc
    index general undo (branch) operations
    KTB Redo
    op: 0x05  ver: 0x01
    op: R  itc: 2
     Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
    0x01   0x0009.020.000001e1  0x008002fc.0193.01  ----    1  fsc 0x0000.00000000
    0x02   0x0009.000.000001e2  0x008002fb.0193.01  ----  363  fsc 0x0000.00000000
    Dump kdige : block dba :0x01001bbc, seghdr dba: 0x01001bbb
    restore block before image
    

    This undo should undo whatever 10.9 does. But it seems to be the undo for the later 10.8-reordering/overwriting. It seems to involve recording each entry separately or maybe Oracle just records that there were 363 rows. Presumably if this is rolled back it would recreate that block sorted not unsorted. Testing the rollback-results in the blocks would probably be interesting anyway.

    About the lock:2 : So Oracle does partial cleanouts (just the ITL). Presumably pointing to a cleaned out ITL-slot says : can be ignored/cleaned safely later.
    What created the second ITL slot here? I.e. what was the first?

    About the optimized commits : Is that an anonymous block feature or general for PL/SQL? What if I want to keep the successfully “processed” rows upon the error? Isn’t that the main reason for slow-by-slow (if you haven’t heard about or want to deal with SAVE EXCEPTIONS or ERROR logging tables)?

    regards,

    Comment by Racer I. — November 2, 2018 @ 11:54 am GMT Nov 2,2018 | Reply

    • Racer I.

      Most important point first – I made a mistake with the “commits being ignored”, which I’ve now highlighted in my previous comment. I’d done some tests with the commit outside the loop and must have copied the wrong trace file, and didn’t notice the coding change when I re-ran the test to check. Sorry.

      I should also have noticed that you said the undo block was from Julian’s website. His examples are from 10.2.0.x so it’s always possible (though rare, given the critical nature of redo) that there are some minor variations in output – but as you picked up from another blog post, the first ITL entry is always about block splits, the second one is the first end-user ITL.

      There are so many threads to follow that it’s hard to keep track of all the questions and answers. Two things to note: the 10.9 vector has only one ITL entry, which means it’s the undo for the root block (or, generically, a branch block – a branch doesn’t need “user” ITL slots); the undo is there in case the split fails. The other thing to note is that there appear to be two transactions in this chain of redo: OP:5.2 is a “modify undo header”, OP:5.4 is an “end transaction”, and we see two 5.4s. The first transaction operates the split, the second inserts the spare row.

      Comment by Jonathan Lewis — November 2, 2018 @ 8:28 pm GMT Nov 2,2018 | Reply

  3. Hi,

    https://jonathanlewis.wordpress.com/2009/07/23/index-quiz-2/

    Here we find two useful statements :
    – Initrans is ignored for indexes (because with splitting its never an issue)
    – Index blocks always keep one ITL slot reserved for splitting.

    It’s presumably the update from –U- (unused/reserved?) to -B– (in use for splitting?) that 10.9 does. And that explains why there are always 2 slots and why the locks are in slot 2 (user ITLs start with 2). I also assume there can/will never be more than one service (trans-)action per block, so one reserved slot suffices.

    regards,

    Comment by Racer I. — November 2, 2018 @ 12:36 pm GMT Nov 2,2018 | Reply

    • See previous reply:

      The 10.9 is the undo for the update to the root block. Otherwise your observations are basically correct. The B, combined with other flags, tells Oracle how to recognise which rows were committed or not at the time of the leaf block split – but off-hand I don’t remember the exact mechanisms.

      Comment by Jonathan Lewis — November 2, 2018 @ 8:31 pm GMT Nov 2,2018 | Reply

  4. Hi,

    Upgrade the rabbit hole to the Kola hole…

    regards,

    Comment by r — November 2, 2018 @ 12:42 pm GMT Nov 2,2018 | Reply

  5. Hi,

    Oops. That last comment was meant to go on :

    If many transactions have the block in their grip and then one needs to split, the 10.9 effort may come into its own.

    regards,

    Comment by Racer I. — November 2, 2018 @ 12:44 pm GMT Nov 2,2018 | Reply

  6. Hi,

    > The 10.9 is the undo for the update to the root block.

    That would really be unexpected. The root block only has two entries, why would the undo be 8k?
    For the 90:10 case I reformatted your 3rd record like this for me :

    3. apply XAT/ITL/REDO(?) -> full block Undo-Redo (expensive)
    3.1 0x00c00160 5.2 Get undo header
    3.2 0x00c04660 5.1 Update undo block (2., too big for the first?)
    3.3 0x0140008f 10.9 (kdxair) apply XAT do to ITL 1 — related to leaf block split, Apply Itl Redo – KDICAIR

    So 3.3. (the 10.9 redo) goes to 8f (the leaf block being split). I guessed (can’t see it directly) that 3.2 (5.1) is the big (8k) part and looks like Julians (i.e. undo for 3.3 AND the full pre-image of block 8f). What I hadn’t noticed before in Julians undo-example is that the two ITL-slots there are for different blocks (2fc, 2fb) and different from the redo before (2fd). That is confusing my picture.
    3.1 (5.2) seems to be undo for adding more undo blocks to the transaction

    > a branch doesn’t need “user” itl slots

    Filed for further consideration. Belongs to the larger concept I’m trying to collect info on : index structure changes are never rolled back. However that seems to apply only to true rollbacks. Consistent read blocks are probably true reconstructed old images. The 10.9 undo would be used for rewinding past splits.

    > the undo is there in case the split fails

    Also interesting. Since the split itself (10.8) has no undo this would be quite shaky without a fail-safe.

    Still nothing so far would seem to preclude the 90:10 split to be much cheaper. Just allocate a new block, put the new row in there and update the structure.

    regards,

    Comment by Racer I. — November 5, 2018 @ 11:00 am GMT Nov 5,2018 | Reply

    • Racer I.

      > The 10.9 is the undo for the update to the root block.

      Two mistakes in one short sentence.
      The first was a brain-typo, that was supposed to be “redo”, not “undo” – redo change vectors for undo are all in layer 5.
      The second was that it’s not the root block – it was the high-value leaf, updating the service ITL.

      If you want to dig into all the details, send me an email and I’ll reply with the whole log file dump for the split – I don’t have the time to pick the whole thing apart at present.
      If there are any redo opcodes missing from Julian’s article you may find them here: https://jonathanlewis.wordpress.com/2017/07/25/redo-op-codes/

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — November 5, 2018 @ 6:56 pm GMT Nov 5,2018 | Reply

  7. Hi,

    > when Oracle has to tidy out deleted entries from an index leaf to make room for a new insertion it sorts the remaining rows without writing the unsorted block to redo;

    NB : often used below : CDE = committed deleted entry

    Is this an established fact? During the weekend I was juggling some ideas. I could imagine there is no “compact index block” step per se. Maybe the logic for an insert is :
    1. can this revive a (unique)-CDE (i.e. update the ROWID and remove the Del-state) -> no split necessary
    2. does the new entry still fit -> no split necessary
    3. split the block (50:50 inside / 90:10 at the right edge or if the new entry is too big)
    3a) the two after-split blocks are sorted which also compacts (removes CDEs)
    4. insert the new row where appropriate.

    Factors I considered :
    1. if the new row is > 4k, compacting will not help, even if 99% are CDEs
    2. predicting if compacting will help may involve iffy math
    3. 90:10 splitting will always allow the new insert (in the new empty block)
    4. constant compacting is inefficient
    5. especially if (as I assume) compacting uses/needs a full 10.9 before image as well

    Open questions :
    1. Will a 50:50 split plan for CDEs? I.e. if the split block is 50% CDEs (either a) lower half is CDE or b) every second row is CDE) will the split blocks look the same or will the left block for a) end up empty?
    2. Will oracle remove empty blocks (after they existed temporarily) or live with them? Both for the 1.a) case and if, after a 90:10 split and compacting the left block, the new row might fit there, so the new right block is unnecessary.

    https://richardfoote.wordpress.com/2015/06/23/empty-leaf-blocks-after-rollback-part-i-empty-spaces/

    Not sure how this matches up with reality but my ideas for index insert undo look like this :
    1. The undo-records are “set dictionary slot X in block Y to deleted”. So unlike delete undo (which are inserts) they don’t contain the key value.
    2. Subsequent inserts below/before the previous ones in the block shift the slot #s around. Rollbacks don’t roll back the structure. I.e. the inserted entries are not removed/dictionary shrunk but simply set to “Del”.
    3. During CR-block construction true before images (physically removing undone inserts) are created, so subsequent processes can take such a CR and roll it back further to an older SCN. If this crosses a split the before image is reconstructed from the 10.9 undo.
    4. During true rollbacks Oracle just keeps track of the slot # mappings, so the undo-slot “X” can be mapped to where it is now without physically removing each undone insert. When crossing a split the mapping structure is re-initialized with the 10.9-dictionary and subsequently updated like in the CR case. However this is then used to recover the key-value (also from the 10.9) for each undo-record which is then looked up in the new structure (via the usual B* walk) and set to “del” wherever it is now after the split. This would be similar to stale ROWID-guesses in an IOT using the key value for correction.

    A potential side effect of this could be that the 10.9 undo contains the original dictionary (including CDE slots) but not the CDE-key values. Those should never be needed. This could be tested by deleting many entries (-> CDE) before the insert-split-test. This would show if
    – is compacting without splitting done or not?
    – if it is : does it still use a 10.9 ?
    – if there is a 10.9 : is its undo smaller (no CDE-keys)?

    regards,

    Comment by Racer I. — November 5, 2018 @ 11:01 am GMT Nov 5,2018 | Reply

  8. Hi,

    > What I hadn’t noticed before in Julians undo-example is that the two ITL-slots there are for different blocks (2fc, 2fb) and different from the redo before (2fd). That is confusing my picture.

    Rewrite : all those are uba, which likely means undo block address. Of course ITLs are always for the block they are in, so don’t need to point there.
    The actual affected block is in the next row I guess :

    Dump kdige : block dba :0x01001bbc, seghdr dba: 0x01001bbb

    > 1. if the new row is > 4k, compacting will not help, even if 99% are CDEs

    Also wrong. Two 4k entries each need a block of their own but can coexist with smaller ones. So this should at least read “may not help” and folds into the iffy math topic.

    > 2. predicting if compacting will help may involve iffy math

    This might be clear (block content – sum(CDEs) may even be stored and updated in the block header) but there may then be optimization rules with thresholds when to do compacting and when to do splitting making testing harder.

    regards,

    Comment by Racer I. — November 5, 2018 @ 12:19 pm GMT Nov 5,2018 | Reply

  9. Hi,

    Weekend theories…

    https://richardfoote.wordpress.com/category/index-delete-operations/

    Any insert cleans up CDEs. And this may be the answer :

    https://orainternals.wordpress.com/tag/redo-record/

    Change #8 specifies to purge the leaf row with the key value ’06 52 69 79 61 6a 53 06 01 00 11 3d 00 00′ [ RiyajS + rowid combo].

    Insert-undo does contain the index entry to remove, unlike the table version, which only specifies the rowid to delete (right?).
    So no 10.9 shenanigans needed to find the index entry either before or after splits/compacting.

    regards,

    Comment by Racer I. — November 5, 2018 @ 1:14 pm GMT Nov 5,2018 | Reply

  10. Hi,

    > If you want to dig into all the details, send me an email

    Thanks for the offer but I guess if I actually want to test various scenarios I’d better set up a virtual DB on my machine, where I can do block and redo dumps myself. Also, after my last comment there isn’t so much left worth investigating. The true purpose of the 10.9 undo still eludes me but at the moment I wouldn’t even know what to look for. It might really only be crash protection, i.e. if the first 10.8 makes it to disk but the second doesn’t, or maybe for whatever can cause a split to fail. It might never be used for regular rollbacks/CRs.

    regards,

    Racer I.

    Comment by Racer I. — November 6, 2018 @ 7:07 am GMT Nov 6,2018 | Reply



Comments and related questions are welcome.
