Oracle Scratchpad

March 7, 2010

Treedump – 2

Filed under: Indexing,Infrastructure,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 6:32 pm GMT Mar 7,2010

In an earlier article about investigating the state of an index in detail I supplied a piece of SQL that would analyse an index (no, not using the Analyze command) and summarise the number of entries in each leaf block that currently held any entries at all. Here’s a sample of the type of output it produced:

-------------- ---------- ---------- -------------
             1       8706       8706          8706
             2       4830       9660         13536
             3       2953       8859         16489
             4       1648       6592         18137
             5        997       4985         19134
             6        628       3768         19762
             7        300       2100         20062
             8        162       1296         20224
             9         87        783         20311
            10         52        520         20363
           227        100      22700         31871
           228        111      25308         31982
           229         84      19236         32066
           230         97      22310         32163
           231         77      17787         32240
           232         61      14152         32301
           233         54      12582         32355
           234        529     123786         32884
               ---------- ----------
sum                 32884    1887595

This is useful of course as it tells you a lot about the way that space is used in the index – but it only tells you how many blocks are in any given state, it doesn’t tell you whereabouts in the index those blocks are. It would be really nice if you could “draw a picture” of the index, showing where the gaps were and where it was densely packed.

I was thinking about this a few weeks ago and suddenly realised that it was a simple problem. Oracle gives you exactly the tools you need. I was travelling back from a client site to London and the  train ride was only an hour long so it gave me just enough time to generate a quick test to demonstrate the principle. ***

We start with an article on the treedump command, which I published a little while ago. This dumps a tracefile describing your index, one line per block of the index in correct index order.

----- begin tree dump
branch: 0x40035a 4195162 (0: nrow: 3, level: 2)
   branch: 0x40e2c0 4252352 (-1: nrow: 312, level: 1)
      leaf: 0x40035b 4195163 (-1: nrow: 240 rrow: 240)
      leaf: 0x40215f 4202847 (0: nrow: 218 rrow: 218)
      leaf: 0x40035c 4195164 (1: nrow: 444 rrow: 444)
      leaf: 0x40035d 4195165 (2: nrow: 444 rrow: 444)
       ... etc ...

This is the key to the problem; all you have to do is read the trace file and you’ve got all you need to know – except that it’s one line per leaf block, and that’s a lot of reading for a large index and that will make it hard to spot interesting patterns.

But it’s a trace file – and we can read trace files from SQL by making use of external tables; [update: but see this note for a more modern option for reading trace fles] and we can collapse data down to a more reasonable size by using analytic functions like the ntile() function. So here’s a little script that puts a few simple pieces together and comes up with a fairly useful tool.

rem     Script:         read_treedump.sql
rem     Dated:          Feb 2010
rem     Author:         Jonathan Lewis
rem     Purpose:        Using external files to analyze an index.
rem     Last tested:
rem     Not tested
rem     Not relevant
rem            -- no external tables
rem     Notes:
rem     See read_trace.sql for notes on reading trace files.
rem     Required Privileges:
rem     grant create any directory to the user.

connect test_user/test

start setenv
set timing off

execute dbms_random.seed(0)

set echo on
drop table t1;
set echo off

        begin           execute immediate 'purge recyclebin';
        exception       when others then null;

                when others then null;

        begin           execute immediate 'begin dbms_stats.delete_system_stats; end;';
        exception       when others then null;

        begin           execute immediate 'alter session set "_optimizer_cost_model"=io';
        exception       when others then null;

--      ------------------------
--      Create a table and index
--      ------------------------

define m_size=100000

create table t1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                rownum <= 10000
        rownum                  id,
        lpad(rownum,10,'0')     vc_id
        generator       v1,
        generator       v2
        rownum <= &m_size

create index t1_i1 on t1(vc_id) nologging;

--      ---------------------------------------------------------
--      Create a "FIFO" pattern of space in the index leaf blocks
--      ---------------------------------------------------------

delete from t1
        id between 1 and 0.5 * &m_size
and     mod(id,100) != 0


delete from t1
        id between  0.5 * &m_size and 0.75 * &m_size
and     mod(id,50) != 0


delete from t1
        id between  0.75 * &m_size and 0.875 * &m_size
and     mod(id,25) != 0


delete from t1
        id between  0.875 * &m_size and 0.9375 * &m_size
and     mod(id,12) != 0


delete from t1
        id between  0.9375 * &m_size and 0.96875 * &m_size
and     mod(id,6) != 0


--      -------------------
--      Dump the index tree
--      -------------------

column  object_id new_value m_object_id

        object_name = 'T1_I1'

alter session set events 'immediate trace name treedump level &m_object_id';

--      -------------------------------
--      Now sort out the trace filename
--      It's easier in 11g
--      -------------------------------

column spid new_value m_spid

        trim(pro.spid)  spid
        v$process       pro,
        v$session       ses
        pro.addr = ses.paddr
and     ses.sid  = (
                select  sid
                from    v$mystat
                where   rownum = 1

define m_filename = 'd10g_ora_&m_spid..trc'
prompt Trace file name: &m_filename

--      ------------------------------------------
--      Set up an external table to read the trace
--      ------------------------------------------

define m_file_dir = 'c:\oracle\admin\d10g\udump'
define m_log_dir  = 'c:\working';

create or replace directory ext_tab as '&m_file_dir';
create or replace directory log_dir as '&m_log_dir';

drop table ext;

create table ext(text_line varchar(4000))
organization external
        type oracle_loader
        default directory ext_tab
        access parameters (
                records delimited by newline
                characterset us7ascii
                discardfile     log_dir:'read_trace_dis.txt'
                badfile         log_dir:'read_trace_bad.txt'
                logfile         log_dir:'read_trace_log.txt'
                        reject rows with all null fields
                        text_line       (1:4000)        char
        location ('&m_filename')
)       reject limit unlimited

--      -----------------------------------------------------
--      Read the treedump in order, selecting only lines with
--      "leaf" in them.
--      branch: 0x140020a 20972042 (0: nrow: 307, level: 1)
--         leaf: 0x140020b 20972043 (-1: nrow: 326 rrow: 3)
--         leaf: 0x140020c 20972044 (0: nrow: 326 rrow: 3)
--      Use instr() to get the text following the last ": ",
--      and select rownum to get an ordering on the lines.
--      Use substr() and length() to trim off the final ")"
--      then turn the result into a number and select ntile()
--      to collect the numbers into ordered batches.
--      Finally sum the batches - I've used 50 tiles because
--      that's a page length of numbers, you could use 100 or
--      more if you selected the result into a spreadsheet to
--      graph the results.
--      -----------------------------------------------------

spool read_treedump

set pagesize  60

        sum(leaf_row_count)     row_count
                ntile(50) over (order by rn)    section,
                )       leaf_row_count
                        rownum rn,
                                instr(text_line,':',-1) + 2
                        )                       text_line
                        instr(text_line,'leaf') != 0
group by
order by

spool off

If you feel like making more of it, you could probably turn it into a pipe-lined function taking the index schema and name as an input, together with the number of tiles, and returning the data set. You’d need to customise the code to identify the trace file location, and probably have a fixed setup for directories rather than creating them on the fly in the function. Once you’ve done this you could then set up a spreadsheet with a ‘select from table_function()’ call and pull the data straight into a chart.

Here’s the output I got from this demonstration case. As you can see, it shows that the index starts with a great long tail that is thinly populated and that a large fraction of the data is packed into the last 10% of the index, with most of that being in the right-hand 4% of the index.

---------- ----------
         1         22
         2         23
         3         23
         4         23
         5         23
         6         22
         7         23
         8         20
         9         19
        10         20
        11         19
        12         20
        13         20
        14         19
        15         20
        16         19
        17         20
        18         19
        19         20
        20         20
        21         19
        22         20
        23         19
        24         20
        25         31
        26         39
        27         39
        28         40
        29         39
        30         39
        31         39
        32         39
        33         39
        34         39
        35         39
        36         39
        37         40
        38         71
        39         78
        40         78
        41         79
        42         78
        43         78
        44        115
        45        163
        46        163
        47        201
        48        326
        49       1368
        50       1874
sum              5665

Reminder: a treedump walks the index in order, reading one block at a time – this can be a lot of work and take a lot of time. There are also versions where it does a full symbolic block dump for every leaf in the index, so test it on a very small index before trying to dump a large index.

*** Footnote:

Like many of the scripts I’ve published on this blog, the SQL shown here is “good enough” for me to use because I know how it works and I won’t have to use it often. It’s not intended to be a bullet-proof program in a “productised” state.

Update (Dec 2010)

(See comment 7 below) each line of the treedump is about 75 bytes long – so the whole trace file will be in the region of (75 * leaf_blocks / 1048576)MB in size. Make sure that you can dump a trace file that size before you start.  Check the parameter max_dump_file_size.


  1. Wonderful and useful.

    Comment by fli — March 9, 2010 @ 5:22 pm GMT Mar 9,2010 | Reply

  2. […] Leaf Block scanner: a labour-intensive analysis of leaf block usage for a simple B-tree index. Drawing an Index: An example of reading a treedump and using an analytic function to “draw a picture” of […]

    Pingback by Simple scripts « Oracle Scratchpad — March 10, 2010 @ 10:29 pm GMT Mar 10,2010 | Reply

  3. […] Drawing a picture of the index […]

    Pingback by Index analysis « Oracle Scratchpad — March 10, 2010 @ 10:44 pm GMT Mar 10,2010 | Reply

  4. […] 4-How to automate to read index treedump via external table ? Jonathan Lewis-treedump-2 […]

    Pingback by Blogroll Report 05/02/2010 – 12/03/2010 « Coskan’s Approach to Oracle — April 23, 2010 @ 1:07 am BST Apr 23,2010 | Reply

  5. […] Lewis @ 6:45 pm UTC Jul 1,2010 Some time ago I published a little script that showed you how to read an index treedump and summarise the leaf block usage in the logical order of the index leaf blocks – allowing […]

    Pingback by Index branches « Oracle Scratchpad — July 1, 2010 @ 6:46 pm BST Jul 1,2010 | Reply

  6. Hi Jonathan,

    I am trying taking tree dump for one of my non partitioned PK index which is sized at 289 GB. Its a 3 column composite index having compression enabled. I read the bottom part of your post saying, treedump will take lot of time. Its running for last 1 hr (index is having dba_indexes.leaf_blocks=37226645) and i am expecting more time. While i am fine with larger treedump time, i dont understand why the tracefile is not getting updated. For first 5 min it got 20MB and for last 55 min it is sized at same.

    However, i am observing ‘db file sequential read’ for the session which is taking treedump and i believe, this that is due to scanning of all leaf blocks one by one as you have mentioned in blog. My table holds financial records for last 5 years and it is never purged. There is no ETL but around 20-30 concurrent sessions are Inserting rows (conventional path insert) all the time (except weekend) through services. Last time when i did the index rebuild, i gained lot of gain in index size. This time, i want to analyze index structure to find WHERE-ABOUT of fragmented entries within index structure. Index ini_trans is 2 and PCT_FREE is set to 0 when it was created and rebuilt last time.

    I will update the status of my treedump test as soon as i have tracefile completed.

    Comment by Bhavik Desai — December 5, 2010 @ 8:59 am GMT Dec 5,2010 | Reply

    • Bhavik,

      It looks as if you have a max_dump_file_size set to the equivalent of 20MB. You don’t say which version of Oracle you are using, but the default in 10g is “unlimited” – so perhaps it has been deliberately restricted on your system.

      Each line of the trace will be about 75 bytes, and you have about 37M blocks, so you will get a trace file of about 2.5GB – 3.0 GB from the call. And, as you have observed about 37M single block read requests (although you may find that sometimes you get “db file parallel reads”.

      If your trace file size is limited, you could change it, before you start the dump, with:

      alter session set max_dump_file_size = unlimited;

      I have added a comment to both articles highlighting this point.

      Have you seen my note and script on estimating the “correct” size of an index ?

      Comment by Jonathan Lewis — December 5, 2010 @ 12:12 pm GMT Dec 5,2010 | Reply

  7. […] Dec 2010: (See this comment on a related post): Each line of a treedump takes about 75 bytes, so you need to be able to generate a tracefile of […]

    Pingback by treedump « Oracle Scratchpad — December 5, 2010 @ 12:27 pm GMT Dec 5,2010 | Reply

  8. Thanks Jonathan,

    As always,you are right in identifying the problem.I forgot to set tracefile size to unlimited. I did that and restarted treedump process again 555 min ago and its going on. Thanks for providing estimated trace file size. I have 45GB space available on my dump destination. I hope treedump will not lock anything and its just doing sequential or parallel read index block reads.

    Thought of giving you first few and last (from trace till now) few lines of trace file…
    My intention here is to identify how much space i am going to gain after doing index online rebuild and to understand the Index structure. (I am open to consider COALESCE or Shrink commands, but intention of index re-org is to gain storage.)

    branch: 0x40b700d 67858445 (0: nrow: 14, level: 4)
       branch: 0x161bada0 370912672 (-1: nrow: 118, level: 3)
          branch: 0x14036237 335766071 (-1: nrow: 112, level: 2)
             branch: 0x16002253 369107539 (-1: nrow: 187, level: 1)
                leaf: 0x40b700e 67858446 (-1: nrow: 205 rrow: 205)
                leaf: 0x40b700f 67858447 (0: nrow: 205 rrow: 205)
                leaf: 0x40b7010 67858448 (1: nrow: 205 rrow: 205)
                leaf: 0x44035e9 71316969 (2: nrow: 205 rrow: 205)
                leaf: 0x44035ea 71316970 (3: nrow: 201 rrow: 201)
                leaf: 0x44035eb 71316971 (4: nrow: 200 rrow: 200)
                leaf: 0x44035ec 71316972 (5: nrow: 200 rrow: 200)
                leaf: 0x44035ed 71316973 (6: nrow: 199 rrow: 199)
                leaf: 0x44035ee 71316974 (7: nrow: 190 rrow: 190)
                leaf: 0x44035ef 71316975 (8: nrow: 190 rrow: 190)
                leaf: 0x44035f0 71316976 (9: nrow: 190 rrow: 190)
                leaf: 0x7c82fc7f 2088959103 (215: nrow: 55 rrow: 55)
                leaf: 0x160fe38b 370140043 (216: nrow: 56 rrow: 56)
                leaf: 0x7c825de8 2088918504 (217: nrow: 45 rrow: 45)
                leaf: 0x160fe38c 370140044 (218: nrow: 104 rrow: 104)
                leaf: 0x160fe38d 370140045 (219: nrow: 49 rrow: 49)
             branch: 0x7c816ae7 2088856295 (76: nrow: 237, level: 1)
                leaf: 0x7c816aa7 2088856231 (-1: nrow: 57 rrow: 57)
                leaf: 0x160fe38e 370140046 (0: nrow: 105 rrow: 105)
                leaf: 0x160fe38f 370140047 (1: nrow: 105 rrow: 105)
                leaf: 0x160fe390 370140048 (2: nrow: 61 rrow: 61)
    ------------------------------ --------- -------------- ---------- ---------- ----------- ------------- ----------------- -------- ---------- -----------
    PK_AE_IDEMPOTENCY_IOT          UNIQUE             65536                     4    37226645    4482267569        3943298862 VALID    4482267569      139790

    (This is not an IOT index and table is a non-partitioned heap table).
    Index columns are :

    ----------------------------------- ------------ --------- ----------- ----------- ---------------
    EXTERNAL_REFERENCE_ID_TYPE                    36         0          23          13 FREQUENCY
    ACCTG_EVENT_TYPE_ID                          124         0          46           4 FREQUENCY
    EXTERNAL_REFERENCE_ID                 3327918080         0           1          36 NONE

    EXTERNAL_REFERENCE_ID column is being populated by an Oracle sequence. (INCREMENT_BY 10, cache 1000,last_number=89297318104)

    Comment by Bhavik Desai — December 5, 2010 @ 8:57 pm GMT Dec 5,2010 | Reply

  9. Bhavik,

    I don’t have time for a full analysis – but here are a few brief comments.

    The number of rows per leaf block at the start is consistent with the average column lengths of the index. Later on it is half the size – suggesting that you’ve hit the “ITL explosion” problem ( ). The fact that the leaf blocks oscillate between 50 and 100 rows per block also suggests that rows have been appearing out of sequence in a way that does not allow for “back-fill”, making your problem worse.

    Your sequence cache size looks too small, and you seem to be converting (by padding with leading zeros, perhaps) a 10 digit number into a 36 character string – which probably accounts for a huge fraction of your index size. It’s probably not feasible to change this – but it looks as if it needs changing.

    Given the very small number of distinct value for the first two columns in this index I think you should probably aim to rebuild it with compress 2 to reduce repetitions of the leading two columns in each leaf block.

    You might want to consider hash partitioning this index at least a few ways – eight or 16 , perhaps – given the impact of concurrency on your code. Alternatively it may be a nice idea to recreate the table as list partitioned on external_reference_id_type. Obviously there are lots of things about your application that I can’t know which might make both these ideas totally inappropriate.

    There is a recent bug fix for the ITL problem, by the way, see:

    Comment by Jonathan Lewis — December 5, 2010 @ 9:28 pm GMT Dec 5,2010 | Reply

  10. […] treedump 2: Another look at tree dumps. […]

    Pingback by Index ITLs « Oracle Scratchpad — December 5, 2010 @ 9:31 pm GMT Dec 5,2010 | Reply

  11. Thanks Jonathan for your time and valuable inputs.
    I forgot to mention (and you asked earlier) that my database is running on – 64bit Production (RHEL 5.3).
    I will definitely look at the suggestion you have given w.r.t partitioning options, increasing sequence cache ,padding issues with sequence# and compression on first 2 columns.
    My treedump trace is complete and here is output of sql that you listed:

       SECTION  ROW_COUNT                                                                                                                                                                         
    ---------- ----------                                                                                                                                                                         
             1  150249213                                                                                                                                                                         
             2  143753682                                                                                                                                                                         
             3  132210736                                                                                                                                                                         
             4  119418485                                                                                                                                                                         
             5  103893144                                                                                                                                                                         
             6   96642125                                                                                                                                                                         
             7   85314860                                                                                                                                                                         
             8   58189847                                                                                                                                                                         
             9   58775055                                                                                                                                                                         
            10   57281532                                                                                                                                                                         
            11   54244584                                                                                                                                                                         
            12   54219432                                                                                                                                                                         
            13   55920771                                                                                                                                                                         
            14   53104738                                                                                                                                                                         
            15   45265172                                                                                                                                                                         
            16   45333611                                                                                                                                                                         
            17   45662814                                                                                                                                                                         
            18   48355961                                                                                                                                                                         
            19   49588296                                                                                                                                                                         
            20   51226216                                                                                                                                                                         
            21   51475692                                                                                                                                                                         
            22   52844793                                                                                                                                                                         
            23   52739743                                                                                                                                                                         
            24   53343423                                                                                                                                                                         
            25   57688623                                                                                                                                                                         
            26   57589749                                                                                                                                                                         
            27   58223622                                                                                                                                                                         
            28   51777971                                                                                                                                                                         
            29   43859603                                                                                                                                                                         
            30   43892707                                                                                                                                                                         
            31   44052739                                                                                                                                                                         
            32   50383896                                                                                                                                                                         
            33   49954801                                                                                                                                                                         
            34   50000154                                                                                                                                                                         
            35   92424302                                                                                                                                                                         
            36  129564433                                                                                                                                                                         
            37  130020940                                                                                                                                                                         
            38  125626681                                                                                                                                                                         
            39  124406399                                                                                                                                                                         
            40  128474085                                                                                                                                                                         
            41  122793813                                                                                                                                                                         
            42  146882113                                                                                                                                                                         
            43  160786316                                                                                                                                                                         
            44  157359913                                                                                                                                                                         
            45  165812019                                                                                                                                                                         
            46  180526171                                                                                                                                                                         
            47  175796576                                                                                                                                                                         
            48  184090887                                                                                                                                                                         
            49  114441800                                                                                                                                                                         
            50  115861521                         
    50 rows selected.

    I have also ran index_efficiency_3.sql

    ROWS_PER_BLOCK     BLOCKS     ROW_CT CUMULATIVE_BLOCKS                                                                                                                                        
    -------------- ---------- ---------- -----------------                                                                                                                                        
                 5          1          5                 1                                                                                                                                        
                11          1         11                 2                                                                                                                                        
                16          2         32                 4                                                                                                                                        
                17        105       1785               109                                                                                                                                        
                18        260       4680               369                                                                                                                                        
                19        420       7980               789                                                                                                                                        
                20       1260      25200              2049                                                                                                                                        
                21       1897      39837              3946                                                                                                                                        
                22       1905      41910              5851                                                                                                                                        
                23       2053      47219              7904                                                                                                                                        
                24       2107      50568             10011                                                                                                                                        
                25       2298      57450             12309                                                                                                                                        
                26       3420      88920             15729                                                                                                                                        
               488          5       2440          36858666                                                                                                                                        
               490          1        490          36858667                                                                                                                                        
               493          1        493          36858668                                                                                                                                        
               498          1        498          36858669                                                                                                                                        
                   ---------- ----------                                                                                                                                                          
    sum              36858669 4484015189                                                                                                                     

    May i ask (definitely, respecting your time and bandwidth) for further analysis on what else (apart from ITL issue) have caused such large fragmentation ?

    Comment by Bhavik Desai — December 6, 2010 @ 1:48 pm GMT Dec 6,2010 | Reply

  12. Thanks Jonathan…

    I have confirmed that it is an ITL explosion case. I dumped the last leaf block and found that it has ITC=127. Interestingly, i dumped some random leaf blocks which were listed UP in the treedump (5k lines up than last leaf block) and found that ICT is 148..

    Giving you leaf block dump :

    Block header dump:  0x6cded763
     Object id on Block? Y
     seg/obj: 0x31a50  csc: 0x692.2f341beb  itc: 148  flg: E  typ: 2 - INDEX
         brn: 0  bdba: 0x6cdec205 ver: 0x01 opc: 0
         inc: 0  exflg: 0
    leaf block dump
    header address 47662361961988=0x2b5942276604
    kdxcolev 0    --&gt;index level (a value of 0 denotes this as a leaf block)
    KDXCOLEV Flags = - - -
    kdxcolok 0
    kdxcoopc 0xa0: opcode=0: iot flags=-C- is converted=Y
    kdxconco 3  --&gt;number of columns in the index.
    kdxcosdc 1
    kdxconro 100 --&gt;number of index entries in the index block.
    kdxcofbo 268=0x10c
    kdxcofeo 2757=0xac
    kdxcoavs 2254 --&gt;available free space within the block
    kdxlespl 0
    kdxlende 0 --&gt;number of index entries that have been marked as deleted within the block 
    kdxlenxt 0=0x0
    kdxleprv 1872632536=0x6f9e1ad8
    kdxledsz 6
    kdxlebksz 4528 --&gt;actual maximum usable space within the index block (basically, it is block_size(8192) less the block header “overheads”). I am guessing overhead =9
    kdxlepnro 3
    kdxlepnco 2
    prefix row#0[4498] flag: -P----, lock: 0, len=30
    col 0; len 23; (23):
     54 72 61 6e 73 66 65 72 52 65 63 65 69 76 65 41 63 74 69 6f 6e 49 44
    col 1; len 3; (3):  c2 0f 06
    prc 100
    row#0[2498] flag: ------, lock: 0, len=20, data:(6):  27 9d 8a f7 00 99
    col 0; len 11; (11):  35 33 38 33 39 33 32 30 30 30 37
    psno 0
    row#1[2518] flag: ------, lock: 0, len=20, data:(6):  27 9d 8a a3 00 1c
    col 0; len 11; (11):  35 33 38 33 39 33 32 31 30 30 37
    psno 0
    row#2[2538] flag: ------, lock: 0, len=20, data:(6):  27 9d 8b 1d 01 11
    col 0; len 11; (11):  35 33 38 33 39 33 32 32 30 30 37
    psno 0
    row#3[2558] flag: ------, lock: 0, len=20, data:(6):  27 9d 8a fc 01 21
    col 0; len 11; (11):  35 33 38 33 39 33 32 33 30 30 37
    psno 0
    row#4[2578] flag: ------, lock: 0, len=20, data:(6):  27 9d 8a 76 00 5c
    col 0; len 11; (11):  35 33 38 33 39 33 32 34 30 30 37
    psno 0

    I am bit skeptic about space utilization and guessing that there is something else that i am not considering while calculating leaf block size.

    Z=kdxlebksz (4528) –>Max.Usable_Space
    Y=kdxcoavs (2254) —>AVAILABLE_SPACE
    Therefore USED_SPACE, X=(Z-Y)=2274 (2274 is space occupied by all rowids in leaf block)
    Block over_head for B-tree index=9 bytes
    SO,TOTAL_SIZE_IN_LEAF_BLOCK=(9+3352+4528)=7889 which is 303 Bytes less than 8192…Where did 303 bytes go ?

    My PK index is a 3 column compressed index and compression is set for first 2 columns. Since it is a composite index, i am sure that 50-50 leaf block split have also occurred despite the fact that last column is a sequence based.
    Here is an order of index population :
    1) Table is created without any data
    2) Pk index is created on that table with PCTFREE=0,INI_TRANS=2 AND compression set on first 2 columns
    3) Populated 90M rows from an existing table (INSERT APPEND PARALLEL)
    4) Started the application service to load new data into table and thus into index (Conventional path INSERT-~30-50 sessions inserting at a time)

    Application(Service) has observed ‘low commit time’ couple of times in past that suggests that,# of uncommitted transactions might have gone up during that time and thus INI_TRANS is increased up-to 148… (This is just my guess from application low commit time issue)

    Comment by Bhavik Desai — December 8, 2010 @ 8:54 am GMT Dec 8,2010 | Reply

    • Actually,let me take it back….I am counting Overhead only one time (9 bytes), but it is actually (9*kdxconro-#of rowids in that leaf block)=(9*100=900)
      SO,TOTAL_SIZE_IN_LEAF_BLOCK=(900+3352+4528)=8780…this is now more than 8192 !
      i don’t know where i am making mistake !

      Comment by Bhavik Desai — December 8, 2010 @ 10:49 am GMT Dec 8,2010 | Reply

    • Hi Jonathan,

      I tried creating a HASH partitioned IOT on last column (128 hash partitions). But partition movement can cause significant outage to application service which is loading records in IOT (Conventional path insert). Inserts are happening 24*7 and whenever i do alter table iot move partition x; it will lock that particular partition and hence, i would either get error (ORA-00054: resource busy ) while partition movement or my insert will wait till the partition where it is trying to insert is moved.

      Comment by Bhavik Desai — December 14, 2010 @ 2:03 pm GMT Dec 14,2010 | Reply

  13. Bhavik,

    I have had a look at your results – and will make a few comments as soon as I have a little free time. IN the meantime – is the index compressed on the first two columns, and how much variation appears in the lengths of the values used in the three columns ?

    Comment by Jonathan Lewis — December 8, 2010 @ 9:35 pm GMT Dec 8,2010 | Reply

  14. Thank you very much Jonathan…
    Yes…index is indeed compressed on first two columns and i believe,compression ration is very good.
    If you look at the leaf block dump that i provided, you would find that all 100 rowids in that block is sharing the same PREFIX row (prc 100). I believe, your question to me is : to find variation in length of three columns in DIFFERENT leaf block dumps….am i right ?…If yes..then…here it is :

    Position_Of_Block_In_Treedump         #rows_in_leaf_block(kdxconro)     ITC      prefix      Len (for row#0 and row#1)
    First leaf block                       205                               2        205         36,36
    Middle Leaf block                      58                                3         58         6,6
    100k line up from bottom               338                               2        338         20,20
    70k line up from bottom                120                             118        120         20,20
    50k line up from bottom                110                             131        110         20,20
    30k line up from bottom                120                             117        120         20,20
    5k line up from bottom                 100                             126        100         20,20
    1k line up from bottom                 117                             126        117         19,19
    last leaf block                        108                             126         85         19,19

    Let me know what else i can provide for better analysis ?
    Many thanks in advance…

    Comment by Bhavik Desai — December 9, 2010 @ 6:45 am GMT Dec 9,2010 | Reply

    • Bhavik,

      I’m running at least 24 hours behind your comments – I hadn’t even got as far as reading your blockdump when I wrote my last comment, so I didn’t notice you’d answered my question long before I asked it.

      My initial comments about compression and the left-padding of the sequence number was based on the indications of ca. 200 rows per blocks. Your treedump showing more than 400 rows per leaf block suggested that the left-padding was wrong, or you were compressing, or both.

      I’ve just spent a couple of minutes looking at the block dump (and your description of the initial load), which probably answers all the questions I might want to ask about how you’ve implemented this index. I don’t think there’s anything surprising about what your seeing – and I may find time to write up some comments this evening (or over the weekend).

      I might extract some of the material and turn it into a blog item rather than leaving a chain of comments.

      Comment by Jonathan Lewis — December 9, 2010 @ 8:03 am GMT Dec 9,2010 | Reply

      • Thanks you Sir…
        I really got excited when i see your reply, whether directly on your blog or on OTN ! In past,We did talk personally when you came to India (Hyderabad) couple of months back on FREELIST-32 for MSSM tablespace to avoid BBW and NLNVL function to handle NULL value. Your suggestions were awesome and it did bring lot of performance enhancements in my critical DW.

        Comment by Bhavik Desai — December 9, 2010 @ 8:16 am GMT Dec 9,2010 | Reply

    • Bhavik,

      A few summary comments on the information supplied.

      Point 1:
      Your observation that this index will not be doing 90/10 splits even though it involves a sequence is correct. The sequence is not the leading column – and that is a pre-requisite for 90/10 splits.

      Given the number of rows – 4.48 Billion – and number of distinct values for the leading pair of columns (EXTERNAL_REFERENCE_ID_TYPE 36 values, ACCTG_EVENT_TYPE_ID 124 values) there could be 4,464 leading combinations with about 1,000,000 rows each. In this type of scenario the index will, in effect, consist of 4,463 sections doing 50/50 splits with only the top section able to do 90/10 splits. In effect it will behave roughly like 4,464 separate indexes jammed together (and, in passing, the compression effects will be extremely good – the one prefix per block you’ve seen.)

      Point 2:
      There are some contradictions in the information you’ve supplied. The average column length of EXTERNAL_REFERENCE_ID is 36 – but you’ve pointed out that this is a sequence number (increment 10, cache 1000, current high value at 11 digits) that has been stored as a varchar2. In the block dump you supplied we can see that the value is stored as a simple “to_char” of a number, taking 11 bytes – if this is always the case, how does the average column length get to be 36 ? (In passing, the values in the block you’ve dumped are increase by 1,000 each row – that’s a bit of a coincidence I’d want to investigate.)

      In one of your other listings you show the “row #0” lengths at 6, 19, 20, and 36 bytes – the first might be (must be) deleted rows, the middle pair show rows that look appropriate for storing 10 and 11 digit numbers as character strings, the highest would require a 27 character string. So (a) you seem to have been deleting or updating key values and (b) you seem to have used (or be using) more than one strategy for storing the EXTERNAL_REFERENCE_ID. (And none of your examples show the row#0 length of 44 that would be consistent with an average column length of 36).

      Point 3:
      You loaded 90 Million rows into the table/index initially with pctfree 0, and you mention you have rebuilt it in the past. On the load, and on the rebuild, you will have packed the index to 100%, with no spare ITL entries. As soon as the concurrent processes started to run they would cause the top blocks of each section of the index to do 50/50 block splits (apart from the very top block which would do 90/10).

      Assuming some combinations of (EXTERNAL_REFERENCE_ID_TYPE, ACCTG_EVENT_TYPE_ID) are more popular than others, some sections of the index would see a higher level of concurrency than the others – so they would be more likely to accumulate more ITL entries on the leaf node split more rapidly. (This explains why you can get “randomly” varying results for ITC values across the index – it’s behaving like 4,464 separate indexes).

      Rows with the longer values for the row#0 (ca. 40) figures will pack a block up to about 200 entries. Rows with the shorter values (ca. 20) will pack a block up to about 400 entries. Blocks which have seen deletes (row#0 = 6) may show enough variation to account for the blocks you reported with nearly 500 rows – especially if this was a very small fraction of the blocks.

      So we should not be surprised to see: blocks with 200 or 400 rows (the initial load / rebuild effect), blocks with 100 or 200 rows (50/50 leaf node split), and blocks with 50 to 100 rows (leaf node split plus worst case ITL bug).

      Obviously, since I don’t know what you are really doing with the EXTERNAL_REFERENCE_ID, there may be lots of noise surrounding this pattern. Similarly, since the ITL bug introduces extra ITL entries over time at a rate that is loosely dependent on “local” concurrency and a bit of bad luck, there may be further noise surrounding the figures.

      Point 4:
      Possible action points:
      a) Find out what is really going on with the EXTERNAL_REFERENCE_ID and see if you can sanitise it.
      b) Work on getting the patch from Oracle for the ITL bug
      c) If space is really significant, rebuild the index.
      d) If space is not at a premium, rebuild the index ONLY IF you can see that the physical I/O on the index is a significant contributor to any performance problems (See “Segments by Physical Reads” in the AWR as a guide.) Note – if all your queries are about ‘recent’ EXTERNAL_REFERENCE_IDs then your system may work perfectly well if it can cache the last few blocks from each of the 4,464 index sections to avoid physical I/O on the index; so even if the whole index is wasting space, the way you are using the index may mean it is not causing an I/O problem.
      e) To minimise hot spots and to make future rebuilds easier (if you think it’s likely to be necessary), and if your important queries always include predicates like ‘EXTERNAL_REFERENCE_ID = {constant}’ rebuild the index as a hash-partitioned index on EXTERNAL_REFERENCE_ID – perhaps as many as 128 partitions. If you’re on 11g you might look at list/hash composites (if that’s legal, I’d have to check it myself) – listing on external references types, hashing on EXTERNAL_REFERENCE_ID. You might even consider rebuilding the table as a partitioned table with local indexing.

      f) Create (by writing a script – see my blogs on frequency histograms) a histogram on the first two columns of the index: you may find the optimizer working better if it knows fairly precisely the distribution of data for those values. If you’re on 11g and the number of combinations is a long way off the 4,464 I’ve been talking about then create some extended stats on the combination to tell Oracle the number of distinct combinations – it may help in some cases (again, write a program after an initial check.) Note that your “num_buckets” on the two columns was less than the num_distinct – this means that you have some values which don’t appear in the histogram, which may make the optimizer produce some silly cardinality estimates.

      g) Any time you do rebuild the index, keep a log of WHEN you took it, and how big it was after the rebuild: any future estimates of how inefficient the index might be (before you do the expensive checks) needs to know what baseline you started from.

      h) I note that the block dump you supplied showed the EXTERNAL_REFERENCE_ID_TYPE as “TransferReceiveActionID”. Because of the index compression this is not wasting storage in the index – but if you really are pushed for space you could consider storing this column as a code number in the table – and introduce lists of constraining values, views, and a function-based index, to recreate access paths that use the text content.

      Comment by Jonathan Lewis — December 12, 2010 @ 7:42 pm GMT Dec 12,2010 | Reply

      • Thanks a ton Jonathan…
        I really appreciate that you REMEMBER to post comments.Many thanks for the time you spent on analyzing and detailed write-up…

        Here are my answers…
        Point 1: Totally agree with you.

        Point 2: (There are some contradictions in the information you’ve supplied.)
        You are absolutely correct and i think ,the other day,i prematurely wrote how we populate the last col. I checked with application users and the developers and found that, application service collects row data from multiple (11+) source dbs and ‘EXTERNAL_REFERENCE_ID’ is PK-(Being populated through sequence in source db-NOT MY DB)….We are not using any sequence for this table. Apart from this, there were many character based unique values that are pre-existing in source EXTERNAL_REFERENCE_ID which are sized at 36 characters.(i believe this is due to some legacy code)
        That confirms that your assumption on strategy (b) is correct.

        Point 3: Got that….Thanks very much for clearing that doubt.

        Point 4:
        (a)I am working with the developers to find out what exactly is going on?
        (b) Yes..My db is running on I am opening SR to get PATCH-SET for ILT issue… (BTW ..Do you remember any patch-set or BUG# ?)
        (c) is important but i can delay the re-build process for next 30 days.
        (d)As of now, there is no complaint w.r.t. PIO. But i think, we are going to have IO issues as well going forward. This PK is basically used for idem-potency check. So, for every new record that is to be inserted, a trigger will check unique combination of these 3 columns in base table (through PK)…if no record found than, trigger will populate the IOT table and thus index will get populated (Just any fyi…IOT is just a name of a non partitioned table..its not an Index Org. Table)
        Since,PK height is 5, PK access for idem-potency check requires 5 IO to check uniqueness within index b-tree structure…and if rebuild is releasing significant amount of space and hence lowing index b-level, then IO is going to be improved by a huge factor.
        (e)Partitioning is definitely a great solution you are giving. Last weekend i worked on it.Actually, i am trying creating a real 128-HASH partitioned INDEX ORGANIZED table on EXTERNAL_REFERENCE_ID column having local PK on given 3 columns. But it think below two restrictions are going to create some noise going ahead…(though i have not tested that)..BTW…LIST-HASH partitioning is not supported for IOT in my 11g version.

        1)A partitioned IOT rebuild (move partition) cannot run in PARALLEL. Parallel option is not supported for IOT.
        2)Since i am creating HASH partitions,application service can write record into any of the HASH partitions amongst 128 partitions.So, if i am moving partitions (for re-org purpose….if required…),then MOVE statement can lock that partition and hence application cannot insert until partition movement is complete.

        (f)i will check histogram option too..i will update the blog with histogram information soon.
        (g) Point noted..
        (h)Again a great finding. Many thanks…i am going to check with developers to evaluate feasibility to accommodate this change.

        Comment by Bhavik Desai — December 13, 2010 @ 8:10 am GMT Dec 13,2010 | Reply

        • Height 5 is not necessarily a big deal – especially if you are not seeing any physical I/O problems relating to this index. There’s a fair chance that whether the height is 4 or 5 your I/O threat will only come from LEAF blocks, and the excess work will simply be the one extra LIO from the higher root block. Still, you may find that hash partitioning 128 ways saves you that one LIO and spreads out a little contention. You may even find that a larger number of partitions (256) is enough to drop you to height 3 – but you always have to consider the possible side effects of the number of partitions on less precise queries.

          I don’t want to see the histogram figures – that was just an option for you to investigate.

          When you rebuild, you should probably do it at PCTFREE 0 – if it’s only the legacy data that is non-sequential then it needs to be packed; and “4,464” index sections are only going to split on their right hand side, so they might as well start at 100% packed – you’re not “back-filling” old leaf blocks.

          One reason for looking at partitioning to small chunks (ca. 2GB at 128) is that you can rebuild each partition individually if it really needs it. With an IOT you can do “alter table XXX move online”. I think you can do this to a single partition in 11g (but you’ll have to check – it wasn’t possible in earlier versions.

          Comment by Jonathan Lewis — December 13, 2010 @ 7:25 pm GMT Dec 13,2010

  15. Hi Jonathan,

    Thank you for this nice way to investigate the state of an index.
    I used it many times to diagnose mainly FIFO index issues.
    I added two statistics in the report :
    – leaf count per section
    – empty leaf count per session (leafs with rrow = 0)
    This was useful to show the empty blocks on the “left side” of FIFO indexes after significantly reducing the retention period.

    Lately I found an index with nearly 50% empty blocks which is still claiming new extents.
    The empty blocks are quite evenly distributed in the 85% leftmost sections.

    My understanding is that leaf blocks with rrow = 0 are empty and are candidate to be reused.

    Did I miss something ?

    Thank you for your help.

    Comment by Vincent — January 7, 2013 @ 5:52 pm GMT Jan 7,2013 | Reply

    • Vincent,

      Thanks for the comment. That sounds like a nice extension to the query, making it easier to see the scale of the problem.

      You’re correct that empty blocks (rrow=0) should be re-used before new blocks are allocated but there seem to be various timing problems (possibly only related to very large deletes) with updating bitmap blocks under ASSM – I’ve just run a little test deleting all the rows from a few index leaf blocks, and the “emptyness” dropped to the range 25% – 50% rather than 0% – 25%, which means that at a high level Oracle can’t think the leaf blocks are empty.

      You might note that rrow drops to zero BEFORE the commit, so you could also imagine some problems relating to the timing of the commit, delayed block cleanout etc. even when using freelist management. You may find that Oracle manages to go through a cycle of using a few empty blocks, adding a few new ones, using a few empty, and so on.

      Comment by Jonathan Lewis — January 12, 2013 @ 11:45 am GMT Jan 12,2013 | Reply

RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by

%d bloggers like this: