Oracle Scratchpad

January 9, 2022

OERI 6051

Filed under: Block Size,humour,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 9:27 pm GMT Jan 9,2022

How smashed (in the non-alcoholic sense) can an index be?

One of the components in the cost calculation for an indexed access path is the “blevel” (branch-level) or, indirectly, the “height” of the index. Both count the steps from the root block down to a leaf block (and all leaf blocks are at the same distance from the root – that’s the meaning of “balanced” in the expression “balanced B-tree”) but the height includes the leaf level in the count while the blevel excludes it and counts down only to the lowest level of branch blocks (so height = blevel + 1).

In many cases you will find that even with a few million entries in a single index segment the height may still be only 3 (blevel = 2), and it may take a few tens of millions of rows before an index needs to grow to height = 4 (blevel = 3).

It’s often the case that the number of index entries per leaf block and block pointers per branch block is around 200 to 400, so the rate at which the height/blevel grows is tiny compared to the rate at which the number of rows in the index increases. But algorithms often have weak points, and some time around the year 2000 I started demonstrating an edge case where I could crash a session in less than 3 seconds (and most of that time was spent on Oracle creating the crash dump) by inserting just 25 (carefully designed) rows into a table.
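To put numbers on that growth rate, here’s a back-of-the-envelope sketch (the fanout of 300 is an assumption picked from the 200 to 400 range quoted above) of the smallest height that can hold a given number of index entries:

```python
def min_height(n_rows, fanout=300):
    """Smallest B-tree height whose capacity (fanout ** height)
    covers n_rows index entries, assuming a uniform fanout."""
    height, capacity = 1, fanout
    while capacity < n_rows:
        height += 1
        capacity *= fanout
    return height

# A few million entries still fit in height 3 (blevel 2) ...
print(min_height(5_000_000))     # 3
# ... and it takes tens of millions to force height 4 (blevel 3).
print(min_height(50_000_000))    # 4
```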

I published an article about this in 2005 (see footnote), but since then the algorithm has changed. My demo worked in earlier versions, but in later versions Oracle Corp. modified the way that index blocks (possibly just the branch blocks) split at the low end of the index, making it harder to achieve a crash. If you have a MOS account you can check Doc ID 1748260.8: OERI:6051 possible during index manipulation.

The change wasn’t a response to my demo, of course; it was in response to a problem that could occur in production systems running some of the big “database agnostic” accounting or HR or CRM systems that created huge indexes on multiple columns. Even when the crash never occurred, the nature of the application and its indexing strategy could result in some indexes growing to a ridiculous height that made a dramatic difference to the cost calculations (hence the desirability of the “best” index).

It’s harder, and less likely to happen in the wild, but it’s still possible to make the same crash occur even in the newest versions of Oracle. It will (probably) take roughly 8 million (power(2,23) + 1) rows and 32GB of space to cause the crash (or 128GB if you want to play nicely and use an 8KB block size – and tweak my code a little further).
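The arithmetic behind those space figures is just block counting. Here’s a quick sketch; it assumes the worst case that the script below constructs – one row per 2KB leaf block, and one fewer branch block than leaf blocks:

```python
rows      = 2 ** 23 + 1                 # 8,388,609 - the "roughly 8 million"
leaf_blks = 2 ** 23                     # one row per leaf block in the worst case
br_blks   = leaf_blks - 1               # every branch block holds just two entries

total_blocks = leaf_blks + br_blks      # just under 16M blocks
print(total_blocks * 2048 / 2 ** 30)    # ~32  GB with a 2KB block size
print(total_blocks * 8192 / 2 ** 30)    # ~128 GB with an 8KB block size
```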

Richard Foote spotted a slide with a surprising blevel in a short presentation about CBO arithmetic by Maria Colgan a couple of days ago, so I thought it would be entertaining to tweak the old code to see if it could still cause the crash. So here it is:

rem     Script:         silly_index_3a.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2004 / Jan 2022
rem     Purpose:        Build an index with a large blevel
rem     Notes: 
rem     Uses 2K block size for tablespace holding the index
rem     Estimated run-time (for me) with m_blevel = 23 - ca. 1 hour

define m_blevel = 5
define m_rows = power(2,&m_blevel)

drop table t1 purge;
create table t1 (v1 varchar2(8));

create index t1_i1 on t1(substrb(lpad(v1,1469,'0'),1,1469))
tablespace test_2k
;

-- execute snap_my_stats.start_snap

prompt  ===================================================
prompt  Inserting &m_rows (short) rows in reverse order
prompt  ===================================================

begin
        for i in reverse 1..&m_rows loop
                insert into t1 values (i);
--              commit;
        end loop;
end;
/

-- execute snap_my_stats.end_snap

prompt  ================
prompt  Validating index
prompt  ================

validate index t1_i1;

select
        lf_rows, height, height-1 blevel, lf_blks, br_blks
from
        index_stats
;

column  object_id new_value m_object_id

select  object_id
from    user_objects
where   object_name = 'T1_I1'
;

alter session set events 'immediate trace name treedump level &m_object_id';

insert into t1(v1) values('0');

I’ve precreated a tablespace called test_2k with a block size of 2KB for this demo; you’ll need a couple of percent over 32GB for this tablespace.

This script then creates a table in my default tablespace to hold a small character column, and a function-based index on that column that produces a character result of 1,469 bytes (which gives me the largest possible index entry that’s allowed in a 2KB block size). The older version of the code used a simple lpad() to do this, but the newer versions decided that that could produce up to 2 * 1,469 bytes thanks to my default character set – hence the substrb(); note especially the b for byte.
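Python’s character/byte split gives a loose analogy for the substr()/substrb() distinction (utf-16-le here is just a stand-in for any fixed two-byte character set, not a claim about the database’s actual character set):

```python
s = "0" * 1469                 # 1,469 characters, as lpad() would produce
b = s.encode("utf-16-le")      # 2 bytes per character in this encoding
print(len(s), len(b))          # 1469 2938 - the column could need 2 * 1469 bytes

# substrb-style truncation operates on bytes, not characters:
print(len(b[:1469]))           # 1469 - guaranteed to fit the byte limit
```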

With the structure in place I’ve then inserted numeric values in descending order into the table so that the index is constantly doing leaf block splits at the left hand (low) end.

Once I’ve populated the table I use a call to validate index so that I can report the number of rows, leaf blocks, branch blocks and the height (and blevel) of the index; then I find it’s object_id so that I can do a treedump of it.

For m_blevel = 5, here are the results of the query against index_stats after the call to validate the index:

   LF_ROWS     HEIGHT     BLEVEL    LF_BLKS    BR_BLKS
---------- ---------- ---------- ---------- ----------
        32          6          5         32         31

As you can see, setting m_blevel = 5 I get an index with blevel = 5, and 2^5 leaf blocks each holding one row. If you set m_blevel to 23 you’ll end up (after about 1 hour, probably) with a blevel of 23 and 8,388,608 rows and leaf blocks (and branch blocks = leaf blocks – 1: hence the 32GB+ requirement for the tablespace … 16M blocks at 2KB per block, plus ASSM overheads).
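Those index_stats relationships fall straight out of the construction. Here’s a small sketch of the worst-case arithmetic, assuming (as the treedump confirms) one row per leaf block and two entries per branch block:

```python
def worst_case_stats(m_blevel):
    """Worst-case shape: a complete binary tree of branch blocks
    over 2 ** m_blevel leaf blocks, each leaf holding a single row."""
    lf_blks = 2 ** m_blevel
    br_blks = lf_blks - 1          # internal nodes of a complete binary tree
    height  = m_blevel + 1         # height counts the leaf level, blevel doesn't
    return lf_blks, br_blks, height

print(worst_case_stats(5))         # (32, 31, 6) - matches the validate output
print(worst_case_stats(23))        # (8388608, 8388607, 24)
```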

To show you what’s happening inside the index, here’s the treedump (from the trace file) with m_blevel = 5:

branch: 0x4c00204 79692292 (0: nrow: 2, level: 5)
   branch: 0x4c0022a 79692330 (-1: nrow: 2, level: 4)
      branch: 0x4c00212 79692306 (-1: nrow: 2, level: 3)
         branch: 0x4c00206 79692294 (-1: nrow: 2, level: 2)
            branch: 0x4c0020c 79692300 (-1: nrow: 2, level: 1)
               leaf: 0x4c00209 79692297 (-1: row:1.1 avs:370)
               leaf: 0x4c00249 79692361 (0: row:1.1 avs:370)
            branch: 0x4c00248 79692360 (0: nrow: 2, level: 1)
               leaf: 0x4c00247 79692359 (-1: row:1.1 avs:370)
               leaf: 0x4c00246 79692358 (0: row:1.1 avs:370)
         branch: 0x4c00244 79692356 (0: nrow: 2, level: 2)
            branch: 0x4c00243 79692355 (-1: nrow: 2, level: 1)
               leaf: 0x4c00242 79692354 (-1: row:1.1 avs:370)
               leaf: 0x4c0024f 79692367 (0: row:1.1 avs:370)
            branch: 0x4c0024e 79692366 (0: nrow: 2, level: 1)
               leaf: 0x4c00235 79692341 (-1: row:1.1 avs:370)
               leaf: 0x4c00239 79692345 (0: row:1.1 avs:370)
      branch: 0x4c00238 79692344 (0: nrow: 2, level: 3)
         branch: 0x4c00237 79692343 (-1: nrow: 2, level: 2)
            branch: 0x4c00236 79692342 (-1: nrow: 2, level: 1)
               leaf: 0x4c00234 79692340 (-1: row:1.1 avs:370)
               leaf: 0x4c00233 79692339 (0: row:1.1 avs:370)
            branch: 0x4c00232 79692338 (0: nrow: 2, level: 1)
               leaf: 0x4c00231 79692337 (-1: row:1.1 avs:370)
               leaf: 0x4c00230 79692336 (0: row:1.1 avs:370)
         branch: 0x4c0023f 79692351 (0: nrow: 2, level: 2)
            branch: 0x4c0023e 79692350 (-1: nrow: 2, level: 1)
               leaf: 0x4c0023d 79692349 (-1: row:1.1 avs:370)
               leaf: 0x4c0023c 79692348 (0: row:1.1 avs:370)
            branch: 0x4c00225 79692325 (0: nrow: 2, level: 1)
               leaf: 0x4c0022d 79692333 (-1: row:1.1 avs:370)
               leaf: 0x4c0022c 79692332 (0: row:1.1 avs:370)
   branch: 0x4c0022b 79692331 (0: nrow: 2, level: 4)
      branch: 0x4c00229 79692329 (-1: nrow: 2, level: 3)
         branch: 0x4c00228 79692328 (-1: nrow: 2, level: 2)
            branch: 0x4c00227 79692327 (-1: nrow: 2, level: 1)
               leaf: 0x4c00226 79692326 (-1: row:1.1 avs:370)
               leaf: 0x4c00224 79692324 (0: row:1.1 avs:370)
            branch: 0x4c00223 79692323 (0: nrow: 2, level: 1)
               leaf: 0x4c00222 79692322 (-1: row:1.1 avs:370)
               leaf: 0x4c0022f 79692335 (0: row:1.1 avs:370)
         branch: 0x4c0022e 79692334 (0: nrow: 2, level: 2)
            branch: 0x4c00215 79692309 (-1: nrow: 2, level: 1)
               leaf: 0x4c00219 79692313 (-1: row:1.1 avs:370)
               leaf: 0x4c00218 79692312 (0: row:1.1 avs:370)
            branch: 0x4c00217 79692311 (0: nrow: 2, level: 1)
               leaf: 0x4c00216 79692310 (-1: row:1.1 avs:370)
               leaf: 0x4c00214 79692308 (0: row:1.1 avs:370)
      branch: 0x4c00213 79692307 (0: nrow: 2, level: 3)
         branch: 0x4c00211 79692305 (-1: nrow: 2, level: 2)
            branch: 0x4c00210 79692304 (-1: nrow: 2, level: 1)
               leaf: 0x4c0021f 79692319 (-1: row:1.1 avs:370)
               leaf: 0x4c0021e 79692318 (0: row:1.1 avs:370)
            branch: 0x4c0021d 79692317 (0: nrow: 2, level: 1)
               leaf: 0x4c0021c 79692316 (-1: row:1.1 avs:370)
               leaf: 0x4c00208 79692296 (0: row:1.1 avs:370)
         branch: 0x4c00207 79692295 (0: nrow: 2, level: 2)
            branch: 0x4c00205 79692293 (-1: nrow: 2, level: 1)
               leaf: 0x4c0020f 79692303 (-1: row:1.1 avs:370)
               leaf: 0x4c0020e 79692302 (0: row:1.1 avs:370)
            branch: 0x4c0020d 79692301 (0: nrow: 2, level: 1)
               leaf: 0x4c0020b 79692299 (-1: row:1.1 avs:370)
               leaf: 0x4c0020a 79692298 (0: row:1.1 avs:370)
----- end tree dump

As you can see, every branch block (which includes the root block) holds exactly 2 entries, and every leaf block holds just one row.

Once you’ve tested the code with a couple of small starting values you might want to skip the validate index and treedump steps – they might take quite a long time (especially since the treedump will write a trace file of 16M+ lines). The other thing to watch out for is that the script will generate something like 200GB of redo and 72GB of undo – so you might want to remove the comment marker from the commit in my PL/SQL loop and check your auto undo settings and auto extend settings on the undo files.

I should also point out that I’ve used 1469 for the substrb(lpad()) because that’s the largest string I can use for the index definition – but the same index pattern of use (i.e. one row per leaf block, two children per branch block) will appear if you reduce this (for a 2KB block size, using ASSM) to 915 bytes. And this is instructive because if you use the smaller size and then eliminate the “reverse” in the loop the index branch blocks pack more efficiently and the blevel is smaller (even though the index leaf-block count is unchanged).

In the good old days the maximum allowed height for a B-tree index was 24 (blevel = 23). I haven’t got a spare 32GB in any of my virtual machines at present so I haven’t checked to see if this is still true; but if you do run a test with m_blevel = 23, the final line of the script (inserting a zero row) should result in an ORA-00600 error with first parameter 6051 if the limit hasn’t changed.


Footnote 1

Here’s a link to the original document (how_high.doc) that I wrote and published in dbazine in 2005, now available (very slowly) on the Wayback Machine, and made redundant by a change in the branch block split algorithm in 10g.

Footnote 2

At some point I did discover a bug note on MOS (Metalink, in those days) that reported a performance problem due to an index with a very high blevel (I have a vague memory of it reaching double digits, but not getting very close to the limit – possibly something like 15). So there is a serious point to this post – bad column definitions with bad index definitions (and a silly block size) and a bit of bad luck with the pattern of data insertion can lead to unexpected problems.

August 3, 2017

Block Sizes

Filed under: Block Size,Infrastructure,Oracle — Jonathan Lewis @ 7:31 pm BST Aug 3,2017

Here’s an old draft based on a very old tpc-c test that I’ve only just rediscovered. There are a couple of brief comments at the end that I’ve added just before publishing, otherwise this is just as I wrote it years ago.

Several years ago I copied some results from a couple of TPC-C reports of what were then, if I recall correctly, the top performance results for the current Oracle version. I made the copy in May 2010, but never got around to writing up a note commenting on them – but the comment I have to make is pretty timeless so here it is, only seven years late [update: now 10 years, presumably I wrote these notes in 2017].

One of the reports was for a Sun SPARC Enterprise T5440 Server Cluster running RAC with 12 instances, the other for an HP Integrity Superdome-Itanium2/1.6GHz/24MB iL3 running a single instance. The results aren’t intended as any sort of reflection on current Sun and HP equipment, or current performance; I’ve published them together to highlight one simple idea. How do you answer someone who asks questions like: how big should my cache size be, should I use a keep or recycle cache, should I use a different block size for my indexes?

Take a look at these two sets of figures from (we assume) the best experts in their particular platform when they are trying to address exactly the same performance problem.

                       Sun SPARC             HP Superdome
db_2k_cache_size                             60G
db_4k_cache_size       2G                    1G  (default cache)
db_8k_cache_size       65G (default cache)   10G
db_16k_cache_size      95G                   476G
db_keep_cache_size     210G                  1220G
db_recycle_cache_size  40G                   16G

Clearly the HP has about 1783G allocated to the total buffer cache compared to “only” 415G on the Sun – a factor of just over 4 difference – so it’s not surprising that the values for the individual caches are rather different. What is significant, though, is that the values are not consistently different.
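Summing the two columns is a quick sanity check (the figures are transcribed from the table above, with blank cells treated as zero):

```python
sun = {"4k": 2, "8k": 65, "16k": 95, "keep": 210, "recycle": 40}           # GB
hp  = {"2k": 60, "4k": 1, "8k": 10, "16k": 476, "keep": 1220, "recycle": 16}

sun_total, hp_total = sum(sun.values()), sum(hp.values())
print(sun_total, hp_total)             # 412 1783 - roughly 415G vs 1783G
print(round(hp_total / sun_total, 2))  # 4.33 - the "factor of just over 4"
```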

Why isn’t the Sun using the 2K cache at all – surely if it’s a good idea to use very small blocks for something it’s a good idea for both platforms? In fact it was a common claim in some quarters that small blocks were a really good idea for RAC, and it was the Sun that was running RAC. (Of course that doesn’t really matter if you can rig your system to avoid cross-instance traffic.)

Why is the default cache on the Sun 6 times the size of the 8K cache on the HP rather than being in scale at roughly 1/4? Alternatively, why is the default cache on the Sun 65 times the size of the default cache on the HP (remember some tablespaces have to match the default block size)?

Similarly the recycle caches show the same reversal in their relative sizes. Why?

Why, since the 16K block size has been seen as “quite nice”, has neither platform gone to the even nicer 32K? One of the loudly touted myths of the time was that bigger blocks were good for indexes – but neither test uses the largest possible block size. (Actually, I have to note that I think the HP implementation didn’t allow for 32KB Oracle blocks, but I may be wrong about that.)

The purpose of this post isn’t to question exactly why the two configurations are significantly different for the same task – the purpose is to point out that picking a block size, or set of block sizes, or use of the keep and recycle caches, isn’t a trivial task with a knee-jerk right answer that you can get by posting a question on OTN or reading a couple of brief articles on a web-site.


If you have to ask for help choosing the best block size – use the default for your platform, and don’t mess with the keep and recycle unless you’re prepared to pay an expert to spend a little time helping you, and then spend quite a lot of time doing the tests that they should have suggested.


Addendum July 2020

It’s a very long time since I read any of the tpc-c reports – they tend to run to dozens of pages of description of exactly what hardware and software has been used, what scripts were used to populate tables, what parameters were set, and so on. One thing I do remember from reading a few of them is that they weren’t realistic attempts to implement a working system – they were only attempts to take a “realistic” data set but get it into the database in a way that would allow the tpc-c test to run as quickly as possible.

For example –

if you use a 1TB disc to store 100GB of data you can position the data on the fastest part of the disc

if you create a critical table as a single-table hash cluster, and pick a block size that will hold exactly all the rows for one key, you can access all the rows for a single key value very quickly. And when you notice that the table can store all the data for a key in a 2KB block you minimise the pressure on the buffer cache by putting that table into its own 2KB buffer cache. The fact that you have to predefine the size of the table/cluster doesn’t matter if the tpc-c test doesn’t worry about what happens “tomorrow” as more data arrives. (Note: I recall that in one 10g or 11g tpc-c test Oracle used a “future” feature of range partitioned hash clusters – which finally appeared in 12c.)





June 15, 2011

Block size

Filed under: Block Size,Infrastructure,Oracle — Jonathan Lewis @ 5:55 pm BST Jun 15,2011

I knew that failing to have a db_Nk_cache_size setting for your database could cause a statement to crash when it tried to address an object (or tablespace) using a non-standard block size, reporting errors like:

    ORA-29339: tablespace block size 16384 does not match configured block sizes
    ORA-00379: no free buffers available in buffer pool DEFAULT for block size 16K

Here’s an interesting variation on the theme, reported in a note on the OTN database forum. Note particularly the ORA-603 and ORA-604 that wrap the ORA-379; and that the user states that the problem cache is the standard block size for the database. Unfortunately we never saw a resolution to this thread – perhaps it was simply a case of a cache that was too small when the database got very busy.

Footnote: a database can fail to open if it needs to do recovery in a tablespace for which there is no buffer set. Of course this is only likely to happen if you’re running with an init.ora file and have created a non-standard cache with ‘alter system’ calls while the database was previously up. Here’s an extract from an alert log showing the type of report you get:

Fri May 20 17:58:38 2011
Beginning crash recovery of 1 threads
 parallel recovery started with 2 processes
Started redo scan
Completed redo scan
 374 redo blocks read, 98 data blocks need recovery
Fri May 20 17:58:40 2011
Slave exiting with ORA-379 exception
Errors in file c:\oracle\diag\rdbms\d11g\d11g\trace\d11g_p000_2056.trc:
ORA-00379: no free buffers available in buffer pool  for block size 16K
Aborting crash recovery due to slave death, attempting serial crash recovery
Beginning crash recovery of 1 threads
Started redo scan
Completed redo scan
 374 redo blocks read, 98 data blocks need recovery
Aborting crash recovery due to error 379
Errors in file c:\oracle\diag\rdbms\d11g\d11g\trace\d11g_ora_3536.trc:
ORA-00379: no free buffers available in buffer pool  for block size 16K
ORA-379 signalled during: ALTER DATABASE OPEN...

September 25, 2009


Filed under: ASSM,Block Size,Bugs,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 6:03 pm BST Sep 25,2009

There are times as I browse through Metalink when I see descriptions of bugs that make me wonder how on earth anyone managed to find them. There are bugs which are so bizarre in their combination of prerequisites that you might think they’d never, ever, show up. Here’s one that got a mention on OTN some time back.

Problem: an update on a simple, unindexed, table takes 90 minutes if the table is in a tablespace using a 16KB block size; but closer to 90 seconds if the table is in a tablespace using a 4KB block size. The effect is totally reproducible.

May 21, 2009

Row Directory

Filed under: Block Size,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 8:52 am BST May 21,2009

Yesterday I asked the question: how many row entries can you create in an 8KB block with pctfree 0. It brought out some interesting observations – including one I hadn’t thought of relating to a small difference between ASSM and freelist management.

I suspect, however, that most people didn’t quite realise the significance of the wording: “row entries” – I wasn’t asking about “rows”. So here’s a little script you can run after you create the table, followed by the result of dumping the first block in the table.

March 22, 2009

Block size – again

Filed under: Block Size,Infrastructure,Performance,trace files,Troubleshooting,undo — Jonathan Lewis @ 7:09 pm GMT Mar 22,2009

Here’s a little oddity that I came across at a client site recently.

The client called me in because they were having problems with Oracle error “ORA-01555: snapshot too old” appearing in a particular task after a few thousand seconds (typically 5,000 to 6,000) even though they had set the undo_retention to 14,400 seconds and had a huge undo tablespace running with autoextend enabled on the data files.


October 28, 2008

IOTs and blocksize

Filed under: Block Size,Infrastructure,IOT,Oracle,Performance,Tuning — Jonathan Lewis @ 7:17 pm GMT Oct 28,2008

A question came up on the Oracle database forum a few months ago asking:

What are the benefits and the downside of using IOTs on 16k blocks? Would you recommend it?

I think the best response to the generic question about block sizing came from Greg Rahn in another thread on the forum:

If someone has to ask what block size they need. The answer is always 8KB.[1]


July 19, 2008

Block Sizes

Filed under: Block Size,Infrastructure,Troubleshooting — Jonathan Lewis @ 7:56 pm BST Jul 19,2008

There is a recent thread on the OTN forum [ed Jan 2009: the link is temporarily broken] about block sizing and the impact it may have on performance. The thread contains some interesting insights into the problems of testing and trouble-shooting. Unfortunately the thread has accumulated so much extraneous material that it has become hard to separate the wheat from the chaff, so I thought I’d make it easy to follow a couple of useful discussions that ran through the thread by supplying a sequence of URLs that you can visit in turn.

[Ed Jan 2009]
According to this comment from Justin Kestelyn, extremely long messages were causing performance problems in the current version of the forum software and have been temporarily archived to a separate location – and this thread contained some very long messages. Since it may be some time before it reappears on the forum I’ve posted a copy made just before it disappeared so that (until the original reappears) you can read the interesting bits here. Warning: the file is a PDF of about 3MB – it will open in a separate window so that you can read the file and my comments about the file at the same time. When I view the file it runs to 218 pages, so I’ve added page references after the links in the blog note (the links, of course, do not currently work).

(When first created the document was 233 pages long, but following a request from Don Burleson I removed a couple of entries that quoted material from his website, then following a request from Janet Burleson I removed all entries made by Don Burleson, so some of the page references may still be a little inaccurate.)


July 22, 2007


Filed under: Block Size,Infrastructure,Troubleshooting — Jonathan Lewis @ 7:48 pm BST Jul 22,2007

A comment I always make in my seminar about Oracle technology is that the best way to find bugs is to combine two different bits of the technology. It’s easy to say, of course, but how can you possibly guess which combinations are going to produce bugs?

Here’s a really surprising one:  Bug no: 3469992 – fixed in 10.2 (Mar 2006) so probably no longer relevant to most people.

Rediscovery Information: 
        If flashback query fails with ORA-600[15201] and db_block_size is  
        bigger than 8K, it is due to this problem.    

Workaround:
        using smaller block_size

How on earth are you supposed to guess that flashback query and blocksizes would have anything to do with each other? After all, flashback query is just read-consistency from a different perspective – it couldn’t possibly go wrong!
