Oracle Scratchpad

March 18, 2007

Thinking Big

Filed under: Indexing,Infrastructure,Tuning — Jonathan Lewis @ 9:38 pm GMT Mar 18,2007

This question came up recently on the Oracle-L list-server

We need to store approx. 300 GB of data a month. It will be an OLTP system. We want to use commodity hardware and open source database. We are willing to sacrifice performance for cost. e.g. a single row search from 2 billion rows table should be returned in 2 sec.

The author isn’t asking about Oracle specifically but, irrespective of the platform, it’s a question that prompts some interesting thought experiments. Imagine creating a high-precision index on a table of about that size, and see where that thought leads you.

If the key is about 16 bytes long, and assuming you are using 8KB blocks and achieve a fairly typical 70% efficiency (space usage per block) on the index, then you can get about 200 entries per block (see comment 11 for the arithmetic). It’s only a rough estimate, and there are various differences between leaf block and branch block usage that make this figure arguable – but at present I’m only after a ball-park figure.

At 200 entries per block, and 8KB blocks, you get:

Root                 1 block  =>            200 entries      8 KB
Branch 1           200 blocks =>         40,000 entries    1.6 MB
Branch 2        40,000 blocks =>      8,000,000 entries    320 MB
Leaf         8,000,000 blocks =>  1,600,000,000 entries     64 GB

So, for 1.6 billion rows in a table, you have an index that allows you to visit any one row after four steps down the index and a jump to the table.
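
As a sanity check, the sizing arithmetic above can be sketched in a few lines. This is not anything Oracle-specific – just the ball-park assumptions from the article (8KB blocks, 70% usage, 27 bytes per entry as explained in comment 11) turned into code:

```python
# Ball-park B-tree sizing, following the article's assumptions:
# 8KB blocks at 70% usage, ~27 bytes per index entry
# (16-byte key plus overheads - see comment 11 for the breakdown).

BLOCK_SIZE = 8 * 1024
USAGE = 0.70
ENTRY_SIZE = 27        # 16 (key) + 2 + 2 + 6 + 1 bytes of overhead

entries_per_block = int(BLOCK_SIZE * USAGE / ENTRY_SIZE)   # ~212; call it 200

def level_sizes(fan_out, levels):
    """Yield (blocks, entries) for each level of the tree, root downwards."""
    blocks = 1
    for _ in range(levels):
        yield blocks, blocks * fan_out
        blocks *= fan_out

for blocks, entries in level_sizes(200, 4):
    print(f"{blocks:>12,} blocks -> {entries:>16,} entries")
```

The last line of output reproduces the leaf level in the table above: 8,000,000 blocks holding 1.6 billion entries, i.e. about 64GB of leaf blocks.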

The question is, how many disk reads is that likely to take? The whole index is 64GB, and that may be a little large to cache especially since it’s likely to be competing with a few other indexes; but if this is supposed to be a “big system”, it would be reasonable to assume that the branch levels could be completely cached. (Note that the branch levels account for 0.5% of the total size of this index – and that’s a fairly typical figure for B-tree indexes). So a single row access would require 3 cache hits (root, branch 1, branch 2) followed by just two disk accesses. (You might like to think very carefully about how the figures would change if you tried “tuning” the index tree structure by using a larger blocksize.)

On modern hardware we could expect to achieve about 100 I/Os per second for “very random” reads – so our single row query should take no more than about 1/50th of a second.

But some indexes will not be unique – let’s assume (for the sake of the thought experiment) a fairly precise 24 rows per key value – so what will that do to the I/O? We get 200 key entries per leaf block, so it’s likely that we will still read just one leaf block: the problem is now a question of how many table blocks we will read. It might be anything between 1 and 24 – and in the worst (and, typically, the usual) scenario it will be 24 reads – for a total of 25 read requests: one quarter of a second.

Even that one-quarter of a second sounds good for the original poster – but the argument so far helps to identify two issues. First – the response time will be hugely dependent on both the precision of the index and the scattering of the table data. If each index probe hits 200 rows, and they are widely scattered, you read (probably) two index leaf blocks and 200 table blocks – note how the I/O is all about the table, not the index – for a total time of just over 2 seconds.

Secondly – if this is a big system, there are probably lots of users – that is, after all, one of the “environmental” factors that could cause 300GB of data to appear per month. So how many concurrent queries does the system have to face? At 25 I/Os per query, a single disk driven to the limit of its I/O capacity can manage a maximum of 4 queries per second. Put another way – multiply your number of physical disk devices by four, and that’s the absolute maximum number of queries per second your system can cope with. (And that’s assuming a fairly high precision of 25 block visits per query, with no interfering I/O from inserts, updates, deletes, read-consistency, logging, archive logging, sorting etc.)
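
That “disks times four” rule of thumb is a single division; here it is as a sketch under the same assumptions (100 random I/Os per second per spindle, 25 I/Os per query):

```python
# Maximum query throughput when every disk is driven to its random-I/O limit.

def max_queries_per_second(disks, iops_per_disk=100, ios_per_query=25):
    return disks * iops_per_disk // ios_per_query

print(max_queries_per_second(1))     # one disk: 4 queries/sec
print(max_queries_per_second(50))    # 50 disks: 200 queries/sec
```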

When it comes down to the bottom line, the ability to deal with data sizes and random queries has very little to do with the software you’re running – the first considerations are:

  • Optimising the data packing and indexing
  • Getting enough disk drives (i.e. lots)
  • Getting enough local memory to cache lots of index blocks – ideally the minimum would be ALL the branch blocks
  • Controlling the queries to minimise I/O

And just in case you’re thinking that a SAN with a massive cache would address these issues – it won’t.

SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer (but see note 7 for a pingback from Kevin Closson), database writer, and any direct path writes and re-reads from sorting and hashing.


  1. Jonathan, pretty good explanation.
    I guess that’s where thin IOTs possibly come into place… of course, some interesting engineering will have to take place, but coupled with the compact/random factor of the ever-changing data & possibly the data block visits (on an otherwise non-IOT solution), umm… that makes it a pretty cool exercise for the owner of the project :-)

    Comment by Cos — March 19, 2007 @ 2:25 am GMT Mar 19,2007 | Reply

  2. Your ball-park figures are pretty good.
    I have a real-world system like that here, and the three-billion-row index is taking little more than 75GB (key-compressed, partitioned IOT).

    Response times are fairly consistent with what you predicted.

    Comment by dba — March 19, 2007 @ 9:18 am GMT Mar 19,2007 | Reply

  3. “What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing.” But only within the limits of the cache size – stray into the massive sorts and hashes of the data warehouse world and things can slow down, even more so for the RAID 5 configured SAN user.

    Comment by Peter Scott — March 19, 2007 @ 12:30 pm GMT Mar 19,2007 | Reply

  4. Would using a SSD like BitMicro or RamSan help?

    Comment by Amit Kulkarni — March 19, 2007 @ 1:01 pm GMT Mar 19,2007 | Reply

  5. Amit, probably not. The figures are showing 64GB just to buffer one index completely. How much SSD would you have to buy to buffer all the indexes you want, just to eliminate one I/O per probe from the cost? How much SSD would you need to buffer the whole (row-size unspecified) table in order to eliminate the significant volume of I/Os?

    The whole SSD argument is a little suspect when you start talking about big databases. In fact, you should take a look at Kevin Closson’s blog for some interesting comments on CPU stalling, and memory access speeds to see just how misleading the SSD argument can be.

    Comment by Jonathan Lewis — March 19, 2007 @ 9:39 pm GMT Mar 19,2007 | Reply

  6. Cos, yes – IOTs for “thin” tables are an option – but the trade-offs are the extra I/O caused by the randomness of inserts, the smaller number of rows per block, and the larger size of secondary indexes. Getting the cost/benefit balance right can be tricky.

    DBA, thanks for the feedback – IOTs always add an interesting element to the sizing / resourcing problem.

    Comment by Jonathan Lewis — March 19, 2007 @ 10:14 pm GMT Mar 19,2007 | Reply

  7. [...] Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like [...]

    Pingback by SAN Array Cache and Filers Hate Sequential Writes « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases — March 20, 2007 @ 5:07 pm GMT Mar 20,2007 | Reply

  8. What advantages do skinny tables provide? I currently have a database consisting of 84 data elements per employee. I am considering creating a skinny table for each and every data element (index field, data value). My database has approximately 150 million records which span a 20-year period.

    I believe the skinny table design will significantly improve the performance of all my applications (more responsive to queries) at the major sacrifice of using more disk storage just for indices. My estimate is that 90% of the available storage (800 gigabytes) will be used for indices alone and 6% will be the actual data values.

    I’d appreciate any comments on the pros and cons of using skinny tables.

    Comment by dave — April 6, 2007 @ 12:07 pm GMT Apr 6,2007 | Reply

  9. Dave,

    Whatever else this does, it will probably add a significant overhead to data maintenance. The “skinny table” approach is one that should be used as carefully as the “add another index” approach. i.e. weighed carefully in terms of resource costs and benefits.

    If you really have 84 “atomic” data items per employee, then you ought to have 84 data columns in a table. You might then use row-level triggers to populate a few skinny tables with the subset of columns used by a few very popular queries.

    I certainly wouldn’t consider creating 84 tables with one data item per table – the join costs of putting together the data for even a simple query would be prohibitive.

    Comment by Jonathan Lewis — April 8, 2007 @ 4:00 pm GMT Apr 8,2007 | Reply

  10. [...] key index has a height of four but still fits comfortably in 18G. Based on the example shown in this entry, 6G of memory would be adequate for such a index if you had a 16 byte key and the index was [...]

    Pingback by Hit Ratios (4) « Oracle Scratchpad — September 26, 2007 @ 9:23 pm GMT Sep 26,2007 | Reply

  11. I’ve just referenced this note from an item on the Oracle forums, and noticed that it’s not obvious how I got from 16 byte keys to 200 entries per block when running at 70% usage.

    You have to know that a standard b-tree index entry is actually going to be: key size + 2 bytes row overhead + 2 bytes row pointer + 6 bytes rowid + a possible one byte for rowid length for a total of 27 bytes. (And if you’ve forgotten to include length bytes in your key length, it may be a little more).

    Comment by Jonathan Lewis — March 11, 2008 @ 8:33 pm GMT Mar 11,2008 | Reply
