Thinking Big

March 18, 2007

Thinking Big

Filed under: Indexing,Infrastructure,Tuning — Jonathan Lewis @ 9:38 pm GMT Mar 18,2007

This question came up recently on the Oracle-L list-server

We need to store approx. 300 GB of data a month. It will be an OLTP system. We want to use commodity hardware and open source database. We are willing to sacrifice performance for cost. e.g. a single row search from 2 billion rows table should be returned in 2 sec.

The author isn’t asking about Oracle specifically but, irrespective of the platform, it’s a question that prompts some interesting thought experiments. Imagine creating a high-precision index on a table of about that size, and see where that thought leads you.

If the key is about 16 bytes long, and assuming you are using the (fairly standard) 8KB block size and achieve a fairly typical 70% efficiency (space usage per block) on the index, then you will get about 200 entries per block (see comment 11 for explanation of this estimate). It’s only a rough estimate and there are various differences between leaf block and branch block usage to make this figure a significant under-estimate for the branch blocks – but at present I’m only after a ball-park figure.

At 200 entries per block, and an 8KB block size you get:

Root                    1 block  =>            200 entries        8   Kilobytes
Branch level 1        200 blocks =>         40,000 entries        1.6 Megabytes
Branch level 2     40,000 blocks =>      8,000,000 entries      320   Megabytes
Leaf level      8,000,000 blocks =>  1,600,000,000 entries       64   GigaBytes

So, for 1.6 billions rows in a table, you have an index which allows you to visit “any one row” after four steps down the index and a jump to the table.

Note: Oracle actually counts the branch levels from the bottom up – because that’s the way that index block splits propagate – so a treedump would label this list with Branch level 3, Branch level 2, Branch level 1, Leaf.

So how many disk reads is it likely to take to read “any one row”? The whole index is 64GB, and that may be a little large to cache especially since it’s likely to be competing with a few other indexes; but if this is supposed to be a “big system” it would be reasonable to assume that the branch levels could be completely cached. (Note that the branch levels account for 0.5% of the total size of this index – and that’s a fairly typical figure for B-tree indexes). So a single row access would require 3 cache hits (root, branch 1, branch 2) followed by just two disk accesses (leaf, table). (You might like to think very carefully about how the figures might change if you tried “tuning” the index tree structure by using a larger blocksize.)

On modern hardware we could expect to achieve about 100 I/Os per second for “very random” reads – so our single row query should take no more than about 1/50th of a second.

But some indexes will not be unique – let’s assume (for the sake of the thought experiment) a fairly consistent 24 rows per key value – so what will that do to the I/O. We get 200 key entries per leaf block, so it’s likely that we will still read just one leaf block: the problem is now a question of how many table blocks we will read. It might be anything between 1 and 24 – in the worst (and very common) scenario it will be 24 reads – for a total of 25 read requests: which equates to one quarter of a second.

Even that one-quarter of a second sounds good for the original poster – but the argument so far helps to identify two issues. First – the response time will be hugely dependent on both the precision of the index and the scattering of the table data. If each index probe hits 200 rows that are widely scattered, you read (probably) two index leaf blocks and 200 table blocks – note how the I/O is all about the table, not the index – for a total time of just over 2 seconds.

Secondly – if this is a big system there are probably lots of users – that is, after all, one of the “environmental” factors that could cause 300GB of data to appear per month. So how many concurrent queries does the system have to face. At just 25 I/Os per query, we can manage a maximum of 4 queries per second taking a single disk to the limit of its I/O capacity. Put another way – multiply your number of physical disk devices by four, and that’s the absolute maximum number of concurrent queries your system can cope with. (And that’s assuming fairly high precision of 25 rows visited per query, with no interfering I/O from inserts, updates, deletes, read-consistency, logging, archive logging, sorting etc.).

When it comes down to the bottom line, the ability to deal with data sizes and random queries has very little to do with the software you’re running – the first considerations are:

Optimising the data packing and indexing
Getting enough disk drives (i.e. lots)
Getting enough local memory to cache lots of index blocks – ideally the minimum would be ALL the branch blocks
Controlling the queries to minimise I/O

And just in case you’re thinking that a SAN with a massive cache would address these issues – it won’t.

SANs can move large amounts of data around very quickly – but don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer (but see comment 7 for a pingback from Kevin Closson), database writer, and any direct path writes and re-reads from sorting and hashing.

Comments (12)

12 Comments »

Jonathan, pretty good explanation.
I guess, that’s where thin IOT’s possibly come into place… of course, some interesting engineering will have to take place but coupled with the compact/random factor of the ever changing data & possible the data block visits (on an otherwise non-IOT solution), umm…. that makes it for an pretty cool exercise for the owner of the project :-)

Comment by Cos — March 19, 2007 @ 2:25 am GMT Mar 19,2007 | Reply
your ball-park figures are pretty good.
I have a real-world system like that here, and the three billion-row index is taking little more than 75GB (key compressed, partitioned IOT)

response times are fairly consistent with what you predicted.

Comment by dba — March 19, 2007 @ 9:18 am GMT Mar 19,2007 | Reply
What the SAN can give you is a write-cache benefit for the log writer, database writer, and any direct path writes and re-reads from sorting and hashing But only within the limits of the cache size – stray into the massive sorts and hashes of the data warehouse world and things can slow down, even more so for the RAID 5 configured SAN user

Comment by Peter Scott — March 19, 2007 @ 12:30 pm GMT Mar 19,2007 | Reply
Would using a SSD like BitMicro or RamSan help?

Comment by Amit Kulkarni — March 19, 2007 @ 1:01 pm GMT Mar 19,2007 | Reply
Amit, probably not. The figures are showing 64GB just to buffer one index completely. How much SSD would you have to buy to buffer all the indexes you want, just to eliminate one I/O per probe from the cost ? How much SSD would you need to buffer the whole (row-size unspecified) table in order to eliminate the significant volume of I/Os ?

The whole SSD argument is a little suspect when you start talking about big databases. In fact, you should take a look at Kevin Closson’s blog for some interesting comments on CPU stalling, and memory access speeds to see just how misleading the SSD argument can be.

Comment by Jonathan Lewis — March 19, 2007 @ 9:39 pm GMT Mar 19,2007 | Reply
Cos, yes – IOTs for “thin” tables are an option – but the trade-offs are the extra I/O caused by the randomness of inserts, the smaller number of rows per block, and the larger size of secondary indexes. Getting the cost/benefit balance right can be tricky.

DBA, thanks for the feedback – the IOT always add an interesting element to the sizing / resourcing problem.

Comment by Jonathan Lewis — March 19, 2007 @ 10:14 pm GMT Mar 19,2007 | Reply
[…] Lewis has taken on a recent Oracle-l thread about Thinking Big. It’s a good blog entry and I recommend giving it a read. The original Oracle-l post read like […]

Pingback by SAN Array Cache and Filers Hate Sequential Writes « Kevin Closson’s Oracle Blog: Platform, Storage & Clustering Topics Related to Oracle Databases — March 20, 2007 @ 5:07 pm GMT Mar 20,2007 | Reply
What advantges do skinny tables provide? I currently have a database consisting of 84 data elements per employee. I am considering creating a skinny table for each and every data element( index field, data value). My database has approximately 150 million records whcih spans a 20 year period of time.

I believe the skinny table design will significantly improve the performance of all my applications (more responsive to queries) at a major sacrifice of utilizing more disk storage just for indices. My estimate is 90% of the available storage ( 800 Gigabytes) will be used for indices alone and 6% will be the actual data values

Appreciate any additional comments of pros and cons for use of skinny tables.

Comment by dave — April 6, 2007 @ 12:07 pm BST Apr 6,2007 | Reply
Dave,

Whatever else this does, it will probably add a significant overhead to data maintenance. The “skinny table” approach is one that should be used as carefully as the “add another index” approach. i.e. weighed carefully in terms of resource costs and benefits.

If you really have 84 “atomic” data items per employee, then you ought to have 84 data columns in a table. You might then use row-level triggers to populate a few skinny tables with the subset of columns used by a few very popular queries.

I certainly wouldn’t consisder creating 84 tables with one data item per table – the join costs of putting together the data for even a simple query would be prohibitive.

Comment by Jonathan Lewis — April 8, 2007 @ 4:00 pm BST Apr 8,2007 | Reply
[…] key index has a height of four but still fits comfortably in 18G. Based on the example shown in this entry, 6G of memory would be adequate for such a index if you had a 16 byte key and the index was […]

Pingback by Hit Ratios (4) « Oracle Scratchpad — September 26, 2007 @ 9:23 pm BST Sep 26,2007 | Reply
I’ve just referenced this note from an item on the Oracle forums, and noticed that it’s not obvious how I got from 16 byte keys to 200 entries per block when running at 70% usage.

You have to know that a standard b-tree index entry is actually going to be: key size + 2 bytes row overhead + 2 bytes row pointer + 6 bytes rowid + a possible one byte for rowid length for a total of 27 bytes. (And if you’ve forgotten to include length bytes in your key length, it may be a little more).

Comment by Jonathan Lewis — March 11, 2008 @ 8:33 pm GMT Mar 11,2008 | Reply
[…] entries in a single index segment the height may still be only 3 (blevel = 2), and it may take a few tens of millions of rows before an index needs to grow to height = 4 (blevel = […]

Pingback by OERI 6051 | Oracle Scratchpad — January 9, 2022 @ 9:27 pm GMT Jan 9,2022 | Reply