This question came up recently on the Oracle-L list-server:

“We need to store approx. 300 GB of data a month. It will be an OLTP system. We want to use commodity hardware and open source database. We are willing to sacrifice performance for cost. e.g. a single row search from 2 billion rows table should be returned in 2 sec.”
The author isn’t asking about Oracle specifically but, irrespective of the platform, it’s a question that prompts some interesting thought experiments. Imagine creating a high-precision index on a table of about that size, and see where that thought leads you.
If the key is about 16 bytes long, you are using 8KB blocks, and you achieve a fairly typical 70% efficiency (space usage per block) on the index, then you can get about 200 entries per block (see comment 11 for an explanation). It’s only a rough estimate, and there are various differences between leaf block and branch block usage that make the figure arguable – but at present I’m only after a ball-park figure.
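As a sanity check, here’s a minimal sketch of that arithmetic in Python. The 28-byte entry size is my assumption (the 16-byte key plus roughly 12 bytes of rowid and per-row overhead), not a figure from the original discussion:

```python
# Back-of-envelope estimate of index entries per 8KB block.
block_size = 8 * 1024    # 8KB blocks
efficiency = 0.70        # typical space usage per block
entry_size = 16 + 12     # 16-byte key + assumed ~12 bytes rowid/overhead

entries_per_block = int(block_size * efficiency / entry_size)
print(entries_per_block)  # 204 - i.e. "about 200"
```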
At 200 entries per block and an 8KB block size, you get:

| Level    | Blocks    | Entries       | Size   |
|----------|-----------|---------------|--------|
| Root     | 1         | 200           | 8 KB   |
| Branch 1 | 200       | 40,000        | 1.6 MB |
| Branch 2 | 40,000    | 8,000,000     | 320 MB |
| Leaf     | 8,000,000 | 1,600,000,000 | 64 GB  |
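If you want to play with the numbers, here’s a rough sketch (Python, purely illustrative) that derives the tree shape from the fan-out and a target row count – and shows where the 0.5% figure quoted below comes from:

```python
import math

# Rough model of the index tree: fan-out of ~200 entries per block.
entries_per_block = 200
target_rows = 1_600_000_000

blocks = math.ceil(target_rows / entries_per_block)  # leaf blocks
levels = [blocks]
while blocks > 1:                  # add branch levels until we reach a root
    blocks = math.ceil(blocks / entries_per_block)
    levels.append(blocks)
levels.reverse()                   # root first: [1, 200, 40000, 8000000]

print(len(levels), "levels:", levels)
branch_blocks = sum(levels[:-1])   # root + two branch levels = 40,201 blocks
print(f"branch levels are {branch_blocks / sum(levels):.2%} of the index")
```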
So, for 1.6 billion rows in a table, you have an index which allows you to visit any one row after four steps down the index and a jump to the table.
The question is, how many disk reads is that likely to take? The whole index is 64GB, and that may be a little large to cache, especially since it’s likely to be competing with a few other indexes; but if this is supposed to be a “big system”, it would be reasonable to assume that the branch levels could be completely cached. (Note that the branch levels account for just 0.5% of the total size of this index – and that’s a fairly typical figure for B-tree indexes.) So a single row access would require three cache hits (root, branch 1, branch 2) followed by just two disk reads: the leaf block and the table block. (You might like to think very carefully about how the figures would change if you tried “tuning” the index tree structure by using a larger blocksize.)
On modern hardware we could expect to achieve about 100 I/Os per second per disk for “very random” reads – so our single row query, at two disk reads, should take no more than about 1/50th of a second.
But some indexes will not be unique – let’s assume (for the sake of the thought experiment) a fairly precise 24 rows per key value – so what will that do to the I/O? We still get about 200 key entries per leaf block, so it’s likely that we will read just one leaf block: the problem is now a question of how many table blocks we will read. It might be anything between 1 and 24 – and in the worst (and, in practice, most common) scenario it will be 24 reads – for a total of 25 read requests: one quarter of a second.
Even that one quarter of a second sounds good to the original poster – but the argument so far helps to identify two issues. First – the response time will be hugely dependent on both the precision of the index and the scattering of the table data. If each index probe hits 200 rows, and they are widely scattered, you read (probably) two index leaf blocks and 200 table blocks – note how the I/O is all about the table, not the index – for a total time of just over 2 seconds.
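To make the pattern explicit, here’s a trivial model of the response time. It’s a sketch under the assumptions already stated: the branch levels are fully cached, and every other block visited is a separate random read at about 100 I/Os per second:

```python
# Response time = random disk reads / reads-per-second.
# Assumes branch blocks are cached, so the reads are leaf blocks
# plus table blocks, each a separate ~10ms random read (100 IOPS).
def response_time(leaf_reads, table_reads, iops=100):
    return (leaf_reads + table_reads) / iops

print(response_time(1, 1))    # unique index:        0.02 s
print(response_time(1, 24))   # 24 scattered rows:   0.25 s
print(response_time(2, 200))  # 200 scattered rows:  2.02 s
```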
Secondly – if this is a big system, there are probably lots of users – that is, after all, one of the “environmental” factors that could cause 300GB of data to appear per month. So how many concurrent queries does the system have to face? At just 25 I/Os per query, we can manage a maximum of four queries per second before taking a single disk to the limit of its I/O capacity. Put another way – multiply your number of physical disk devices by four, and that’s the absolute maximum number of queries per second your system can cope with, as the sketch below shows. (And that’s assuming a fairly high precision of 25 random block reads per query, with no interfering I/O from inserts, updates, deletes, read-consistency, logging, archive logging, sorting, etc.)
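The same arithmetic gives that crude throughput ceiling; the disk count here is just a hypothetical example value:

```python
# Crude throughput ceiling: total random IOPS / reads per query.
disks = 16                # hypothetical number of physical disk devices
iops_per_disk = 100       # "very random" read rate assumed earlier
reads_per_query = 25      # 1 leaf block + 24 scattered table blocks

print(disks * iops_per_disk / reads_per_query)  # 64.0 queries per second
```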
When it comes down to the bottom line, the ability to deal with large data volumes and highly random queries has very little to do with the software you’re running – the first considerations are:
- Optimising the data packing and indexing
- Getting enough disk drives (i.e. lots)
- Getting enough local memory to cache lots of index blocks – ideally the minimum would be ALL the branch blocks
- Controlling the queries to minimise I/O
And just in case you’re thinking that a SAN with a massive cache would address these issues – it won’t.
SANs can move large amounts of data around very quickly – but they don’t work well with very large numbers of extremely random I/Os. Any cache benefit you might have got from the SAN has already been used by Oracle in caching the branch blocks of the indexes. What the SAN can give you is a write-cache benefit for the log writer (though see comment 7 for a pingback from Kevin Closson), the database writer, and any direct path writes and re-reads from sorting and hashing.