Here’s an interesting question from the OTN database forum:
“If I delete 90% of the rows from a table which has a few indexes, without rebuildling or coalescing indexes afterwards, will this improve the performance of index range scans ?”
The thing that makes it interesting is the scope it gives you for imagining reasons why the performance won’t change, or might get better, or could get worse. So how about it – can you think of an argument for each of the three possibilities ?
Update – 15th June
The first thing you need to do when you see a question like this is start asking for more information – as the first comment did.
- How big is the table – percentage are generally useless unless you also know the scale, and it’s possible that pure size might make a difference.
- What’s the pattern of data in the table – that probably matters
- Is the table partitioned – and does the delete pattern vary with partition
- What’s the pattern of data across the indexes, that could make a lot of difference
- Are any of the indexes partitioned (whether or not the table is)
- Are all the indexes plain B-tree indexes – or bitmaps, or context, or spatial; and function-based indexes, reverse key indexes …
- Are you deleting the data in lots of tiny batches with commits, or one big delete
- And yes, as asked in the first comment – are you going to collect statistics before running queries, because that could make a big difference.
- Timing is quite important – are you thinking only in the very short term (delayed block cleanout effects), or in the longer, stabilised term.
- Are you assuming the data you delete is data that was not previously returned in the queries – but would that always matter anyway
- And why are you asking about range scans, not general purpose queries – that rather avoids the problem of how query plans can change as stats change.
Maybe we could make some assumptions – though that can have dangerous consequences. We might guess that the driving idea behind this question is that there’s a load of old data that isn’t being queried any more and if we simply delete it our queries will, at worst, still show the same performance. So I’m going to take the line that (a) we’re expecting any queries to return the same result and (b) we’re going to be inserting new data over time.
Just picking up a few of the comments that have appeared so far we can see how many details you might have to consider to think this question properly.
Comment 1 says: “how about collecting stats” – it seems likely that new stats would appear soon (90% deletion is more than a 10% change, so if auto stats collection hasn’t been disabled …). What if, for some indexes, 100% of the data has been deleted from 90% of the blocks, then the leaf_block count of an index will change by a factor of 10 (and that’s effectively a bug, by the way): a possible consequence is that Oracle will decide that an index fast full scan is appropriate in cases where it previously took a different path.
Comment 3 picks up the case that if you had an index based on columns where the deleted data had null values in the relevant columns, then the 90% table deletion wouldn’t affect that index at all. This may sound a little extreme – but as a minor variation on the theme, the idea of creating function-based indexes which hold values for “recent” data is quote well-known and very effective; the argument might be totally appropriate.
Comment 5 makes a fairly subtle and important point. If I used to have a range scan that examined 1,000 rows and returned 100 and the 90% of rows I deleted was exactly the 900 rows I didn’t return then there are two outcomes. If I had to visit the table 1,000 times to identify the 100 rows then the query will now be quicker; if I had been able to identify the 100 rows by examining 1,000 index entries then the query performance will be pretty much unchanged.
Comment 5 also has a rather nice thought on stopkeys (rownum <= N) – but it’s similar to Richard Foote’s min() example in comment 4 – I think it would only apply if the result were going to change.
Comment 9 makes an interesting point. If you delete a lot of data from a table you have no idea (until you look at the data patterns carefully, or until after you’ve collected the stats) of how the clustering_factor of the index will change. It will probably drop – though technically it could stay the same – but how far it drops compared to the change in the number of rows left in the table could make a big difference to the “selectivity * clustering_factor” bit of the index range scan calculation. Bear in mind that if your queries are returning the same result before and after the delete then your selectivities must have gone up by a factor of 10 because you’re returning the same volume of data from a total volume that has decreased by a factor of 10. (I’m assuming that the optimizer gets its stats right when I say this, of course).
Comment 6 brings up an important feature of bitmap indexes. When you delete a row from a table you delete the corresponding row from a b-tree index (eventually), but you update a bitmap index chunk, which means generating a new bitmap index chunk with one bit changed. So deleting 90% of the table data, however you did it, could result in a doubling of every bitmap index on the table. (Results will vary with version of Oracle and the strategy used to delete the data)
Several comments picked up the issue of “delayed block cleanout” that’s likely to have an impact for a while after the big delete. There might be other intermittent effects later on, depending where new data goes and how many leaf blocks were completely emptied by the delete – it is possible for some odd locking waits to appear as Oracle tries to find an empty block that can legally be moved for the split.
Personally, and ignoring the bitmap index threat, I think it’s safe to say that if your indexes are very well designed (so that eliminates most systems I’ve seen), then most queries will probably perform almost as efficiently after the delete (and delayed block cleanout) as they did before - provided the optimizer still picks the same execution plan: but the moment you collect up to date statistics you may see some staggering changes (in either direction) for some queries. (And if you don’t collect up to date statistics at some point you will also seem some staggering changes as Oracle starts bring in “out of range” adjustments to costs.
Bottom line: If you’re going to delete 90% of the data in a large table then you’ve got to think careful about how to minimise the side effects.