When I arrived in Edinburgh for the UKOUG Scotland conference a little while ago Thomas Presslie, one of the organisers and chairman of the committee, asked me if I’d sign up on the “unconference” timetable to give a ten-minute talk on something. So I decided to use Hybrid Columnar Compression to make a general point about choosing and testing features. For those of you who missed this excellent conference, here’s a brief note of what I said.
The first point that I had to stress was that I was going to describe a total disaster that appeared when I was doing some work with Exadata, but what I was doing was a deliberate attempt to cause a problem, and the outcome was the sort of thing I had been expecting. The problem I had wasn’t an Exadata problem, it was an investigative test. A critical feature of using Exadata is that it changes the “balance of power” between using indexes and doing tablescans and makes it much harder (at least initially) to decide which columns should be indexed.
I created my disaster with a simple query that used an index when it should have been using a tablescan – so not a new problem, just a variation of the commonest optimizer question (in reverse) that gets asked on the Internet. I had created a data set of about 32 million rows (actually 225)using archive high compression, and then I had created a bitmap index on one of the columns as the first step in setting up some tests of star transformations. Having got the first index in place, though, I decided to run the following aggregate query before building any more indexes:
select max(padding) from t1 where n_128k between 1000 and 1999 ;
As you might guess from the column name my index is on a column that has (roughly) 128,000 distinct values, and my query is asking for data relating to 1,000 of those values. I generated the column, as I often do, using the mod() function so the rows were evenly spread (in terms both of quantity and scattering) throughout the table – and I’m after one row in 128 for a total of about 256,00 rows out of my 32 million.
The optimizer chose to use an indexed access path for this query and took about 2,700 CPU seconds to complete (that’s about 1/100 CPU second per row). When hinted to do a tablescan the query completed in 22 (CPU) seconds.
For what it was doing, the machine was working really quickly when it used the index – but it shouldn’t have used the index. The compression was so good that I had roughly 32,000 rows packed into compression units of 32 blocks each – and the nature of my query (combined with the nature of my data) required Oracle to walk a 32-block CU and decompress one column of that CU for each of the 256,000 rows that I wanted.
The optimizer code has no idea about how much CPU it’s going to use on decompression – all it could see was that it needed quite a lot of rows, but they were packed into a fairly small space. (The fact that this was a bitmap index, with it’s 80/20 packing assumption, didn’t help either – a B*tree index using an estimated clustering_factor might have made a better choice.)
WARNING: I have heard several stories about people thinking that you don’t need ANY indexes if you’re using Exadata – even if you’re using it for a datawarehouse system. That suggestion is WRONG. You will probably decide that you need fewer indexes, you may even manage to get to the point where your system performs sufficiently well with none at all, but you should not assume that a move to Exadata automatically means a move to no indexes. Hybrid Columnar Compression makes an enormous difference to the desirability of specific indexes (as can Storage indexes), but you still need to calculate the pros and cons of each index individually.