I received an email today containing the following question:
For packaged applications, like the Oracle EBS, some indexes tend to develop a high Clustering Factor over time, like the one shown below:
           BTREE LEAF    DISTINCT   CLUSTERING INDEX      TABLE    TABLE
INDEX NAME LEVEL BLOCKS  KEYS       FACTOR     NUM ROWS   NUM BLKS NUM ROWS
---------- ----- ------ ---------- ---------- ---------- -------- ----------
XXX_PK         3 778150 77,842,100 17,163,350 77,842,100  865,805 77,043,200
We have some queries that perform range scans on this index and run slower because they have to visit too many blocks from the table. How should one deal with this type of index?
This question raises a couple of important points that are worth reviewing.
As a starting detail: did you notice that the number of “rows” in the index was greater than the number of rows in the table? This can happen when you gather stats with a sample size of less than 100% and use the “cascade” option. Oracle often uses a larger sample on the index than the one specified for the table, with the type of result that you see above.
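A minimal sketch of how the anomaly can arise (the table name is hypothetical, the dbms_stats parameters are the standard ones): a small estimate_percent combined with cascade => true lets Oracle choose its own, often larger, sample for the indexes.

```sql
-- Hypothetical example: gather table stats at a 10% sample, cascading
-- to the indexes. Oracle may pick a larger sample for the index, so
-- the index "num rows" can come out higher than the table "num rows".
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'XXX_TABLE',   -- hypothetical name
                estimate_percent => 10,
                cascade          => true
        );
end;
/
```

If the discrepancy matters, one option is to gather the index stats separately with dbms_stats.gather_index_stats at an explicit estimate_percent of 100.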
This isn’t a big deal – unless the clustering_factor also exceeds the number of rows in the table, at which point you may find that queries that should “obviously” be using the primary key index start using a slightly less desirable non-unique index. This is a side-effect of the cost calculation which uses the expression selectivity[tb_sel] * clustering_factor when assessing the suitability of the index.
If your primary key clustering_factor is larger than the number of rows in the table then, for a single column index, it will also be greater than the number of distinct values for the column, and the expression selectivity[tb_sel] * clustering_factor will be greater than one for the predicate primary_key_col = constant. If this happens then another index may appear to have a lower cost than the primary key index.
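As a purely arithmetic illustration (the clustering_factor here is invented; the distinct-keys figure is taken from the listing above): if the clustering_factor had grown to 80,000,000 against 77,842,100 distinct keys, the table-visit component of the cost for primary_key_col = constant would exceed one.

```sql
-- Illustrative arithmetic only (80,000,000 is an invented value):
-- tb_sel = 1 / distinct_keys
-- table cost component = tb_sel * clustering_factor
select round((1 / 77842100) * 80000000, 4) as table_cost_component
from dual;
-- roughly 1.03, i.e. greater than 1 for a supposedly unique lookup
```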
Coming back to the original question though – what do you do when you have an index where the clustering_factor keeps growing? This question prompts an even more important one: is the clustering_factor telling the truth? There is the pattern of data scatter that really exists, there is the pattern that the optimizer thinks exists (and then there may be a pattern of data scatter that you would like to fool the optimizer into believing).
If your data is well clustered then the clustering_factor should be “small” and the optimizer has a higher probability of using the index. If the data is well clustered but, for some reason, the clustering_factor is “large” then the optimizer may well ignore the index when you know it would be sensible to use it.
If your data is very poorly clustered then the clustering_factor should be “large” and the optimizer is likely to use the index only in special cases (such as unique scan, index-only, or bitmap conversion paths). If your data is very poorly clustered but the clustering_factor is “small” – possibly because you’ve hacked it – then the optimizer may well use the index when it shouldn’t and perform an excessive amount of (logical) I/O.
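One way to sanity-check which extreme a recorded clustering_factor is near is to compare it with the table’s block and row counts – a value close to the block count suggests well-clustered data, a value close to the row count suggests badly scattered data. A sketch (the dictionary views and columns are standard; the table name is hypothetical):

```sql
-- clustering_factor near t.blocks   => data well clustered for this index
-- clustering_factor near t.num_rows => data badly scattered for this index
select
        i.index_name, i.clustering_factor,
        t.blocks, t.num_rows
from
        user_indexes i,
        user_tables  t
where
        t.table_name = i.table_name
and     i.table_name = 'CUSTOMER_HISTORY'   -- hypothetical table
;
```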
In this example, the clustering_factor is quite large, and the user is complaining that queries using index range scans are running slowly because Oracle is visiting too many blocks from the table.
This means there probably isn’t a problem with the clustering_factor – if you have to go to the table for the data, you have to go to the table for the data. Although the clustering factor may, in principle, be too small, the fact that we work harder and harder as time passes (i.e. as the data sizes grow) means the index has not been defined appropriately for the queries where it is being used.
As a simple example to demonstrate this point, consider a query involving customer data. We always ask for data for a customer for the last 4 weeks.
select
        hst.*
from
        customer_history hst
where
        hst.customer_id = 12345
and     hst.tx_date > trunc(sysdate) - 28
order by
        hst.tx_date desc
;
If we have an index just on the (customer_id) column, then as time passes Oracle has to visit more and more table blocks, discarding data that is too old only after it has visited the table.
If we have an index on (customer_id, tx_date), then the work done visiting the table will stay constant as time passes, because the volume of data we examine is independent of the size of the history we have generated.
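A sketch of the index under discussion, using the hypothetical names from the query above:

```sql
-- With customer_id leading and tx_date second, a range scan can stop
-- at the 28-day boundary inside the index, so the number of table
-- blocks visited depends only on the volume of recent data.
create index customer_history_i1 on customer_history(customer_id, tx_date);
```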
Of course, if we can recreate the table in our example as an index-cluster on (customer_id), we can minimise the number of excess blocks we visit (although we will still be visiting excessive numbers of rows) and this may be of some benefit.
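A hedged sketch of that alternative (all names, column definitions and the cluster SIZE are hypothetical guesses): an index cluster keyed on customer_id keeps each customer’s history rows physically close together.

```sql
-- Hypothetical sketch: rows sharing a customer_id are stored together,
-- so a range scan for one customer touches few distinct blocks.
create cluster customer_history_cl (customer_id number)
        size 1024;          -- space per cluster key value: a guess

create index customer_history_cl_i on cluster customer_history_cl;

create table customer_history (
        customer_id  number,
        tx_date      date,
        tx_details   varchar2(200)
)
cluster customer_history_cl (customer_id);
```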
Alternatively, if we can partition the table by (tx_date) our query will be able to use partition elimination to minimise the number of redundant table blocks we visit – provided the index on (customer_id) is a local index (although this may increase the number of index block visits to an undesirable level if we use a large number of partitions).
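A hedged sketch of this alternative arrangement (names, columns and partition boundaries are hypothetical): range-partition the history by tx_date so a 28-day query prunes old partitions, with a local index on customer_id for the scan within each surviving partition.

```sql
-- Hypothetical sketch: partition elimination on tx_date limits the
-- query to the most recent partition(s); the local index keeps the
-- customer_id scan confined to those partitions.
create table customer_history (
        customer_id  number,
        tx_date      date,
        tx_details   varchar2(200)
)
partition by range (tx_date) (
        partition p2009q1 values less than (date '2009-04-01'),
        partition p2009q2 values less than (date '2009-07-01'),
        partition pmax    values less than (maxvalue)
);

create index customer_history_i1 on customer_history(customer_id) local;
```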
In conclusion: the clustering_factor can easily be too large and not reflect the true pattern of data scatter – this can make the optimizer ignore the index when it should be using it. But if the clustering_factor really does reflect the data scatter, and your queries get slower because they are visiting increasing numbers of blocks in the table, then you probably need extra columns in the index to allow you to eliminate some of the visits to the table.