There’s a current question on the OTN database forum as follows (with a little cosmetic adjustment):
I have a request for 3 indices as shown below. Does the 1st index suffice for 2 and 3? Do I need all 3?
- create index idx_atm_em_eff_ver_current_del on atm_xxx_salary (employee_key, effective_dt, salary_version_number, is_current, is_deleted);
- create index idx_atm_em_ver_current on atm_xxx_salary (employee_key, salary_version_number, is_current);
- create index .idx_atm_sal_current_del on atm_xxx_salary (employee_key, is_deleted, is_current);
In the absence of any information about the data and the application the correct answer is: “How could we possibly tell?”
On the other hand there’s plenty of scope for intelligent speculation, and that’s an important skill to practise because when we’re faced with a large number of options and very little information we need to be able to make best-guess choices about which ones are most likely to be worth the effort of pursuing in detail. So if we have to make some guesses about this table and the set of indexes shown, are there any reasonable guesses we might make.
I’ve highlighted the table name and leading column for the first index. The table seems to about salary and the leading column seems to identify an employee. In fact we see that all three indexes start with the employee_key and that may be what prompted the original question. Previous (“real-world”) experience tells me that employees are, generally, paid a salary and that salaries are likely to change (usually upwards) over time, and I note that another column in one of these indexes is effective_dt (date ?), and a third column (appearing in two of the indexes) is is_current.
This looks like a table of employee salaries recording their current and historic salaries, engineered with a little redundant information to make it easy to find the current salary. (Perhaps there’s a view of current_salary defined as is_current = ‘Y’ and is_deleted = ‘N’.)
It’s harder to speculate with any confidence on the columns is_deleted and salary_version_number; why would a salary row be marked as deleted – is this something that happens when an employee leaves or an employee is deleted (or, following the pattern, has their is_deleted flag set to ‘Y’); why does a salary have a version number – does the table contain it’s own audit trail of errors and corrections, perhaps a correction is effected by marking the incorrect entry as deleted and incrementing its version number to generate the version number for the correct entry. Possibly the notional primary key of the table is (employee_key, effective_dt, is_deleted, salary_version_number).
The level of complexity surrounding these two columns could send further speculation in completely the wrong direction, but let’s follow the line that these two columns see very little action – let’s assume that most of the data is not “deleted” and virtually none of the data needs “versioning”. How does this assumption help us with the original question.
The largest employer in the world is the America Department of Defence with 3.2 million employees (following by the People’s Liberation Army of China with only 2.3 million employees), so an “employees” table is not really likely to be very big. How often does an employee have a salary review and change ? Would once per year be a reasonable figure to pluck from the air ? How many employess stay at the same company for 40 years – how many rows per employee would you end up with, and how scattered would they be through the salary table ?
Under any reasonable estimate it seems likely that if you created the first index (5 columns) then all the salary rows for a given employee are likely to be contained in a single leaf block, so if all the searches were driven by employee then that single index would allow exactly the correct set of table rows to be identified from one index leaf block access plus a little extra CPU.
Of course it’s possible that, with different circumstances, the size and clustering factor of the first index would be so much greater than the size and clustering factors of the other two that a query that would use one of the smaller indexes won’t use the larger index -but in this case the most significant contributor to the optimizer’s cost is likely to be the clustering_factor and given our assumption of the slow appearance over time of the new salaries for an employee the clustering factor of all three indexes is likely to be the same (probably very similar to the number of rows in the salary table).
Having got this far, it’s worth considering whether or not the salary table should actually be an index-organized table – it looks like an obvious candidate; how many other columns are there likely to be in a salary table ? Of course it’s worth thinking about other queries that might access a salary table without reference to the employees table at that point, perhaps a secondary index on (is_current, employee_key) might be appropriate, but in the absence of any other information we’ve reached the point where speculation needs to be backed up by some hard facts.
I wouldn’t guarantee that the first index makes the other two indexes redundant but it seems highly likely that it should and it’s probably worth spending some time looking at the requirements and numbers a little more closely – especially if you’re the US DoD or the Chinese People’s Liberation Army.