I have been known to criticize “meaningless key” syndrome from time to time. I’m not going to go through the arguments here, but I’d like to tell you about a lovely example of how badly things can go wrong when you follow that route.
Table4 is at the end of a sequence of referential integrity constraints; in fact, rather than the usual 1-to-many relationships, it’s almost 1-to-1 all the way from table1 to table2 to table3 to table4, with just a few cases where the data volume multiplies up.
Area_code is part of the (undeclared) primary key of table1 – a key that should have cascaded all the way down to table4. But everything is done with meaningless keys generated by Oracle sequences.
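To make the structure concrete, here’s a minimal sketch of the kind of schema being described – the shape matches the description above, but all the column names and datatypes are invented for illustration:

create table table1 (
        id1             number          primary key,    -- meaningless key from a sequence
        area_code       varchar2(8)     not null        -- part of the real (undeclared) key
);

create table table2 (
        id2             number          primary key,    -- meaningless key from a sequence
        id1             number          not null references table1
);

create table table3 (
        id3             number          primary key,
        id2             number          not null references table2
);

create table table4 (
        id4             number          primary key,
        id3             number          not null references table3
);

Note that area_code exists only in table1 – each child table carries nothing but the sequence-generated key of its parent.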
So when a query like “select * from table4 where area_code = ‘EMEA’ and …” is needed, it has to be written as a four-table join, because the area_code never made it to table4.
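With the invented column names from the sketch above, that means something like this (the predicates elided in the original stay elided here):

select
        t4.*
from
        table1  t1,
        table2  t2,
        table3  t3,
        table4  t4
where
        t1.area_code = 'EMEA'
and     t2.id1       = t1.id1
and     t3.id2       = t2.id2
and     t4.id3       = t3.id3
-- and  (the rest of the original predicates)
;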
As a result, the query has to pick up 500,000 rows from table4 and then join through three tables to eliminate 400,000 of them. [See footnote] Although the near-sequential arrival of the data meant that the required data set would be scattered across a relatively small number of blocks – offering a terrific caching benefit to a nested loop strategy – the optimizer doesn’t detect such time-based caching benefits, so it chose a series of very expensive tablescans and hash joins to produce the result. (This looks like an ideal case where it would be nice to use a “localised” setting for optimizer_index_cost_adj.)
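There is no genuinely “localised” version of the parameter, of course; the nearest approximations are a session-level setting or, in more recent versions of Oracle, the opt_param() hint at the statement level – the value 25 below is purely illustrative:

-- session level: affects every statement optimised in this session
alter session set optimizer_index_cost_adj = 25;

-- statement level: about as “localised” as it gets
select  /*+ opt_param('optimizer_index_cost_adj' 25) */
        count(*)
from    table4
;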
Performance is not very good. Bad luck!
Pragmatic solution – put a stack of hints into the SQL.
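For example – the join order, the choice of nested loops, and the index names here are all hypothetical; in practice you’d derive them from the execution plan you actually wanted:

select  /*+
                leading(t1 t2 t3 t4)
                use_nl(t2) use_nl(t3) use_nl(t4)
                index(t2 t2_i1) index(t3 t3_i2) index(t4 t4_i3)
        */
        t4.*
from
        table1  t1,
        table2  t2,
        table3  t3,
        table4  t4
where
        t1.area_code = 'EMEA'
and     t2.id1       = t1.id1
and     t3.id2       = t2.id2
and     t4.id3       = t3.id3
;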
Alternative pragmatic solution – write a piece of SQL to set the clustering_factor on the three critical indexes back to a reasonable value – but that needs some careful research as it might have nasty side-effects in other bits of code, and it needs to be injected into the right place in the existing routine that generates statistics.
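As a sketch of the sort of call involved (the index name and the target value are purely hypothetical, and the call would have to be repeated for each of the three critical indexes):

begin
        dbms_stats.set_index_stats(
                ownname         => user,
                indname         => 'T4_I3',
                clstfct         => 10000,       -- a “reasonable” clustering_factor
                no_invalidate   => false
        );
end;
/

The set_xxx_stats procedures leave any statistic you don’t supply unchanged, which is why only the clustering_factor needs to be mentioned.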
“Ideal” solution – redesign the system and reload a couple of terabytes of data!!! Perhaps not.
Footnote (April 2009):
Originally my comment about the choice of path had been: “Worse still, because of the arrival times of the data, the ‘meaningless primary key’ indexes have a misleading clustering_factor and look like rubbish to the optimizer – so the execution path uses three consecutive hash joins, doing tablescans on three very large tables.”
But following an email request for a test case to demonstrate what I meant by this comment, I realised that the comment was irrelevant in this case: the cost of a join into the primary key ignores the clustering_factor, and since the join into the foreign key was “nearly one to one”, the clustering_factor had no real impact on the cost anyway.
The big difference between my opinion of the sensible execution path and the optimizer’s opinion was that I could see the time-based caching benefit – and the optimizer’s nested loop costing did not allow for it.