Oracle Scratchpad

December 29, 2006

Meaningless Keys

Filed under: Infrastructure,Performance — Jonathan Lewis @ 6:07 pm GMT Dec 29,2006

I have been known to criticize “meaningless key” syndrome from time to time. I’m not going to go through the arguments here, but I’d like to tell you about a lovely example of how badly things can go wrong when you follow that route.

Table4 is at the end of a sequence of referential integrity constraints; in fact, rather than the usual 1-to-many relationships, it’s almost a 1-to-1 all the way from table1 to table2 to table3 to table4 with just a few cases where the data volume multiplies up.

Area_code is part of the (undeclared) primary key of table1, which should have cascaded all the way to table4. But everything is done with meaningless keys generated by Oracle sequences.

So when a query like “select * from table4 where area_code = ‘EMEA’ and …” is needed, it has to be written as a four-table join, because the area_code never made it to table4.
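
In outline the rewrite looks something like this – just a sketch of the shape of the problem, since the join-key column names shown here are invented for illustration:

select  t4.*
from    table1  t1,
        table2  t2,
        table3  t3,
        table4  t4
where   t1.area_code = 'EMEA'           -- the predicate that should have been on table4
and     t2.t1_id     = t1.id            -- meaningless-key joins; column names invented
and     t3.t2_id     = t2.id
and     t4.t3_id     = t3.id
;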

As a result, the query has to pick up 500,000 rows from table4 and then join through three tables to eliminate 400,000 of them.

Although the near-sequential arrival of the data meant that the required data set would be scattered across a relatively small number of blocks, and offered a terrific caching benefit to a nested loop strategy, the optimizer doesn’t detect such time-based caching benefits, so it chose a series of very expensive tablescans and hash joins [see footnote] to produce the result. (This looks like an ideal case where it would be nice to use a “localised” setting for the optimizer_index_cost_adj.)
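
For example (and this is just a sketch of the idea, with an arbitrary value), the parameter can be set at session level for the job that runs this query, rather than system-wide:

alter session set optimizer_index_cost_adj = 25;        -- arbitrary illustrative value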

Performance was not very good. Bad luck !

Pragmatic solution – put a stack of hints into the SQL.
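
Something along these lines – the hints and aliases are only a sketch, since the real column and index names aren’t shown here – to push the optimizer into the nested loop path it wouldn’t otherwise cost sensibly:

select  /*+
            leading(t1 t2 t3 t4)
            use_nl(t2) use_nl(t3) use_nl(t4)
            index(t2) index(t3) index(t4)
        */
        t4.*
from    table1  t1,
        table2  t2,
        table3  t3,
        table4  t4
where   t1.area_code = 'EMEA'
and     t2.t1_id     = t1.id
and     t3.t2_id     = t2.id
and     t4.t3_id     = t3.id
;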

Alternative pragmatic solution – write a piece of SQL to set the clustering_factor on the three critical indexes back to a reasonable value – but that needs some careful research as it might have nasty side-effects in other bits of code, and it needs to be injected into the right place in the existing routine that generates statistics.
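
As a sketch of the mechanism only (the owner, index name, and value are invented for illustration), dbms_stats.set_index_stats() allows the clustering_factor to be overwritten after the normal statistics collection has run:

begin
        dbms_stats.set_index_stats(
                ownname => 'APP_OWNER',         -- invented owner
                indname => 'T4_FK_IDX',         -- invented index name
                clstfct => 50000                -- the "reasonable" value you have researched
        );
end;
/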

“Ideal” solution – redesign the system and reload a couple of terabytes of data !!! Perhaps not so ideal.

Footnote (April 2009):

Originally my comment about the choice of path had been:  Worse still, because of the arrival times of the data, the “meaningless primary key” indexes have a misleading clustering_factor and look like rubbish to the optimizer – so the execution path uses three consecutive hash joins, doing tablescans on three very large tables.

But following an email request for a test case to demonstrate what I meant by this comment, I realised that the comment was irrelevant in this case.  The cost for a join into the primary key ignores the clustering_factor, and in this case the join into the foreign key was “nearly one to one”, so the impact of the clustering_factor on the cost was irrelevant.

The big difference between my opinion of the sensible execution path and the optimizer’s opinion was that I could see the time-based caching benefit – and the optimizer’s nested loop costing did not allow for it.

19 Comments »

  1. Another way: a materialized view with query rewrite enabled.

    It could give some benefit for some queries.

    What do you think about that?

    Comment by Antonio — December 29, 2006 @ 9:17 pm GMT Dec 29,2006 | Reply

  2. Ouch this is a dagger in my heart. Some of my developers and I are at odds over this. Things start out with a MARKET table, then a CUSTOMER table that has a market_id/cust_id compound key. Every other table then adds on more fields. I had to firmly say “no” when I caught wind that they were planning to create a table with a 17-field primary key.

    Where do you draw the line?

    It’s as if they are completely anti-sequence (although we do use sequences to generate customer numbers and things like order numbers).

    On the 17-field PK table, I basically told them that they don’t even need a PK. They admitted that this table will rarely be queried anyway (don’t even ask). But they have in their minds a rule that every table MUST have a PK. On a lot of these bulk-loaded tables, unique queries are very infrequent; rather, we often try to query for all records for a given month. So I’d just as soon drop the 11-field PK on another table (the last 5 fields are NEVER queried, they were just added to guarantee a unique record for the PK), and just have a non-unique index on market, customer, year, month (yes, year/mon are separate number fields).

    Is this a dangerous idea to spread? Will I be shooting myself in the foot?

    Comment by Don Seiler — December 29, 2006 @ 9:39 pm GMT Dec 29,2006 | Reply

  3. So what happens when you try to update the customer’s phone number? Oops…

    This looks like a job for a bitmapped join index to me.

    Comment by Jason Bucata — December 29, 2006 @ 10:09 pm GMT Dec 29,2006 | Reply

  4. Jonathan,

    Am I right in thinking that the four tables are parent-child, with table 1 at the parent end and table 4 at the child end? Maybe I’m misinterpreting the situation, but it sounds like the majority of the problem lies in normalisation, although use of a meaningful key might eliminate one of the joins. Or am I confused?

    Comment by David Aldridge — December 30, 2006 @ 12:48 am GMT Dec 30,2006 | Reply

    Antonio, Not really viable in this particular case, but certainly another option to consider. A simpler option (in this case) would have been to use triggers – but we would have to be sure that no code used ‘select *’ or did inserts without referencing a list of columns. There’s always a trade-off between overheads, time to implement, and risk of error when fixing systems.

    Jason, this was constantly changing data in an OLTP system, so no bitmapped indexes allowed – see the articles referenced in this posting.

    David, That’s about right. But the problem wasn’t in the normalisation; it appeared because the real data expressing the foreign keys had been changed to meaningless numbers and therefore was not available. Conceptually, the four PKs should have been (there’s a DDL sketch after the two lists):
    T1 (a)
    T2 (a,b)
    T3 (a,b,c)
    T4 (a,b,c,d)

    They became:
    T1 (a) pseudo-key –> (p)
    T2 (p,b) pseudo-key –> (q)
    T3 (q,c) pseudo-key –> (r)
    T4 (r,d) pseudo-key –> (s)
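
    As a sketch in DDL (datatypes invented, only the key columns shown, and the pt% names used just to keep the two versions apart), the difference is:

    -- meaningful keys: the natural key cascades down the chain
    create table t1 (a varchar2(10) not null,
                     constraint t1_pk primary key (a));
    create table t2 (a varchar2(10) not null, b varchar2(10) not null,
                     constraint t2_pk primary key (a,b),
                     constraint t2_fk foreign key (a) references t1);
    -- ... and so on down to t4 (a,b,c,d)

    -- pseudo-keys: each level exposes only its generated key to the next
    create table pt1 (a varchar2(10) not null, p number not null,
                      constraint pt1_pk primary key (p),
                      constraint pt1_uk unique (a));
    create table pt2 (p number not null, b varchar2(10) not null, q number not null,
                      constraint pt2_pk primary key (q),
                      constraint pt2_uk unique (p,b),
                      constraint pt2_fk foreign key (p) references pt1);
    -- ... and so on down to pt4 (r,d) with pseudo-key (s)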

    Don, logically every ENTITY needs a primary key, and the simplest physical model turns one entity into one table, so every table gets a primary key constraint. However, I will compromise on the way from the logical to the physical – and the appearance of a 17-column PK would make me consider hacking the design for mechanical (performance) reasons – knowing that (a) I would probably have to introduce some otherwise redundant physical structures to ensure data correctness and (b) some parts of the system would have to do more work because the structure had been compromised to make other parts of the system more efficient.
    In passing, a CUSTOMER table with a MARKET as part of the primary key sounds a little suspect – in general I would expect to see Markets, Customers, and Customers_in_markets.

    Comment by Jonathan Lewis — December 30, 2006 @ 7:09 am GMT Dec 30,2006 | Reply

  6. Jonathan,

    Does this mean that you are an advocate for meaningful keys?

    During my database design lessons I acquired a strong feeling that meaningful keys are better, at least for the human mind, because the data model is easier to understand. Oracle is no human though; it works with ones and zeros. In the past, when I was still a developer, most DBAs told me that joining on numeric values is faster than joining on alphanumeric values. If that is the case, then a meaningless key can improve performance.

    Now I’m only a starting DBA. My main business is handling application servers, but the database is getting a bigger part of my day-to-day work. Since you always tell the world that proof is all you need, I created two small scenarios (a sketch of the setup follows the list):

    1. all table names in t_tables, with a meaningless primary key (just a sequence); all_tab_columns in t_columns with a meaningless primary key, a t_table_id referencing t_tables, and the column names.
    2. all table names in p_tables (pk = table_name) and all_tab_columns in p_columns with the table_name as primary key. Next I created a foreign key referencing p_tables.
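
    Roughly like this – the names and datatypes here are illustrative only, restricted to a single schema so that table_name can be unique:

    create sequence t_seq;

    create table t_tables (
            id          number        constraint t_tables_pk primary key,
            table_name  varchar2(30)  not null
    );

    create table t_columns (
            id          number        constraint t_columns_pk primary key,
            t_table_id  number        constraint t_columns_fk references t_tables,
            column_name varchar2(30)  not null
    );

    insert into t_tables
    select t_seq.nextval, table_name from all_tables where owner = user;

    insert into t_columns
    select rownum, t.id, c.column_name
    from   all_tab_columns c, t_tables t
    where  c.owner      = user
    and    c.table_name = t.table_name;

    create table p_tables (
            table_name  varchar2(30)  constraint p_tables_pk primary key
    );

    create table p_columns (
            table_name  varchar2(30)  constraint p_columns_fk references p_tables,
            column_name varchar2(30)  not null,
            constraint p_columns_pk primary key (table_name, column_name)
    );

    insert into p_tables
    select table_name from all_tables where owner = user;

    insert into p_columns
    select c.table_name, c.column_name
    from   all_tab_columns c, p_tables p
    where  c.owner      = user
    and    c.table_name = p.table_name;

    commit;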

    Next I calculated statistics by using dbms_stats.gather_schema_stats(ownname=>'TESTUSER', cascade=>TRUE);

    Now I started testing. I only have 461 rows for both tables, and I made an oops (I forgot to exclude the current schema). Also the test set is a bit light, which might explain my funny results:

    Joining on the meaningless key I got the following results when querying for table_name and column_name (using set timing on and issuing the query; next I called the same query with set autotrace traceonly explain statistics).


    time elapsed: 00:00:00.61

    Execution Plan
    ----------------------------------------------------------
    0   SELECT STATEMENT Optimizer=ALL_ROWS (Cost=6 Card=461 Bytes=14752)
    1 0   MERGE JOIN (Cost=6 Card=461 Bytes=14752)
    2 1     TABLE ACCESS (BY INDEX ROWID) OF 'T_TABLES' (TABLE) (Cost=2 Card=72 Bytes=1440)
    3 2       INDEX (FULL SCAN) OF 'T_TABLES_PK' (INDEX (UNIQUE)) (Cost=1 Card=72)
    4 1     SORT (JOIN) (Cost=4 Card=461 Bytes=5532)
    5 4       TABLE ACCESS (FULL) OF 'T_COLUMNS' (TABLE) (Cost=3 Card=461 Bytes=5532)

    Statistics
    ----------------------------------------------------------
    0 recursive calls
    0 db block gets
    56 consistent gets
    0 physical reads
    0 redo size
    11912 bytes sent via SQL*Net to client
    842 bytes received via SQL*Net from client
    32 SQL*Net roundtrips to/from client
    1 sorts (memory)
    0 sorts (disk)
    461 rows processed

    Next I used meaningful keys:


    time elapsed: 00:00:00.56

    Execution Plan
    ----------------------------------------------------------
    0   SELECT STATEMENT Optimizer=ALL_ROWS (Cost=4 Card=465 Bytes=19530)
    1 0   NESTED LOOPS (Cost=4 Card=465 Bytes=19530)
    2 1     TABLE ACCESS (FULL) OF 'P_COLUMNS' (TABLE) (Cost=3 Card=465 Bytes=11625)
    3 1     INDEX (UNIQUE SCAN) OF 'P_TABLES_PK' (INDEX (UNIQUE)) (Cost=0 Card=1 Bytes=17)

    Statistics
    ----------------------------------------------------------
    1 recursive calls
    0 db block gets
    67 consistent gets
    0 physical reads
    0 redo size
    16479 bytes sent via SQL*Net to client
    842 bytes received via SQL*Net from client
    32 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
    465 rows processed

    If I interpret this correctly, it seems that joining on a meaningful alphanumeric key is faster (in time) and has a better execution plan (for this scenario). This might be different in other scenarios, but does this also prove that joining on alphanumeric values is just as fast as joining on numeric values (and therefore justify the use of meaningful primary keys)?

    Comment by Jacco Landlust — December 30, 2006 @ 3:02 pm GMT Dec 30,2006 | Reply

  7. “…it’s almost a 1-to-1…”

    Of course, there’s no such thing as “almost” 1-to-1 – it’s either 1:1 or 1:m… :)

    (but I understand what you meant – the data presently in the system is “almost” 1:1, even though this is irrelevant to the logical model)

    Comment by Jeff Kemp — December 31, 2006 @ 5:12 am GMT Dec 31,2006 | Reply

  8. Jeff, your rebuke and clarification are both correct and timely. Thank you.

    Jacco,
    I advocate starting with a proper design and meaningful keys; but I will compromise when converting the logical model to the physical.

    Your example does demonstrate that meaningful keys can be faster than meaningless keys, but what it has really done is demonstrate the point that with meaningful keys the optimizer can find more efficient execution plans. Note how the plan changed. (I suspect you meant the meaningful PK of the p_columns table was (table_name, column_name)).

    In general, I would expect meaningless keys to be very slightly faster than meaningful keys, because the CPU cost of checking 4 or 5 bytes for a numeric column would be less than the CPU cost of checking the first 20 or 30 bytes of a meaningful key. But then you get into the arguments about smaller indexes, better use of the buffer cache, needing a proper PK index anyway, etc. etc. etc.

    Comment by Jonathan Lewis — December 31, 2006 @ 7:53 am GMT Dec 31,2006 | Reply

  9. Jonathan, allow me to first make the “this design was inherited” disclaimer (our company purchased this product from a company, rather than be forced to upgrade against their will; then they hired us in-house to keep it running). Of course we’ve done very little to fix things since. Yes for some reason the customer key is compound on market id and cust id. And yes, there are 16 different sequences for cust id, one for each market. I’d really like to sit down and “talk” to the original developers.

    Comment by Don Seiler — January 1, 2007 @ 6:27 pm GMT Jan 1,2007 | Reply

  10. The problem is really with the optimizer, not the meaningful keys.
    If it did a complete job it would recognize the fact that there is an RI relationship between the tables and optimize for that, instead of faffing around on the nature of indexes and their clustering factors.

    Meaningful keys are very nice in theory but cause incredible space overhead in any practical application with data volumes beyond a few thousand rows and any keys longer than a coupla bytes. And as previously noted, better never have to modify a value in a key: the cascading is atrocious…

    Before anyone mentions the ubiquitous “disk is cheap” mantra: oh no, it most definitely is not.

    Fact is: there is not a SINGLE successful commercial application design in the market that has made use of meaningful keys. Not one! They all, without exception, use the other kind. And for a very good reason.

    Comment by Noons — January 1, 2007 @ 11:54 pm GMT Jan 1,2007 | Reply

  11. Noons,
    RI means “for every B there must be an A”. Why should that dictate the access method used by the optimizer to find the appropriate “a” for each of several million examples of “b” ? It doesn’t; it is necessary to examine (among other things) the data scatter to decide whether to use a high-precision method millions of times, or a brute-force method once.

    I’m not inclined to believe your “Fact is:” statement, having seen a few applications that work reasonably well and don’t follow the “meaningless key” strategy – but it could, of course, depend on your definition of “successful” and “commercial”. I recall a conversation I had with the head of marketing of the world’s most popular application in field X (name withheld to avoid easy identification of the product). It went like this:

    “How come this is the most successful application in the business, when the design is appalling and the performance is abysmal?”
    “Yes, but you should see the opposition.”

    But even if your statement were true, it would still be based on a logical error. Even if every currently successful commercial application used meaningless keys that wouldn’t mean they were successful because they used meaningless keys – think Betamax; and being “commercially successful” doesn’t mean they should be taken as role-models for an in-house development.

    The “disk is cheap” mantra is, I agree, almost invariably a hand-waving exercise designed to distract attention from bad design and worse programming. But in my experience it’s not usually the use of real keys that creates the demand for more space.

    Comment by Jonathan Lewis — January 2, 2007 @ 8:55 am GMT Jan 2,2007 | Reply

  12. Everything has its place.
    For fully normalized OLTP data models I like meaningful keys. But when I was forced to use them in a data mart where my joins were best performed with star transformation, I had a problem. Linking the dimensions with multi-column meaningful keys just did not work.

    Comment by Maarten Vinkhuyzen — January 3, 2007 @ 12:25 pm GMT Jan 3,2007 | Reply

  13. […] Meaningless Keys keys, more keys and nothing but keys… Primary Keyvil, Part I (checkout parts 2 and 3 also) Composite keys are evil Surrogate Key != Primary Key […]

    Pingback by vorpal.cc/blog » Primary, Composite, and Surrogate Keys — March 25, 2007 @ 2:05 am BST Mar 25,2007 | Reply

  14. […] 5th, 2007 Jonathan Lewis wrote an article about Meaningless Keys. It’s an old article, but I just read it today and I found it very inspiring. So immediately I […]

    Pingback by Meaningless Foreign Keys « I’m just a simple DBA on a complex production system — September 5, 2007 @ 1:36 am BST Sep 5,2007 | Reply

  15. I can’t resist throwing in my three hap’orth.

    I resolutely come down on the side of meaningless keys generated by sequences, but not for technical reasons.

    Having inherited a total disaster of a db, which was a migration from an equally disastrous sql server db, I had to deal with the beast. Documentation was non-existent, and so I arbitrarily added an ‘_ID’ PK to every table, across the board. Because I knew that I didn’t know, what this achieved was to definitively eliminate uncertainty about keys, forever. Time and again, this has paid dividends, when we have unearthed information which shows that the supposedly ‘natural’ keys were based on flawed assumptions and designs. If you’re in uncertain undocumented hell, always go for surrogacy.

    Second, our users have just decided that they must be able to split the db into six, work on each section, and then bolt it back together again. Using the standard trick of juggling sequence starts and increments to avoid PK/FK collisions means that I can do this easily (see the sketch below). I’d love to see someone code that up with natural keys.
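
    (For anyone who hasn’t seen the trick, a sketch – each of the six copies of the database creates the same sequence with a different starting offset and an increment of six, so the generated keys can never collide:)

    create sequence pk_seq start with 1 increment by 6;   -- copy 1
    create sequence pk_seq start with 2 increment by 6;   -- copy 2
    -- ... and so on, up to ...
    create sequence pk_seq start with 6 increment by 6;   -- copy 6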

    Once again, the best protection against uncertainty and lunatic user requirements is meaningless drivel pk’s. Since I’ve never ever seen a fully documented database, can anyone tell me what it’s like to work on one?

    Comment by Tharg — October 16, 2007 @ 7:25 pm BST Oct 16,2007 | Reply

  16. Noons, you’re wrong when you say:
    Fact is: there is not a SINGLE successful commercial application design in the market that has made use of meaningful keys. Not one! They all, without exception, use the other kind. And for a very good reason.
    SAP uses meaningful keys and I THINK that they are successful.

    Comment by Pierre Cote — November 28, 2007 @ 3:33 pm GMT Nov 28,2007 | Reply

  17. Pierre / All –

    That is something that I have been wondering about recently – how other packages handle this. My context is Siebel. Although it has a concept of user keys – and generally enforces uniqueness with unique indexes – it uses a field called row_id on every table. row_id is used for all foreign keys on other tables and has its uniqueness enforced with unique indexes as well. (If you’re following along in a data model, you will actually see a second field here – conflict_id – that only comes into play with remote database synchronization – ignore it for this discussion.)

    I believe that this is generally because what the package designer may have assumed to be unique is not necessarily the case for all customers that will use the software. Even the user key columns frequently are populated with what I would consider a meaningless key – such as a customer identifier.

    Indeed – I offer that much of what folks may believe is meaningful data is in fact no different than the “meaningless keys” that Jonathan was referring to. When we want to identify a computer monitor or other expensive piece of equipment, we have to have some way to describe it – so we invent something called a serial number. I suppose that we could also identify it by its model #, product plant, production line, production date, and a number representing that it was the N’th one of that model produced at that plant on that line on that date – but no one does that, do they? Everyone assigns a serial number and puts it on a sticker on the back. How is that really any different from the meaningless keys being discussed here? The only difference is that we tell people about it and use it in business processes. Same thing for customer ids…

    So – with regards to SAP and other packages (this is not to be argumentative – I genuinely want to know):
    1. Are you saying that they don’t use an artificial thing like row_id in Siebel?
    2. Do they use things like serial #’s and customer id’s that I suggest are really “meaningless”?
    3. How do they handle data that does not have a generally accepted id – but a very extensive (or possibly non-existent!) key otherwise – such as address?

    Siebel handles the user key for address in an “interesting” manner that sometimes leads to issues – it concatenates several fields, truncates to 100 characters, never updates that field and hopes for the best… Larger implementations have had to address this issue. It is common for billing and other type of applications of a national scale to generate a “house ID” or something similar – again something that I liken to a meaningless key.

    By the way – Jonathan’s condemnation of having to traverse levels in the data model doesn’t apply in very many cases in practice with Siebel. Where it does, they have usually already denormalized the data.

    Comment by Joe Coffey — November 29, 2007 @ 12:23 am GMT Nov 29,2007 | Reply

  18. And the solution of changing the clustering_factor (described above as the alternative pragmatic solution) becomes useless, being effectively superseded by setting the optimizer_index_cost_adj parameter.

    Comment by Rudi Kristanto — April 27, 2009 @ 12:39 am BST Apr 27,2009 | Reply

  19. […] Meaningless Keys keys, more keys and nothing but keys… Primary Keyvil, Part I (checkout parts 2 and 3 also) Composite keys are evil Surrogate Key != Primary Key […]

    Pingback by Primary, Composite, and Surrogate Keys | David Hogue's Blog — October 27, 2011 @ 11:52 pm BST Oct 27,2011 | Reply

