A recent thread on the OTN database forum asks the question:
“… since redo also has past and current information why can’t redo logs be used to retrieve that information? Why is undo required when redo already has all that information.”
The thread has generated some interesting replies – but to a large extent they describe how the undo and redo work rather than explaining why the designers at Oracle Corp. chose to implement undo and redo the way they did. Since I’m sitting in an airport (Zurich – where the coffee bar in the business lounge has a very large bowl of Lindt chocolate squares available) waiting for a plane I thought I’d use the time to share my thoughts on the why.
Redo gives us two things: recoverability and efficiency. If we record a description of every change we make to a database then we can take an old copy of the database and re-apply the list of changes to bring it up to date – that’s recoverability. If we make that list the first point of protection for our data changes we need only keep writing a small volume of information to disc, rather than writing each data block to disc the moment we change it – that’s an aid to efficiency. (Of course, although focusing all the changes into a small space gives us I/O efficiency, anything that focuses a lot of activity into a small space also gives us a point of contention).
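To make the recoverability idea concrete, here's a deliberately tiny sketch (this is an illustration, not Oracle's recovery code): we treat the database as a handful of named blocks, record every change as a minimal "change vector", then re-apply the list to an old copy to roll it forward.

```python
# A toy replay: an old copy of the "database" plus an ordered list of
# change vectors is enough to reconstruct the up-to-date state.
# Names and structures here are illustrative only.

old_copy = {"block_1": 10, "block_2": 20}

redo_log = [
    ("block_1", 11),   # (block, new value) - a minimal "change vector"
    ("block_2", 21),
    ("block_1", 12),
]

recovered = dict(old_copy)          # start from the old backup copy
for block, new_value in redo_log:
    recovered[block] = new_value    # re-apply each change in order

print(recovered)   # {'block_1': 12, 'block_2': 21}
```

Applying the vectors in log order is the whole trick: as long as the list is complete and ordered, any old copy converges on the current state.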
Undo (in some form) has to exist if we want read-consistency. No-one else should be allowed to see our data changes until we commit them and our changes shouldn’t even be visible to queries that started running before we committed them, so we have to keep older versions of the data somewhere. This is where things get a little counter-intuitive (but very clever) with Oracle.
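To see what "keep older versions of the data somewhere" buys us, here's a toy multi-version sketch (my own illustration, nothing like Oracle's actual block-level mechanism): every update preserves the outgoing version tagged with an SCN, and a query reconstructs the row as of its own start SCN.

```python
# A toy model of read consistency: each update keeps the prior version
# (the "undo" information), and a query sees the newest version committed
# at or before the SCN at which the query started.

class VersionedRow:
    def __init__(self, value, scn):
        self.current = (value, scn)   # (value, SCN that created it)
        self.history = []             # older versions, newest first

    def update(self, value, scn):
        # keep the outgoing version - this is the undo information
        self.history.insert(0, self.current)
        self.current = (value, scn)

    def as_of(self, query_scn):
        # read consistency: newest version no newer than the query's start
        for value, scn in [self.current] + self.history:
            if scn <= query_scn:
                return value
        # the old version is gone - the moral equivalent of ORA-01555
        raise LookupError("snapshot too old")

row = VersionedRow(42, scn=100)
row.update(43, scn=105)
print(row.as_of(104))   # a query that started at SCN 104 still sees 42
print(row.as_of(106))   # a query that started at SCN 106 sees 43
```

The point of the sketch is simply that *some* store of old versions has to exist; the interesting question is where you put it, which is where Oracle gets clever.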
Take a close look at the following redo record; it’s a record generated in the middle of a transaction and it’s the redo that Oracle generated as we changed the fourth column of a particular row in a table from the value 42 to the value 43.
REDO RECORD - Thread:1 RBA: 0x000056.00000028.0014 LEN: 0x0128 VLD: 0x01
SCN: 0x0000.0028efae SUBSCN: 3 02/09/2010 14:59:40
CHANGE #1 TYP:0 CLS:20 AFN:2 DBA:0x0080065f OBJ:4294967295 SCN:0x0000.0028efae SEQ: 24 OP:5.1
ktudb redo: siz: 112 spc: 5414 flg: 0x0022 seq: 0x0112 rec: 0x19
            xid: 0x0002.028.000009af
ktubu redo: slt: 40 rci: 24 opc: 11.1 objn: 49428 objd: 49428 tsn: 5
Undo type: Regular undo       Undo type: Last buffer split: No
Tablespace Undo: No
0x00000000
KDO undo record:
KTB Redo
op: 0x02 ver: 0x01
op: C uba: 0x0080065f.0112.18
Array Update of 1 rows:
tabn: 0 slot: 41(0x29) flag: 0x2c lock: 0 ckix: 11    <----- row 42 in block
ncol: 4 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
  xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0180000a hdba: 0x01800009
itli: 2 ispac: 0 maxfr: 4863
vect = 9
col 3: [ 2] c1 2b    <----- Change fourth column back to 42
CHANGE #2 TYP:0 CLS: 1 AFN:6 DBA:0x0180000a OBJ:49428 SCN:0x0000.0028efae SEQ: 41 OP:11.19
KTB Redo
op: 0x02 ver: 0x01
op: C uba: 0x0080065f.0112.19
Array Update of 1 rows:
tabn: 0 slot: 41(0x29) flag: 0x2c lock: 2 ckix: 11    <----- row 42 in block
ncol: 4 nnew: 1 size: 0
KDO Op code: 21 row dependencies Disabled
  xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x0180000a hdba: 0x01800009
itli: 2 ispac: 0 maxfr: 4863
vect = 9
col 3: [ 2] c1 2c    <----- Change fourth column to 43
You’ll notice that the redo record consists of two “change vectors” (CHANGE #1 and CHANGE #2), and I’ve labelled the two lines showing that the change was from 42 to 43. In principle, the Oracle redo log really does contain all the information we need to generate old versions of data. So why does Oracle introduce the undo tablespace for read-consistency when the redo log seems to have all the required information?
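As an aside, the byte pairs in the two annotated lines are the values 42 and 43 in Oracle's internal number format: for a positive number the first byte is the base-100 exponent biased by 193, and each remaining byte is a base-100 digit plus one. Here's a simplified decoder (restricted to small positive integers; it doesn't handle negatives, nulls or the full format) that shows why c1 2b is 42 and c1 2c is 43:

```python
def decode_positive_oracle_number(byts):
    # First byte: base-100 exponent biased by 193 (0xc1 means exponent 0).
    # Remaining bytes: base-100 digits, each stored as digit + 1.
    # Simplified: positive integers only.
    exponent = byts[0] - 193
    digits = [b - 1 for b in byts[1:]]
    value = 0
    for i, d in enumerate(digits):
        value += d * 100 ** (exponent - i)
    return value

print(decode_positive_oracle_number([0xc1, 0x2b]))   # 42
print(decode_positive_oracle_number([0xc1, 0x2c]))   # 43
```

You can check this for yourself in SQL*Plus with `select dump(42) from dual;`, which reports the same two bytes (193,43) in decimal.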
To answer this question we need to think multi-user. (It’s very easy to forget, when you’re trying to understand some feature, that Oracle is a multi-user system but if you ignore that fact you forget to think about what’s going on from the point of view of other sessions, and also forget to think about the impact of other sessions).
Thought experiment number 1: I am executing a long-running transaction and lots of other sessions are executing little queries and short transactions. The other sessions will have to see data as it was before I changed it. If they have to visit the redo to see this data the online redo log file will have to be arbitrarily large to ensure that it can hold all the necessary information – and that’s information required to log the forward and backward changes from all the sessions for all the transactions that have executed since I started my transaction. What’s supposed to happen if the redo log files fill?
Thought experiment number 2: If my long running transaction fails and has to roll back then the backward changes I need will be intermingled with all the forward and backward changes that everyone else has made since my transaction started, so my session will probably end up doing a lot of random I/O to find the backward changes. Of course we could make this task more efficient (and address number 1 a little better) by making each transaction write its own log file – but then we increase the complexity of file handling (opening and closing files for every start and stop of a transaction), and of applying the redo log files in the right order for recovery.
Thought experiment number 3: Currently Oracle does not tidy up every changed block when a transaction commits. If I commit a transaction but don’t tidy up block X, how is the next session that visits that block going to know that my transaction committed, and when it committed? If all we used was the redo log, the next session would have to go to the point in the redo log that had recorded the change to block X, and walk forward until it found the commit record (or got to the end of the redo log, or found a suitable SCN in the redo – and thus could determine that my transaction had not committed, or had not committed “in time”). But then every redo record for a given transaction would need a “forward pointer” to the next redo record – which means every time you created a redo record you’d have to go backwards to update your previous redo record; and that’s not going to be efficient. Plan B, then, is to have a mechanism for keeping an indefinitely long list of transactions and their commit SCNs somewhere in the database – and that “indefinitely” is not a nice idea.
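The two "redo only" options in that thought experiment can be sketched in a few lines (a toy illustration only – the record layout and names are invented): option A walks the redo stream forward from the change looking for the commit; option B keeps an ever-growing lookup table of transactions and their commit SCNs.

```python
# A toy redo stream: each record is (SCN, transaction id, operation).
redo_stream = [
    {"scn": 101, "xid": "T1", "op": "change", "block": "X"},
    {"scn": 102, "xid": "T2", "op": "change", "block": "Y"},
    {"scn": 103, "xid": "T2", "op": "commit"},
    {"scn": 104, "xid": "T1", "op": "commit"},
]

def commit_scn_by_scanning(stream, xid, from_index):
    # Option A: walk forward from the change record until we find the
    # commit - the cost grows with everyone else's intervening activity.
    for rec in stream[from_index:]:
        if rec["xid"] == xid and rec["op"] == "commit":
            return rec["scn"]
    return None   # not committed, as far as this stream goes

def build_commit_table(stream):
    # Option B: an indefinitely growing catalogue of xid -> commit SCN.
    return {r["xid"]: r["scn"] for r in stream if r["op"] == "commit"}

print(commit_scn_by_scanning(redo_stream, "T1", 0))   # 104
print(build_commit_table(redo_stream)["T1"])          # 104
```

Both options "work" on four records; the thought experiment is about what happens to the scan length (option A) or the table size (option B) after a few million transactions.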
And so it goes on – you may find some activity where using the redo log as the source for undo would be perfectly acceptable, but when you start to consider all the options of a multi-user system with mixed activity you find that some types of activity are likely to run into, or cause, problems.
Oracle’s solution is to keep the “usable” backward changes, the undo, somewhere completely different from the redo. Rather than create a third type of storage structure they’ve chosen to put it into the database using ordinary data blocks. Think about how this addresses the issues I’ve mentioned so far.
Because undo blocks are just database blocks they are protected by the redo log and can be recovered as the database is recovered without introducing an extra code path to recovery. The “backward” change vector we saw recorded in the redo log record was actually the forward change vector for the undo block – we’ve introduced an overhead in the redo log because we write undo twice (but that’s what we do with every data change anyway – write it once on the block and once in the redo log). The undo blocks also benefit from the standard LRU mechanisms – if an undo block is being used a lot it stays in memory, and that’s nice because recently generated undo is more likely to be undo we want to use for read-consistency.
Because we keep the undo streams inside the database we don’t have to worry about sessions constantly opening and closing files as they start and stop transactions – we just have to set up some sort of catalog that lists the current active transactions and points to the place in the undo tablespace where that transaction is currently putting its undo records. (This is what the transaction table in the undo segment header is about (see glossary)).
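The "catalogue" idea is simple enough to sketch (again, a toy model – the names are mine, not Oracle's internal structures): each active transaction has one entry that is updated in place to point at its latest undo record, and the undo records themselves chain backwards from there.

```python
# A toy transaction table: one entry per active transaction, pointing to
# the location of its most recent undo record. Values below echo the
# xid and uba seen in the redo dump earlier, purely for flavour.

active_transactions = {}   # xid -> (undo block address, record number)

def write_undo(xid, undo_dba, rec_no):
    # each new undo record simply updates this transaction's entry in place
    active_transactions[xid] = (undo_dba, rec_no)

write_undo("0x0002.028.000009af", "0x0080065f", 0x18)
write_undo("0x0002.028.000009af", "0x0080065f", 0x19)

# A session that needs to roll this transaction back (or apply its undo
# for read-consistency) starts from the latest entry and follows the
# backward chain of undo records from there.
print(active_transactions["0x0002.028.000009af"])
```

The key property is that the catalogue stays small – one slot per *active* transaction – rather than growing with the total history of the database.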
Because we use database blocks as the unit of storage and keep the undo for different transactions separate, when I write my undo my records are nicely clustered – I acquire and fill one block at a time – so if I have to roll my transaction back a single undo block read will get me lots of undo records, and I minimise the random I/O I have to do to read back the undo stream.
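A rough back-of-envelope model makes the clustering benefit obvious (the records-per-block figure is purely illustrative, not an Oracle constant):

```python
# Toy I/O model: if my undo records are packed into my own undo blocks,
# rolling back N changes reads roughly N / records_per_block blocks;
# if they were interleaved with everyone else's, worst case is one
# (probably random) read per record.

RECORDS_PER_BLOCK = 50   # illustrative figure only

def blocks_to_read(n_records, clustered):
    if clustered:
        # packed together: one block read yields many of my undo records
        return -(-n_records // RECORDS_PER_BLOCK)   # ceiling division
    # scattered among everyone else's changes: up to one read each
    return n_records

print(blocks_to_read(1000, clustered=True))    # 20
print(blocks_to_read(1000, clustered=False))   # 1000
```

Fifty-fold fewer block visits for a rollback is the sort of difference that decides a design.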
If I start a very long running transaction while other sessions are also generating undo, there is a degree of independence between my undo and their undo. In a compromise between “all the undo goes into one file” and “every transaction gets its own undo file”, Oracle has allowed for multiple undo segments – i.e. the equivalent of a relatively small number of shared files – which means that a single long-running transaction is restricted to a single undo segment and all the other transactions can keep re-using (over-writing) any remaining space in the undo tablespace much more effectively. (Of course we can still end up with “file full” errors – but the strategy makes it possible for the database to survive concurrent activity much longer on the same amount of disk space.)
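The compromise can be sketched as a small fixed pool of segments with transactions bound to them as they start (a toy model – I'm using naive round-robin here for simplicity; Oracle's actual segment selection is more sophisticated than this):

```python
# A toy pool of undo segments: many concurrent transactions share a small
# fixed set of segments, so one long-running transaction ties up only the
# segment it is bound to, leaving the rest re-usable by everyone else.

import itertools

class UndoSegmentPool:
    def __init__(self, n_segments):
        self.cycle = itertools.cycle(range(n_segments))
        self.binding = {}   # xid -> segment number

    def begin(self, xid):
        # naive round-robin binding, for illustration only
        self.binding[xid] = next(self.cycle)
        return self.binding[xid]

pool = UndoSegmentPool(4)
print([pool.begin(f"T{i}") for i in range(6)])   # [0, 1, 2, 3, 0, 1]
```

With this shape, a transaction that runs for hours pins space in exactly one of the four segments; the short transactions cycling through the other three can keep over-writing expired undo without ever colliding with it.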