In the comments to a recent blog posting about log file syncs, Tony Hasler has produced a stunning observation about Oracle and ACID, in particular the D (durability) bit of transactions.
The observation can best be described with a demonstration (which I have run on versions from 8.1 to 11.2) involving three sessions, one of which has to be connected with sysdba privileges.
create table t1(n1 number); insert into t1 values(1); commit;
session 2 (sysdba)
oradebug setorapid NNN -- where NNN is v$process.pid for lgwr oradebug suspend -- suspend lgwr processing
update t1 set n1 = 2; commit; -- the session hangs, waiting for log file sync
select n1 from t1;
Is session 3 going to report the value 1 or 2 ?
Since lgwr has not been able to write the redo generated by session 1 you might think that the answer would 1; but the session has already done all the work normally associated with a commit, (despite any indications you might see in the manual that the data is not committed until it is written) and the actual answer to my question is 2.
Now go back to session 2 and issue:
shutdown abort startup
When the database starts up, is a query against t1 going to return 1 or 2 ?
It ought to return 2, because after instance recovery you ought to be able to repeat the query of session 3 and get the same answer. But you will see 1 – lgwr didn’t write the transaction’s commit record before the crash, so even though session 3 saw the result of the transaction, the transaction wasn’t recoverable. (In passing, session 3 could have been an incoming query from a remote database.)
When I first saw this behaviour my first thought was that you would have to be very unlucky to see a real problem because the window of opportunity for an error was very small, but after making a few comments about the behaviour at the NoCOUG conference earlier on today I found myself at lunchtime sitting next to someone who described a problem that they had had a few years earlier when their archive destination had become full and a complex piece of synchronisation between two databases had failed in a strange way.
If lgwr has to stop for any reason, that doesn’t stop sessions running on for a little while until the log buffer is full – and if those sessions commit (each one will hang after it commits, of course) you could end up with the results of a number of committed, but unrecoverable, transactions being captured by a remote database and then you’re in trouble if the local database crashes – especially if the local programs decided to re-run the programs to generate the missing data … and that’s the problem my lunchtime companion may have been seeing. (I think I may soon be doing a few experiments to see if I can reproduce the problem – one of the other ideas I’ve assumed to be true is that you can’t get more data into the log buffer if the log writer has initiated a log file switch that has not yet completed, maybe that’s wrong as well.)
Footnote: See comment 2- the problem is arguably one of isolation rather than durability (though I have finally come down on the side of the problem being durability).
Update (Sunday 10:30 am – GMT/UTC + 1)
There have been some interesting comments and suggestions in response to this posting, and I think there is room for some elucidation about how things work and speculation about how they could be changed.
When you modify data blocks you generate redo change vectors (for the blocks you want to modify, and for undo blocks) and these redo change vectors are combined to form a redo change record. Ignoring a small complexity introduced in 10g, the changes are applied to the blocks in memory as your session copies a redo record into the log buffer; the order (ignoring 10g again) is: copy record into buffer, apply change to undo block, apply change to target block. So changes you make are immediately applied to the current blocks (again ignoring 10g).
When you commit a transaction you update the transaction table slot that identifies your transaction to show that the transaction is committed – this is a change to an undo segment header block so it generates a redo change vector. THe handling of this vector is completely standard – your session creates a redo record from it and puts the record into the log buffer – which means it is applied to the undo block at that moment, which means everyone can now see the effects of your transaction. AFTER the record is in the redo buffer your session will post lgwr to write the log buffer to the log file and goes into a “log file sync” wait. But this means (a) your transaction is visible to everyone else before it is recoverable (durable) and (b) you’re the only session that thinks the transaction hasn’t yet “really committed”.
Ideally your session needs to do something that puts the “commit record” into log buffer without letting anyone else see the undo segment header change, and only lets them see the change after your session has been posted by lgwr (and even that leaves a window for inconsistent recovery if the instance crashes after the write has happened but before your session has been posted – see my reply to Martin Berger). A change of this type, though, would probably require a massive rewrite of the logging mechanism because it would make the commit record handling different from every other record. There’s also the side effect this might have on the “group commit” mechanism and consequent scalability.
Prompted by an observation by Sokrates about “other” sessions waiting on log file sync in 11.2, and comments from Mark Farnham about a “write guarantee”, I’ve been trying to think through the consequences of a possible strategy for doing a “two-phase” commit with some “delayed logging” on the undo segment headers. The steps are as follows:
- Session commits and updates the undo segment header with ‘commit waiting on lgwr’ status rather than ‘committed’, and posts lgwr.
- Any session looking at a block affected by that transaction now has to deal with indeterminate data – so waits on the transaction slot (with enq: TX in mode 4) rather than assuming a rollback or commit. This session, however, could be allowed to check where the log writer had got to and recognise that the relevant write had been completed (and this is analogous to existing code that runs when a session times out on a log file sync). Note that the underlying strategy here is one that is already in place to cater for distributed transactions caught between the PREPARE and COMMIT phases. The waiting session could also post pmon if it thought that the driving session had failed and was unable to receive the post back from lgwr (and this is analogous to the current behaviour when a process is trying to get a latch that seems to have been held too long).
- When the original session is posted by lgwr to continue it could modify the transaction table slot without generating redo – or possibly generate it but not issue a log file sync for it – thus allowing the second session to release its TX (mode 4) and continue. (This would require some special handling during recovery for transactions that had been written to lgwr but not confirmed back to their sessions.)
- There is a consistency problem, though, because commits are sequenced by SCN, and at present lgwr simply posts all sessions that it can see should be cleared by its most recent write – they need not resume in exactly the right order, which means transactions could go from “waiting on lgwr” to “committed” in the wrong order. This means we need another two-step approach: lgwr to write process ids to a linked list in commit (i.e. log buffer) order, with a new background process created to walk the list and post each waiter in turn, waiting (or watching) for it to acknowledge before going on to the next entry in the list.
Feel free to poke holes.