Redo

August 19, 2011

Filed under: Infrastructure,Oracle,redo — Jonathan Lewis @ 2:58 am BST Aug 19,2011

In the comments to a recent blog posting about log file syncs, Tony Hasler has produced a stunning observation about Oracle and ACID, in particular the D (durability) bit of transactions.

The observation can best be described with a demonstration (which I have run on versions from 8.1 to 11.2) involving three sessions, one of which has to be connected with sysdba privileges.

Session 1

create table t1(n1 number);
insert into t1 values(1);
commit;

session 2 (sysdba)

oradebug setorapid NNN  -- where NNN is v$process.pid for lgwr
oradebug suspend  -- suspend lgwr processing

session 1

update t1 set n1 = 2;
commit;    -- the session hangs, waiting for log file sync

session 3

select n1 from t1;

Is session 3 going to report the value 1 or 2 ?

Since lgwr has not been able to write the redo generated by session 1 you might think that the answer would 1; but the session has already done all the work normally associated with a commit, (despite any indications you might see in the manual that the data is not committed until the log is written) and the actual answer to my question is 2.

Now go back to session 2 and issue:

shutdown abort
startup

When the database starts up, is a query against t1 going to return 1 or 2 ?

It ought to return 2, because after instance recovery you ought to be able to repeat the query of session 3 and get the same answer. But you will see 1 – lgwr didn’t write the transaction’s commit record before the crash, so even though session 3 saw the result of the transaction, the transaction wasn’t recoverable. (In passing, session 3 could have been an incoming query from a remote database.)

When I first saw this behaviour my first thought was that you would have to be very unlucky to see a real problem because the window of opportunity for an error was very small, but after making a few comments about the behaviour at a NoCOUG conference earlier on today I found myself at lunchtime sitting next to someone who described a problem that they had had a few years earlier when their archive destination had become full and a complex piece of synchronisation between two databases had failed in a strange way.

If lgwr has to stop for any reason that doesn’t stop sessions running on for a little while until the log buffer is full – and if those sessions commit (each one will hang after it commits, of course) you could end up with the results of a number of committed but unrecoverable transactions being captured by a remote database and then you’re in trouble if the local database crashes – especially if the local programs re-execute to generate the missing data … and that’s the problem my lunchtime companion may have been seeing. (I think I may soon be doing a few experiments to see if I can reproduce the problem – one of the other ideas I’ve assumed to be true is that you can’t get more data into the log buffer if the log writer has initiated a log file switch that has not yet completed, maybe that’s wrong as well.)

Footnote: See comment 2– the problem is arguably one of isolation rather than durability (though I have finally come down on the side of the problem being durability).

Update (Sunday 10:30 am – GMT/UTC + 1)

There have been some interesting comments and suggestions in response to this posting, and I think there is room for some elucidation about how things work and speculation about how they could be changed.

When you modify data blocks you generate redo change vectors (for the blocks you want to modify and for associated undo blocks) and these redo change vectors are combined to form a redo record. Ignoring a small complexity introduced in 10g, the changes are applied to the blocks in memory as your session copies a redo record into the log buffer; the order (ignoring 10g again) is: copy record into buffer, apply change to undo block, apply change to data block. So changes you make are immediately applied to the current blocks (again ignoring 10g).

When you commit a transaction you update the transaction table slot (in the undo segment header) that identifies your transaction to show that the transaction is committed – this is a change to an undo segment header block so it generates a redo change vector. The handling of this vector is completely standard – your session creates a redo record from it and puts the record into the log buffer – which means it is applied to the undo segment header at that moment, which means everyone can now see the effects of your transaction.

After the record is in the log buffer your session will post lgwr to write the log buffer to the log file and goes into a “log file sync” wait. But this means (a) your transaction is visible to everyone else before it is recoverable (durable) and (b) you’re the only session that thinks the transaction hasn’t yet “really committed”.

Ideally your session needs to do something that puts the “commit record” into log buffer without letting anyone else see the undo segment header change, and only lets them see the change after your session has been posted by lgwr (and even that leaves a window for inconsistent recovery if the instance crashes after the write has happened but before your session has been posted – see my reply to Martin Berger). A change of this type, though, would probably require a massive rewrite of the logging mechanism because it would make the commit record handling different from every other redo record. There’s also the side effect this might have on the “group commit” mechanism and associated scalability.

Prompted by an observation by Sokrates about “other” sessions waiting on log file sync in 11.2, and comments from Mark Farnham about a “write guarantee”, I’ve been trying to think through the consequences of a possible strategy for doing a “two-phase” commit with some “delayed logging” on the undo segment headers. The steps are as follows:

Session commits and updates the undo segment header with ‘commit waiting on lgwr’ status rather than ‘committed’, and posts lgwr.
Any session looking at a block affected by that transaction now has to deal with indeterminate data – so waits on the transaction slot (with enq: TX in mode 4) rather than assuming a rollback or commit. This session, however, could be allowed to check where the log writer had got to and recognise that the relevant write had been completed (and this is analogous to existing code that runs when a session times out on a log file sync). Note that the underlying strategy here is one that is already in place to cater for distributed transactions caught between the PREPARE and COMMIT phases. The waiting session could also post pmon if it thought that the driving session had failed and was unable to receive the post back from lgwr (and this is analogous to the current behaviour when a process is trying to get a latch that seems to have been held too long).
When the original session is posted by lgwr to continue it could modify the transaction table slot without generating redo – or possibly generate it but not issue a log file sync for it – thus allowing the second session to release its TX (mode 4) and continue. (This would require some special handling during recovery for transactions that had been written to lgwr but not confirmed back to their sessions.)
There is a consistency problem, though, because commits are sequenced by SCN, and at present lgwr simply posts all sessions that it can see should be cleared by its most recent write – they need not resume in exactly the right order, which means transactions could go from “waiting on lgwr” to “committed” in the wrong order. This means we need another two-step approach: lgwr to write process ids to a linked list in commit (i.e. log buffer) order, with a new background process created to walk the list and post each waiter in turn, waiting (or watching) for it to acknowledge before going on to the next entry in the list.

Feel free to poke holes in the strategy; you can be sure it needs a lot of refinement, after all it took a group of clever people at Oracle some time to come up with the whole undo/redo/read-consistency strategy, and I’ve only spent a couple of hours thinking about.

Comments (84)

84 Comments »

Weird conditions… The potential for other databases reading the affected one becoming corrupt is there. But no other local changes derived from reading this “wrong” data would be stored anyway, would they? Once the lgwr stops, any intervening changes until the db stops are rolled back. Or did I miss something?
I won der what would happen to a dsitributed update transaction…

Comment by Noons — August 19, 2011 @ 3:56 am BST Aug 19,2011 | Reply
- Not just “other databases”… Just about everything that queries the data that is in this doubtfull state, e.g. call your broker to check if that order you just made on the web made it through or not, beacause after you hit “Place order” nothing happened (hanging commit). The broker will check through another DB session and acknowledge that the order was placed, while in fact it wasn’t… You can come up with lots of scenarious that are no good.
  
  Comment by Marcus Mönnig — August 19, 2011 @ 8:40 am BST Aug 19,2011 | Reply
- Noons,
  
  No dependent changes in the local database could become permanent, so the database would be self-consistent after recovery – which is partly why I thought the problem had only a small window of opportunity for causing problems. However, as I said, an incoming query from a remote database will see the committed data – and that query could be “insert as select” – so another database could record a dependency on unrecoverable committed data.
  
  Comment by Jonathan Lewis — August 19, 2011 @ 10:00 am BST Aug 19,2011 | Reply
one question: did the session 1 return from commit with success return code or hang ?

if session commit just hanged then ok, it’s isolation problem, not that of durability. crash recovery worked just fine.

if the session RETURNED from commit without throwing error then … it’s durability issue in it’s worst.

Comment by lnd — August 19, 2011 @ 5:59 am BST Aug 19,2011 | Reply
- Ind,
  
  I see your point, and I think I agree. If you define “committed” to mean the session itself has to know that the change has become durable as well as public then this is. indeed, an isolation problem rather than a durability problem,
  
  Comment by Jonathan Lewis — August 19, 2011 @ 10:07 am BST Aug 19,2011 | Reply
  - yes, exactly. The commit has not completed. It hanged. Normal way of things for the system is to rollback such changes which is what happened after the db bounce.
    
    but the problem is that OTHER sessions see changes which are NOT commited. That’s weird. Some bits and bytes controlling snapshot based isolation have failed, in my opinion.
    
    Comment by lnd — August 19, 2011 @ 11:10 am BST Aug 19,2011 | Reply
    - “…but the problem is that OTHER sessions see changes which are NOT commited. That’s weird….”
      
      absolutely
      
      So, I don’t understand Jonathan’s statement “… the correct answer to my question is 2 ..”
      The correct answer should be “1”, since the commit is not complete – a “normal” commit implies that the session itself knows that the change has become durable
      
      Comment by Sokrates — August 19, 2011 @ 12:02 pm BST Aug 19,2011
    - skip that statement of Jonathan. It might not be 100% accurate. The real thing is that he has raised the issue and has a test scenariou.
      It’s pretty clear it is a bug and it is pretty clear who has to worry about it. The only questions are the scope, impact, possible workarrounds for the bug.
      However, I believe this is deep in the kernel functionality of the oracle and the fix can only be provided by the oracle.
      
      Comment by lnd — August 19, 2011 @ 12:18 pm BST Aug 19,2011
    - Sokrates,
      
      So, I don’t understand Jonathan’s statement “… the correct answer to my question is 2 ..”
      That’s a case of semantic confusion caused by a badly posed question.
      “Would you expect …?” has no wrong answers because your expectations and mine need not match, and neither need match ultimate outcomes.
      
      I’ve reworded the text to clarify (I hope).
      
      Comment by Jonathan Lewis — August 19, 2011 @ 12:23 pm BST Aug 19,2011
  - ok, my head was really on the wrong track at the end of the day. The simple thing to say about commit is this:
    
    if commit statement returns without errors then transaction is surely commited. Commited transaction must be durable (including crash recoverability.)
    
    if commit statement hangs then transaction may or may not be commited. We simply do not know. One way to check this is to query the data in another session.
    Which tells us the transaction is commited.
    
    The other check is to perform the crash recovery. Which tells us the transaction effectively was not commited. Ooops.
    
    Yet one more check is to mine redo logs. I bet the commit marker wouldn’t be there. Because lgwr was suspended.
    
    Then the crash recovery did the right thing – it rollbacked the transaction.
    
    Comment by laimisnd (lnd) — August 19, 2011 @ 6:58 pm BST Aug 19,2011 | Reply
  - Ind,
    I’ve changed my mind again after re-reading the definitions of isolation and durability.
    
    Isolation means that transactions should behave in such a way that they do not interfere with each other – i.e. as if they were serializable.
    
    Durability requires a committed transaction to persist beyond failures.
    
    What happens here is that if session 3 starts before session 1 issues a commit it will constantly see the data as it was before the start of the transaction run by session 1; if session 3 starts after session 1 issues a commit it will constantly see the data as it was at the moment session 1 issued the commit. There is no mix of before and after images.
    
    So, notwithstanding the fact that there is room to argue about what what we ought to mean by a “commit”, it is the issuance of a “commit;” that the database software takes as a commit, so the transactions are isolated, but the first transaction is not durable.
    
    Comment by Jonathan Lewis — August 21, 2011 @ 9:09 pm BST Aug 21,2011 | Reply
you write
”
…
commit; — the session hangs, waiting for log file sync
….
….
… but the session has committed …
….
”

These two statements don’t compile for me – could you please clarify ?
Which commit are you talking about ?

commit write wait immediate; — is it ?

Comment by Sokrates — August 19, 2011 @ 7:45 am BST Aug 19,2011 | Reply
- Sokrates,
  
  “but the session has committed” – the session has done all the work associated with a “commit;” call.
  I will edit the posting to clarify.
  
  As the code shows, I used the traditional commit – if I had been using “commit write wait immediate” (or had changed the defaults) I would have said so.
  
  Comment by Jonathan Lewis — August 19, 2011 @ 10:11 am BST Aug 19,2011 | Reply
now I managed to reproduce the issue – with one change:
Session 1 does “update t1 set n1 = 2;” before Session 2 does “oradebug suspend”

Then Session1 issues a commit [write wait immediate], which hangs

Then Session 3 “select n1 from t1;”
gives a “2”
THIS IS A BUG IN MY OPINION, BECAUSE THE COMMIT OF SESSION1 HAS NOT RETURNED YET

When I then – in session 3 – issue the same select “select n1 from t1;” again – session 3 also hangs !! (and doesn’t get the “2” for a second time) – v$session_event and v$session_wait show me that session 3 is now waiting for “log file sync”, which seems to be the correct behaviour to me

So, I would say, we have an issue here (somewhere in the handling of the log buffer) because the first select gives a “2” – obviously from the log buffer.

Comment by Sokrates — August 19, 2011 @ 8:29 am BST Aug 19,2011 | Reply
- We now have two types of commit, one that waits for the log file sync and one that doesn’t. If “session 1” had done a “commit write nowait” then the listed behavior would have been as you might expect. The sole purpose of that is to allow processes to see committed data that isn’t guaranteed to be written to the redo log.
  
  Anyone got a 9i or 10gR1 instance to see if it was that change which prompted this behavior ?
  
  Comment by Gary — August 19, 2011 @ 9:45 am BST Aug 19,2011 | Reply
  - Gary,
    
    I thought of that but 9i introduced log parallelism, which might have been the trigger for change. That’s why I also tested 8.1.7.4 – and got the same behaviour.
    
    In fact, when I read Tony’s original comments my initial response was “it’s bound to happen because of the way that redo copy works” – I think Steve Adams described the underlying mechanism of the redo copy action in his book years ago, but I hadn’t drawn the obvious conclusion from his description.
    
    Comment by Jonathan Lewis — August 19, 2011 @ 10:19 am BST Aug 19,2011 | Reply
    - Jonathan,
      
      In “Oracle8i Internal Services for Waits, Latches, Locks, and Memory” by Steve Adams, the main point about redo copy is:
      
      “No-wait mode is used for most gets against these (redo copy) latches, because Oracle can use any one of them to protect the copy into the log buffer.”
      ( http://goo.gl/MDkRd )
      
      If this is the explanation for the observed behaviour, could you explain why? :-) Thanks!
      
      Marcus
      
      Comment by Marcus Mönnig — August 19, 2011 @ 11:22 am BST Aug 19,2011
    - Marcus,
      
      I thought there was something in his book which said (paraphrasing massively) that the action was:
      
      get copy latch
      get allocation latch
      allocate
      drop allocation latch
      copy change record into buffer
      apply change record to data blocks
      drop copy latch
      call lgwr if appropriate
      
      In this context “appropriate” means if the total allocaiton now exceeds 1MB, 1/3 of the log buffer, or if the change record is a commit record (or contains a commit vector now that we’re in 10g).
      Maybe it’s on his website rather than in the book.
      
      Comment by Jonathan Lewis — August 19, 2011 @ 11:38 am BST Aug 19,2011
    - I don’t see that this list of actions explain the problem we see here. I think we have to look at UNDO and how and probably WHEN it is handled when committing.
      
      My understanding is that when session 3 executes “select n1 from t1;” there should be an ITL entry in the data block changed by the transaction of session 1. This should enable session 3 to do a consistent read and get the pre-commit version of the data by looking at the UNDO generated for the transaction of session 1.
      
      It seems to me when committing, the writing of UNDO is done before the writting of REDO (or these actions are done asynchronosly in parallel and since LGWR is stuck only UNDO is actually written), so the commit would be recorded in the UNDO data, but not in the REDO data. A consistent read thus would see the change as commited. The instance restart would obviously fix this, since it looks at REDO data which does not have the COMMIT.
      
      Comment by Marcus Mönnig — August 19, 2011 @ 12:21 pm BST Aug 19,2011
    - Marcus,
      
      First, there is no undo record for a commit, there is only the forward change vector to modify the transaction table slot in the undo segment header.
      
      Secondly, when you (session 3) read the ITL you first check the transaction table slot to see if the transaction for that ITL has been committed – and we (session 1) have updated the slot to say the transaction is committed (even though that change has not yet been written to disc).
      
      General point – when looking at the fine details you need to be careful about words like “writing”. We WRITE undo only when dbwr copies buffers to disc. We modify undo blocks with undo records the moment we put the redo records into the log buffer.
      
      I will have more to say, but it’s 8:00 am where I am at present, and I’m just off to work.
      
      Comment by Jonathan Lewis — August 19, 2011 @ 1:01 pm BST Aug 19,2011
    - Are we saying that a COMMIT is happening as soon as it is confirmed that the redo record has been created in the buffer, and not when it’s actually on disk?
      Not sure I believe that.
      
      I agree that this is related to the ISOLATION of the transactions, since I can understand that for performance reasons the UNDO copy of the changed block could be created (for read consistency) as soon as the REDO record is in the redo buffer.
      Otherwise we would be reliant on the disk I/O write rate for our read consistency.
      
      As for read consistency across database links, the Oracle docs say it all http://download.oracle.com/docs/cd/B12037_01/server.101/b10739/ds_txnman.htm#i1008473
      
      Comment by Darryl Griffiths — August 19, 2011 @ 12:39 pm BST Aug 19,2011
    - Darryl,
      
      See my comments to Alex about how open the definition of the expression “commit” can be. From one perspective the commit is the moment you say commit because that’s the moment the transaction table slot in the undo segment header is modified; from another perspective the commit (the one you’ve issued) is the moment you are told by lgwr that the relevant redo record is on disc.
      
      The docs don’t say it ALL about read consistency across database links, by the way – they only give us a couple of useful pointers in the dozen or so lines.
      
      Comment by Jonathan Lewis — August 19, 2011 @ 1:05 pm BST Aug 19,2011
    - Jonathan,
      
      Thanks for giving these details, regarding your “first” and “secondly”. I didn’t know this on this detailed level and look forward to learn more about such from your upcoming book. Your comments also made me curious and I already collected some further information from the web and started to study it.
      
      Having said that, I’d like to add that I do like models. People make models to explain certain things. Not everything at once with all details included, but enough to give an explaination for/of the currently observed problem. Of course, they should not be wrong, but certain simplifications are useful in my opinion. As such, I think what I wrote was a model of the reality with all its technical details to get a point across. Martin Berger’s comment below is another example of imho valid simplifications that are very useful to get on. People think and work on their very own level of detail and knowledge and I am convinced that they get very good results with that.
      
      Regarding “being careful about words like writing” – I lost you there. As far as I understand, you think I should use WRITE when something is written to disc and not when a buffer/block/record is modified in memory. I don’t think so.
      
      The main point in my comment was about the order in which writing redo and – here I go again – writing of undo (simplification for “modifying the transaction table slot in the undo segment header”) was done in Oracle. If I pushed some button, sorry, that wasn’t my intention.
      
      Comment by Marcus Mönnig — August 20, 2011 @ 7:50 pm BST Aug 20,2011
    - Marcus,
      
      Sorry if I sounded abrupt in my previous comment, I was only trying to make a point, not do a put-down.
      
      > Regarding “being careful about words like writing” – I lost you there. As far as I understand, you think I should use WRITE when something is written to disc and not when a buffer/block/record is modified in memory. I don’t think so.
      
      That is exactly what I meant. There are occasions when it’s clear from the context that someone is thinking if in-memory modifications when they use the word “write” ; but in this case the complexity of the problem revolves around the case that we modify the log buffer and then “write” to the log file – and if we use the word “write” to cover the action on memory and disc, anything we say is much more open to ambiguity and mis-interpretation.
      
      Comment by Jonathan Lewis — August 21, 2011 @ 9:33 am BST Aug 21,2011
- Sokrates,
  
  Which version of Oracle are you using – and do you have any audit enabled – or anything else going on that might result in SYS-recursive SQL that needs to do a recursive commit ?
  
  Comment by Jonathan Lewis — August 19, 2011 @ 10:13 am BST Aug 19,2011 | Reply
  - sql > select * from v$version where rownum=1; BANNER ------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production sql > select dbms_utility.port_string from dual; PORT_STRING ---------------------- x86_64/Linux 2.4.xx sql > show parameter audit NAME TYPE VALUE ------------------------------------ --------------------------------- ------------------------------ audit_file_dest string /home/oracle/product/11.2.0.2/rdbms/audit audit_syslog_level string audit_sys_operations boolean FALSE audit_trail string NONE
    
    Comment by Sokrates — August 19, 2011 @ 10:18 am BST Aug 19,2011 | Reply
    - Sokrates,
      
      Not sure why we see different behaviour – you might try dumping the log file to see if there are any unexpected items in it. (Do a switch log file before running the experiment to keep the log file small).
      I have to say that my 11.2.0.2 testing was on Windows 32 bit, and showed a couple of other anomalies – most significantly I couldn’t resume lgwr after suspending it, and had to restart the service.
      
      Comment by Jonathan Lewis — August 19, 2011 @ 11:09 am BST Aug 19,2011
- Sokrates,
  
  > the first select gives a “2″ – obviously from the log buffer
  
  Foreground sessions don’t read the log buffer . Remember that Oracle modifies current blocks in place, so the change to the block has been made before we even have to worry about what’s going on with the timing in the “commit”. The first time around your session sees the “2” because it’s really there in the (buffered) data block. It’s interesting that the second time around your session waits for a ‘log file sync’, though. There is a significant difference (for small transactions) between “commit”, “commit batch” and “commit immediate”. In 10g the first two will combine the commit redo change vector into a single redo record with all the other redo change vectors whereas the “immediate” option makes the commit redo change vector appear as a separate redo change record.
  
  I like your comment that session 3 ought to wait on a log file sync the first time around – I’ll be adding a couple of comments to the posting along those lines in a few minutes.
  
  Comment by Jonathan Lewis — August 21, 2011 @ 8:52 am BST Aug 21,2011 | Reply
My new question here is what does mean to wait feedback from commit?

Before your post I (and many other people) thought that session does data visible after commit only after OK from “log file parallel write”.

Session1 “the session hangs, waiting for “log file sync” – in light of the opened details it becomes meaningless. Not only session in this example, but waiting for “log file sync” by any session is meaningless. What does mean to wait for “log file sync” if data is visible for another sessions yet?

Comment by Alex — August 19, 2011 @ 12:18 pm BST Aug 19,2011 | Reply
- Alex,
  
  I think there are many comments about Oracle that are not made with sufficient precision – look how much fault Sokrates found with my original statements, even though I thought I was explaining something carefully. Comments made about commits and log file sync fall into this area. I think if you read “the commit is not complete until lgwr confirms that it has written” it would be perfectly reasonable to assume that this meant no-one would see the effects of your transaction until that moment. In fact everyone can see the effects of your transaction the moment you issue the commit – the log file sync only ensures recoverability of the transaction on instance failure.
  
  As Ind said, this is actually an isolation problem, not a durability problem – but discussing it gets messy because different people may have different interpretations of the term “commit” when they discuss what’s going on.
  
  Comment by Jonathan Lewis — August 19, 2011 @ 12:36 pm BST Aug 19,2011 | Reply
  - “In fact everyone can see the effects of your transaction the moment you issue the commit – the log file sync only ensures recoverability of the transaction on instance failure.”
    
    For me this is a precise problem description… I’m not willing to accept this… :-) COMMIT should be an atomic action that writes REDO to disk and make the change visible to other transactions.
    
    Comment by Marcus Mönnig — August 19, 2011 @ 12:55 pm BST Aug 19,2011 | Reply
    - Yes, exactly what I wanted to say. (And Jonathan is right, not wrong)
      
      Comment by Alex — August 19, 2011 @ 12:58 pm BST Aug 19,2011
I believe your analysis as an isolation problem rather than a durability problem is correct.
I believe the root conundrum derives from the read consistency model being based on the scn. The data block is modified in cache and on disk with the new scn during the commit but prior to the lgwr action completing. So a query that starts during the commit but after the block modification sees the current block as being an appropriate vintage even though that scn is still in the future with respect to recovery (durability). So by the read consistency model it does not even check undo. To close this gap efficiently, there would need to be some global posting of scn written to log file (probably in a rotating buffer of some significant length where any confirmation the scn is written is good enough so it doesn’t get too hot as single memory reference itself). Then when the query fired it would have an “scn log written guarantee” (slwg) and only have to check for younger blocks than the slwg (of course if the block is younger than the start of query scn, then it has to do the usual read consistency), and when checking in that narrow window between the start of query scn indicating a current block is old enough and the slwg indicating it is not too young (narrow unless lgwr is suspended!) it probably would check for a later slwg first (which would be in force for the rest of the query execution), but then it would have to wait to return rows from that block until it could secure a qualifying slwg or attempt to produce a version of the block older than the slwg.

Or maybe someone can come up with a better algorithm. My point being I *think* by describing what closing the gap would require from a logical requirements viewpoint I *think* the scope of current negative side effects can be circumscribed and understood unambiguously. I *hope* describing the commit as in progress and the current block as being in the future with regard to the recovery stream until lwgr is written is helpful.

I *think* my coffee has kicked in this morning, but if I’ve leapt to an error, surely it will be pointed out. I’m going with posting this now rather than thinking about it for a few hours.

mwf

Comment by Mark W. Farnham — August 19, 2011 @ 1:21 pm BST Aug 19,2011 | Reply
- >So by the read consistency model it does not even check undo
  
  that would be bad… delayed block cleanout is an issue here. To know the fact if commit is done or not it is necessary to check the rollback segs.
  Personally I am waiting for Jonathan’s book to read more about snapshots isolation, ITLs, block scn’s, commit scn’s and all that complicated data structures. Tanel Poder promissed it would be discovered in the book :)
  So far I have at least a few missing pieces how snapshot isolation works exactly in Oracle. And I did explored the web for missing pieces, see here: http://laimisnd.wordpress.com/2011/07/19/reverse-engineering-consistent-reads-multi-versioning-itl-ora-1555-row-level-locks-concurency/
  
  My guess is too that Oracle finally made some error in the way it handles the isolation. See, it’s not trivial, for example – many people talk about block commit scn. But sorry, this is not enough. This particular data item of the block means may be the “last block commit scn” or something else. Block can be modified by many transactions just because it has many rows. Some rows can be commited, some not. SQL is about rows, not blocks. Here ITL’s must be used I think. The block has to be undone transaction after transaction, until the block image is what the query needs. Quite complicated, isn’t it ?
  Now it is enough to erroneously mark some data item(s) in this quite complicated data structure BEFORE lgwr writes to disk and here you are.
  
  Comment by laimisnd (lnd) — August 19, 2011 @ 7:29 pm BST Aug 19,2011 | Reply
- Mark,
  
  I think your comment about the process being driven by SCN, and the concept of the next session needing to “close” the gap between the “commit in memory” and “commit on disk” is significant, and the implied threat of a single point of contention with a “scn log written guarantee” is the thing to watch out for.
  
  It’s made a bit more complicated because of the effect of “group commits” when the log writer writes, and because lgwr doesn’t actually “know” that it’s responding to a commit. There’s also the complication of the system doing the same thing whether it’s the first time around (real-time) or the second time around (crash-recovery) – I can see a gap appearing with this approach where a query run after recovery could produce a different result from the query when it originally ran.
  
  I’ve added a couple of comments to the post that I think pick up your point, and remove the threats.
  
  (I see someone else has picked you up on your comment about the block being modified “on disk” before the log is written – but I know that you know that the data block can’t get to disc until after the relevant log has been written.)
  
  Comment by Jonathan Lewis — August 21, 2011 @ 9:04 am BST Aug 21,2011 | Reply
Thinking this through a bit more, perhaps it would be more straightforward to assign the query start time to the greatest scn with a log written guarantee. Then the blocks associated with commits in flight would be judged too young.

The mechanism and timing of updating the scn to be used by each query then would prevent any new hitch in the reading process (though perhaps a few more blocks would require read consistency rollback clones, but those would be precisely the right blocks that should have been isolated.

I hope I’m not missing something…

Comment by Mark W. Farnham — August 19, 2011 @ 3:46 pm BST Aug 19,2011 | Reply
- Could be some side-effects with a single session that commits a change and immediately requeries the data and expects to see its change. The implicit commits at the start of a DDL would fit that pattern. I wonder whether those commits do or do not wait for the log file sync.
  
  Comment by Gary — August 20, 2011 @ 2:53 am BST Aug 20,2011 | Reply
“The data block is modified in cache and on disk with the new scn during the commit but prior to the lgwr action completing”

With “data block” do you mean ITL? AFAIK row data is modified in cache even before commit. And on disk is flushed by dbwr after some time.
In my opinion the point is simpler: in the ITL the transaction is marked as committed, but it should not.

Seems to be a clear bug, incredible as it sounds.

Comment by Emyl — August 19, 2011 @ 3:48 pm BST Aug 19,2011 | Reply
- Emyl,
  
  > in the ITL the transaction is marked as committed
  
  As a general principle, the “commit” doesn’t touch anything except the transaction table slot in the undo segment header – that’s why Oracle originally called the activity the “fast commit”. It was only the subsequent introduction of the “commit cleanout” that resulted in Oracle revisiting some of the modified blocks to do a partial cleanout on the ITL.
  
  Comment by Jonathan Lewis — August 21, 2011 @ 9:15 am BST Aug 21,2011 | Reply
Hey Jonathan! How come your blog on my blog gets more comments than my original blog! It isn’t fair :-)

First comment:

The question of isolation versus durability is purely a matter of definition. What databases (the more general term Resource Manager is actually more appropriate as message queuing systems and the like are included included) are supposed to do is a) make the transaction durable and THEN b) make its consequences visible to other transactions. What Oracle does in a non-distributed transactions is a) make the consequences of the transaction visible to other transactions and then b) make it durable. So what happens if you have a failure inbetween (a) and (b)? Well if you consider the transaction to be committed at point (a) then there is a durability issue as Oracle will not guarantee to keep the transaction (in fact it will guarantee to dispose of it!). On the other hand, if you consider the transaction to be committed at point (b) then there is an isolation problem because uncommitted data is being exposed to outside transactions. I am not saying anything new here. Just repeating for emphasis what Jonathan and others have already said earlier in this blog.

Second comment:

The seriousness of this issue seems to escape a lot of people. This is not a purely academic matter or an issue that may surface under obscure failure conditions. The test that I devised (and that Jonathan has repeated here) involving suspending the log writer is for demonstration purposes only. On a heavily loaded system some data is almost guaranteed to be lost every time the database crashes. I’ll try and expand on this in a follow-up blog of my own over the weekend.

Comment by tonyhasler — August 19, 2011 @ 7:20 pm BST Aug 19,2011 | Reply
Sigh. I had hoped to avoid the whole paragraph about delayed block cleanout by the reference to a block of appropriate vintage.

Taking one more pass, I *think* I have it down to being just this simple: Oracle is apparently incrementing the current scn from which subsequent transactions get their order in time immediately at the start of a commit rather than after that commit is securely written.

Comment by Mark W. Farnham — August 19, 2011 @ 8:31 pm BST Aug 19,2011 | Reply
The existence of a ‘COMMIT NOWAIT’ option indicates that that there is the possibility of a committed transaction that is not durable.

Anyone care to throw RAC into the problem ? Multiple log writers and the failure of one to make a transaction durable doesn’t prevent other instances from seeing the ‘committed not durable’ data and making further ‘committed AND durable’ data changes based on those reads. As such, I say this is an isolation problem. Or rather our isolation level is implicitly ‘READ COMMITTED’ not ‘READ DURABLE’

Comment by Gary — August 20, 2011 @ 3:02 am BST Aug 20,2011 | Reply
- Gary,
  
  One of the features of RAC is that a current block will not be passed from one instance to the other until the redo log responsible for the change has been written to disc – at least, that’s what the manuals say – and there are statistics relating to this behaviour like: “gc current block flush time” and waite event “gcs log flush sync”, so I don’t think RAC increases the risk of the problem becoming more visible.
  
  Comment by Jonathan Lewis — August 21, 2011 @ 11:44 am BST Aug 21,2011 | Reply
  - Anyone care to throw DataGuard into the problem ?
    A DataGuard environment changes the semantics of a “commit” depending on the configured protection mode, correct ?
    ( see http://download.oracle.com/docs/cd/E11882_01/server.112/e17023/dbresource.htm#DGBKR270 for details)
    
    So, the docs seem to tell me, that our problem cannot occur in “Maximum protection mode” – would be nice to know if this is really true
    ( afterwards it would also be nice to know whether my bank operates my account in “maximum protection mode”, but that’s only my personal problem and I can ask them )
    
    Comment by Sokrates — August 21, 2011 @ 7:29 pm BST Aug 21,2011 | Reply
    - Sokrates,
      
      I wouldn’t bet on the documents being 100% accurate, as that would imply a dramatic change in the code handling the way that lgwr and the commit mechanism work. However, I don’t fancy setting up a standby database right now to re-run the test.
      
      Comment by Jonathan Lewis — August 21, 2011 @ 9:35 pm BST Aug 21,2011
As I read this post the first time my reaction was:
That MUST NOT be true!
But it IS, and somehow it WILL BE because of a single fact: The way how current computers work.
To be honest, I do not understand what really is going on at a commit. I’m hunting for this information provided in a way even I can understand, but that’s not important. Based on my un-knowledge of all these details (blocks, ITL, redo streams, blaaah) I’m using non-technical words, just not to mis-use these.
From my point of view a ‘commit’ should do 3 things (at least)
1) write anything to disk which makes the transaction durable
2) make the new data from this transaction visible to others
3) inform the initiator of the ‘commit’ everything is done as requested
maybe there are also other steps, but these 3 are enough to show my argument:
We all would like these steps to be done in a Atomic way: either all of them, or none, and all at the same point in time.
But this seems impossible for me.
They are slightly different tasks: writing to disk (or even SSD) needs some interaction with a ‘external’ subsystem (from the point of CPU) – you can never be sure what’s the status, if something inbetween happens. (guess the data is written do disk and at the moment the SCSI-‘commit’ comes back to the PCI-Card, the power is down!?)
Even point 2 and 3 can not be done at the same time. At least I know no CPU where one running process can return something (via function, semaphor, …) to another running process and in the same cpu-command flip a bit in memory so all others ‘see’ it.
I hope you agree, it’s not possible to be really Atomic?
So we can only discuss about the preferred ORDER of the steps:
Right now without any better knowledge I would prefer:
1st write to disk
2nd let the others access the new data
3rd inform the transaction about successful commit
What is your preferred order? and how would it affect the testcase?
Martin

Comment by Martin Berger (@martinberx) — August 20, 2011 @ 7:11 am BST Aug 20,2011 | Reply
- >I hope you agree, it’s not possible to be really Atomic?
  
  those ideas can answer your questions:
  
  – kill yourself. see, database crashes if lgwr encounters problems! That is for a reason!
  
  BTW, this particular test case uses suspend lgwr. lgwr and database is prevented from crash and hence the problem.
  
  – atomicity. redo operates at device(?) block size for a reason.
  
  – timing. For example, when 2 entities (cpu and disks for example) communicate they agree on one thing: if no message comes in nn seconds then kill yourself.
  
  Comment by laimisnd (lnd) — August 20, 2011 @ 9:08 am BST Aug 20,2011 | Reply
- Martin,
  
  I think you’ve summed up the root problem, which is the sequence of events.
  
  Currently it is possible for session 3 to see that data has been committed before session 1 knows that the transaction has been made durable. Even if we stopped session 3 from seeing the data until session 1 agreed that it the transasction had been made durable we could be in a position where the write to disc had occurred but instance failure meant that session 1 had not seen the acknowledgement; in which event recovery would result in a transaction being visible after recoery when it had apparently not been (durably) committed before the crash. (See also the comments from Gary about RAC.)
  
  Comment by Jonathan Lewis — August 21, 2011 @ 9:25 am BST Aug 21,2011 | Reply
  - >Currently it is possible for session 3 to see that data has been committed before session 1 knows that the transaction has been made durable
    
    that in itself is not that bad. In terms of oracle a commit record in redo file is what makes transaction commited.
    
    Only then oracle should mark transaction commited in the memory: rollback segment headers, blocks, etc, etc.
    If however this part (marking commit in memory) fails then oracle should crash and hope that after restart everything is recovered and fine.
    
    Who does this part ? A user session process? if so then it could be that Oracle does not want a user session to crash a database. Even if it must!
    
    On a contrary, lgwr failure automatically means instance crash. Which solves 99% of issues including the one we are investigating. The problem is that a SUSPEND was used to stop lgwr which prevenetd it from crashing and allowed uncommited transaction to be seen. At least an Immediate crash would leave a very tiny window for this problem to manifest.
    However it would not solve in completely.
    
    Comment by laimis (lnd) — August 21, 2011 @ 3:19 pm BST Aug 21,2011 | Reply
    - thinking more about the implementation I believe oracle left this issue knowingly.
      This is why:
      
      Incredibly, this particular issue left THE DATABASE in CONSISTENT state. Changes based on reading uncommited data could not be saved in the database cause lgwr was suspended and later the database was crashed!
      
      The problem really manifests itself only for the third party observers. For example, for replication into another database as was mentioned in the blog.
      
      On the other hand, trying to solve this issue confronts Oracle with difficult choises:
      
      – lgwr is already a bottleneck. Everyone who needs txn throughput knows that. That means Oracle would rather avoid lgwr doing commit marking in memory (rollback headers, blocks). That leaves this part to be done by user sessions => processes
      
      – if redo write goes first and rollback header marking goes next then the failure of user process to correctly mark rollback headers will create another problem – commited(in terms of redo) transactions won’t be seen.
      to deal with this issue oracle would need to crash the database to prevent the problem. But user processes are many and they do tend to crash.
      Actually, even deciding when and if to crash the database can be tricky – if user process hangs then it hangs, and for how long, or may be it is just slow?
      
      On the other hand it can be argued that won’t seeing commited (in the redo) transactions is not that bad. I can not answer this question write away but it is about basic required properties of serializable transactions. Intuitively it seems all right: transaction won’t be seen until db crash recovery or until some mechanism cleans hanging user sessions.
      
      Comment by laimis (lnd) — August 21, 2011 @ 4:02 pm BST Aug 21,2011
    - p.s. there is one more aspect of the problem: the houskeeping taks of cleaning of failed user sessions.
      
      Currently it is done by smon and smon does not crash the database ( to initiate redo reading and crash recovery)
      Smon appearently reads memory (rollback headers) and can safely rollback transactions left by failed sessions. Why – because transaction is not yet commited into redo files.
      
      Other way arround – commit into redo first and update rollback headers later – effectively mean smon would need to read the redo stream to determine if session commited or not. That would not only complicated the cleaning task but can hang the database to prevent redo from being wrapped over…
      
      Summing it all up again: it is not that rare that engineers compromise. If probability of something going wrong is small then engineers just neglect the issue in favour of cleaner and safer and faster code.
      Actually, the durability is often neglected feature. Many databases just tell you to keep your hardware up and running. To say more, today most of persistent storage has the write cache backed by batteries. Meaning there is no 100% guarantee your redo really really goes to disk. Unless may be you really use two or more separate SAN’s for your redo. Haven’t seen that myself.
      
      Comment by laimis (lnd) — August 21, 2011 @ 5:13 pm BST Aug 21,2011
  - “….Currently it is possible for session 3 to see that data has been committed before session 1 knows that the transaction has been made durable…..”
    
    Even worse:
    We have seen that currently it is possible for session 3 to see that data has been committed before the transaction has been made durable ! ( a small but subtle difference)
    
    In my opinion, it would suffice, if Oracle could guarantee, that another transaction can see committed data only after it was made durable (surviving any crash). This does not imply that the transaction, which did the commit, received no exception ( from the commit ). For example, after been made durable, there could be a network failure in the round trip to the client – I think, we could accept that:
    Session 1 would not know if its “commit” succeeded, but it could ask (for example via a new session ) again *AND WOULD GET AN ANSWER WHICH WOULD SURVIVE A CRASH*
    
    Comment by Sokrates — August 21, 2011 @ 7:18 pm BST Aug 21,2011 | Reply
- Martin,
  I agree that it is a sequencing issue, combined with the fact that there is no way to have a Disk Write done by one process and an update to an in-memory buffer done by another process to be together as an Atomic operation.
  So, there does seem to be a (serious) hole.
  
  Hemant
  
  Comment by Hemant K Chitale — August 22, 2011 @ 3:44 am BST Aug 22,2011 | Reply
I wonder if what we are demonstrating here is the ability to go around the transactional mechanism. By that I mean, session 2 with oradebug is outside transactions. Oracle is optimistic, so it expects all will eventually be well, hence session 3 can see it, which is sensible because it is in fact, done at a time after the commit. Doing the shutdown abort and recovery materializes an error that says that it wasn’t actually true, so sorry, the commit was a mistake, sometimes the glass is half-empty.

Now, if by doing this there is a way to corrupt the redo stream so the recovery can’t work, that would be scary.

Comment by jgarry — August 21, 2011 @ 6:11 am BST Aug 21,2011 | Reply
- Joel,
  
  In what way is oradebug “outside transactions” ?
  
  In any event, as Tony Hasler pointed out, oradebug is there simply to emulate a slowdown (and utlimate failure) in lgwr – the contradictory outcomes can still occur in “real life”.
  
  Comment by Jonathan Lewis — August 21, 2011 @ 9:18 am BST Aug 21,2011 | Reply
Regarding the proposed steps to perform the commit (the “two phased commit” protocal ) :

– the houskeeping of failed user sessions. I have already commented about that above. The idea is that smon or whatever has to scan redo stream to know if txn was really commited (written into redo) and possibly suspend database to prevent redo wrapping. Or if redo is already wrapped then…

– not sure if the last step for lgwr to post sessions in exact commit order is required. Serializable transactions should not care that much about timing of sql commands. If however this is required then again, it introduces quite a heavy performance penalty…

All that makes me believe even more that Oracle just compromised here :) …

Comment by laimisnd (lnd) — August 21, 2011 @ 5:43 pm BST Aug 21,2011 | Reply
- laimisnd (lnd),
  
  > The idea is that smon or whatever has to scan redo stream to know if txn was really commited (written into redo) and possibly suspend database to prevent redo wrapping. Or if redo is already wrapped then…
  
  If a session issues a commit but fails to update the transaction table that means the session failed to issue a commit, and the normal process recovery mechanisms take over – pmon will tidy memory and smon will rollback.
  
  If a session issues a commit but fails (or gets lost or stuck) before it can call lgwr then the redo record for that commit will be written the next time any other session issues a commit. There is no need for smon to get involved, and the next session that needs to check whether the transaction was committed or not will see the transaction state as “writing” and can detect that the write has to be complete.
  
  – not sure if the last step for lgwr to post sessions in exact commit order is required. Serializable transactions should not care that much about timing of sql commands. If however this is required then again, it introduces quite a heavy performance penalty…
  
  I’m not completely sure either, but I keep wondering about possible race conditions in things like buffered queues.
  
  All that makes me believe even more that Oracle just compromised here
  
  There is no question about that, and perhaps it was a compromise that had a vanishingly small risk of error (or of an error being noticed) when the decision was made about 22 years ago.
  
  Comment by Jonathan Lewis — August 21, 2011 @ 9:25 pm BST Aug 21,2011 | Reply
  - >If a session issues a commit but fails to update the transaction table that means the session failed to issue a commit, and the normal process recovery mechanisms take over – pmon will tidy memory and smon will rollback
    
    mmm, yes. but to know if session commited (in terms of redo written) or not one either have to scan redo (which would make oracle redo log similar to that of sql severs’s :) ) or lgwr has to feedback some background process what processes/txns it has posted back but potentialy nobody was there to listen and complete the delayed-delayed txn commit. Which would create quite a data stream between lgwr and that process. That’s seems to be doable but yet another contention point.
    
    Comment by laimis (lnd) — August 22, 2011 @ 5:17 am BST Aug 22,2011 | Reply
    - If the transaction table has been updated then the commit record has already been placed in the redo buffer.
      If the commit record got into the redo buffer it would have been written the next time the log writer wrote – smon doesn’t have to scan the log file to find this out.
      If you think smon has to scan the log file, could you please list the sequence of events – from the moment just before a session issues a “commit;” that would make this necessary.
      
      Comment by Jonathan Lewis — August 22, 2011 @ 7:33 am BST Aug 22,2011
    - I take your steps, in particular, step 3 “When the original session is posted by lgwr to continue it could modify the transaction table slot without generating ” Let’s call it “finalize commit.” Failure to finalyze the commit will result in txn left invisible.
      
      If user session crashes before it is able to finalyze the commit then there are two places to check: in redo and naturally lgwr knows that.
      
      Comment by laimis (lnd) — August 22, 2011 @ 10:11 am BST Aug 22,2011
[…] As the updated version of my last post stated, what Jonathan Lewis described as a “truly stunning observation” was uncovered: depending on your point of view, Oracle violates either the “I” in ACID or the “D”. I won’t repeat the technical description of the issue. This is already covered here and here. […]

Pingback by Why is ACID important? « Tony’s Oracle Tips — August 22, 2011 @ 9:35 am BST Aug 22,2011 | Reply
Hi Jonathan,

As you say : “Ideally your session needs to do something that puts the “commit record” into log buffer without letting anyone else see the undo segment header change, and only lets them see the change after your session has been posted by lgwr”….

Does that mean that one of the options to avoid this is :putting the commit record to redo buffer and posting back by lgwr to the user session should be an Atomic operation ?

Comment by Saurabh Manroy — August 23, 2011 @ 3:55 pm BST Aug 23,2011 | Reply
- Saurabh Manroy,
  
  For an issue as subtle as this you need to be very clear in your question. Who, in your question is “putting the commit record” ? If it is the session how can an action by your session and an action by lgwr be atomic ? If you think lgwr is going to do the “putting” – how should this be achieved ?
  
  Comment by Jonathan Lewis — August 23, 2011 @ 4:05 pm BST Aug 23,2011 | Reply
  - Apologies for using “putting”. I meant user session by that.
    Got your point.
    
    Comment by Saurabh Manroy — August 23, 2011 @ 4:40 pm BST Aug 23,2011 | Reply
[…] more about DMVs and one of the first items he covered was the log cache manager. Considering a post that was put up on the Oracle front last week by Jonathan Lewis, which led to this post by Gary […]

Pingback by SQLskills Training, Day 5: Beyond Training | Erin Stellato | Erin Stellato — August 24, 2011 @ 3:08 am BST Aug 24,2011 | Reply
Here’s an interesting suggestion to deal with the problem.

Comment by Jonathan Lewis — August 24, 2011 @ 10:02 am BST Aug 24,2011 | Reply
I’ve realised that my solution has a significant defect.

If you commit at an SCN, but the status is “in flight” until the write is complete then a session that starts running in read only (or serializable) mode could start a query after the commit but before the write.

This session would not see the “in flight” data at first, but would see it if it repeated the query at the same SCN after the log write complete.

The solution offered by Sai – see previous comment – avoids this because the queries would be running from an earlier SCN.

Comment by Jonathan Lewis — August 24, 2011 @ 10:07 am BST Aug 24,2011 | Reply
I see Tony Hasler has a very interesting follow-up to his previous posting: http://tonyhasler.wordpress.com/2011/08/24/acid-and-tpc-c/ Read it very carefully.

Comment by Jonathan Lewis — August 25, 2011 @ 8:47 am BST Aug 25,2011 | Reply
I tried one related test for which, in retrospect, I should have been able to guess the result. But that led to another question which I don’t have time to test at the moment.

Test #1 (in pseudocode. I can always attach the details later)

– create table test_acid (time, counter)
– create procedure acid_proc
as
insert into test_acid values(sysdate,1);
commit write immediate wait;
dbms_lock.sleep(15)
…
insert into test_acid values(sysdate,8);
commit write immediate wait;
dbms_lock.sleep(15)

– shutdown abort/restart
– job restarts automatically at the beginning (back at counter=1)

– set up update trigger to kick of job (for fire_trigger.status => ‘Y’)
– suspend LGWR
– fire trigger
– job runs
– shutdown abort/restart
– fire_trigger.status is not equal to ‘Y’
– job is not running

This makes sense as restarting the job requires Oracle writing the job status to the data dictionary and this goes through the redo stream for recovery.

Thought #2:

Where does Oracle bypass the redo stream? If that case exists, we now have local inconsistency.
What about flash recovery area? When does that get written and can we access the changed blocks after recovery?

I don’t have time to test this out just yet.

Comment by Henry Poras — August 25, 2011 @ 6:32 pm BST Aug 25,2011 | Reply
Sir, I have bought your book (ALPHA version which comprises of 2 chapters) 1.Getting Started 2.Redo Undo ………… both are really EXCELLENT .. The nice thing about this is – even a new person who does not know also can easily understand ORACLE .. that way you explain the things is simply SUPERB !! Sir , when the 3rd chapter will be publised ? Many thanks once again for yoru work on this .. Cheers Bix

Comment by berusadla — August 29, 2011 @ 11:35 am BST Aug 29,2011 | Reply
- Bix,
  
  Thanks for the comment. I hope I can maintain the information and clarity – I’m working on Chapter 7 at present, so there’s plenty more for you to read.
  
  The cycle of activity means that different people are doing different jobs with different chapters at the same time. I’ve just posted chapter 3 back with my final edits – but there’s a copy-edit process that takes place next, and I don’t know what else happens between my last edit and the Alpha release. With a little luck you might get Ch.3 quite soon.
  
  Comment by Jonathan Lewis — August 29, 2011 @ 6:32 pm BST Aug 29,2011 | Reply
The topic might be a little outdated but for those who are still thinking about the issue the following article might be interesting:

“Transaction Isolation and Lazy Commit”

Click to access ICDE2007_Transaction_Isolation_and_Lazy_Commit.pdf

The article talks about LazyCommitLCN – the concept seems very similar to me to that proposed by Sai. The article attacks the problem from a bit different angle: if a reader(reading session) detects transaction which is not yet made durable it initiates log buffer flush.

Comment by laimis (lnd) — September 5, 2011 @ 8:29 am BST Sep 5,2011 | Reply
laimis,

An interesing, but hardly, novel, approach. Essentially, rather than keeping rows locked until the lazy committed transaction is durable the page containing the row (or rows) is marked DIRTIED_BY_LC_XACT. What is the difference? Perhaps the expediting of the log flush by the reader is new but I seriously doubt it. Whichever way you cut it, this is essentially the same approach as used by IMS/FastPath and how I assumed (incorrectly) that the PL/SQL lazy commit optimisation worked here.

Perhaps the most fascinating part of this is that the writers of the article supposedly researched Oracle’s commit processing algorithm and like the rest of us (until now) stopped short of discovering the fatal flaw in Oracle’s approach.

Comment by tonyhasler — September 5, 2011 @ 9:59 am BST Sep 5,2011 | Reply
- >An interesing, but hardly, novel, approach.
  > Essentially, rather than keeping rows locked until the lazy committed transaction is durable the page containing the row (or rows) is marked DIRTIED_BY_LC_XACT. What is the difference?
  
  Exactly. I only wanted to search CS field for answers. Oracle is hardly the first facing the problem. It is important to identify it and try to apply some Computer Science to it. If none is found then may be someone can do some :)
  
  Personally I am interested if the WAIT by readers can be eliminated. That allways was what oracle was proud of – readers do not have to wait.
  Sai suggested an elegant thing – turning the clock back for some SQL commands:
  
  http://sai-oracle.blogspot.com/2011/08/is-oracle-acid-compliant-database.html
  
  That for me looks very much like using a flashback query.
  
  I only think this approach needs a few things to be investigated:
  – locking. Commit not only makes transactions visible, it releases locks too.
  – somehow it doesn’t feel obvious for me that only some SQL commands (excluding DML’s) are subject to this FlushSCN. This thought connects to locks release. It looks a bit suspicious that one type of commands (even of the same transaction) may see later data than others.
  – performance. That seems a minor issue actually.
  
  Actually this excellent blog by Saibabu Devabhaktuni (Sai) makes one think how SQL command(a query) obtains the CurrentSCN and it’s relationship to CommitSCN. First it looks like this is a serialization point, a protected critical section with respect to (at least) those two processes: a query getting CurrentSCN and getting CommitSCN (which is done by user process I assume, not lgwr itself.)
  Then the need for CommitSCN to be in redo stream dictates the ordering of actions: first modifying in-memory structures(rollback blocks) to mark transaction commited (and immediately visible to others, with locks released) and then redo flush (to disk).
  
  This requirement leads almost directly to those two approaches we discussed here:
  
  – locking and waiting for flush to complete
  
  – adjusting time: using something else for QuerySCN than currentSCN. That Saibabu Devabhaktuni’s idea uses MVCC at full and for that reason it looks very attractive and consistent with “readers do not have to wait”. It only has to be formally verified, and the correct way to think for me seems to look more closely to those key timestamps:
  
  -CommitSCN,
  -FlushSCN,
  -QuerySCN.
  
  just my 2 cents….
  
  Comment by laimis (lnd) — September 6, 2011 @ 8:27 am BST Sep 6,2011 | Reply
Useful Blog

Thanks

Comment by Ariq — November 21, 2011 @ 2:39 pm GMT Nov 21,2011 | Reply
Something similar when using AQs in RAC : Oracle Support article : “Query returns 0 rows after successful row insert/message enqueue when using Tibco Application in a RAC environment [ID 1412774.1]”

Comment by Hemant K Chitale — February 14, 2012 @ 6:21 am GMT Feb 14,2012 | Reply
- Hemant,
  
  As you say, it’s similar; but I don’t think it’s really the same because it looks like the (very old) problem of two RAC nodes operating briefly at different SCNs. Alas, the speed of light is not infinite – and even smaller when not passign through a vacuum.
  
  Comment by Jonathan Lewis — February 15, 2012 @ 2:31 am GMT Feb 15,2012 | Reply
Hello,

From what I understand, this behaviour means that every information that gets out of the database (be it information read by a human user, be it information communicated to another system) is unreliable in the sense that it may be information that potentially doesn’t survive an instance crash. If this is correct, it sounds like a serious issue to me, as also mentioned by Tony Hasler in his comment dated August 19, 2011. The potential consequences seem to be dramatic as pointed out in Tony Hasler’s blog “Why is ACID important?” (i.e. confirmed business transactions lost after an instance crash, messages sent out twice due to an instance crash). This and the fact that Oracle makes lot of effort to ensure the ACID property in other situations (such as disabling the PL/SQL COMMIT optimization for distributed transactions) leads me to the following two questions in this context:
1) Is the problem really as bad as it sounds to me or are there mitigating circumstances which I am ignoring?
2) Is there any official statement from Oracle Corp. regarding this matter since it was discussed publicly more than 6 months ago?

Thanks for any reply
kind regards
Martin

Comment by Martin Maletinsky — April 11, 2012 @ 8:01 pm BST Apr 11,2012 | Reply
Hello

After reading about this problem (originally in your book “Oracle Core”) my first thought was the problem could easily be fixed by reordering the sequence of events at commit time, i.e. the session only updates the transaction table slot after receiving confirmation from the log writer that the redo has been written to disk.

From your comment in the update you added as well as from your answer to Martin Berger’s post which you refer to in this update I understand that my approach would open another possibility for inconsistency – however, I don’t understand why.
To keep the text more readable, let me use “commit record” as a shorthand for “the change vector describing the transaction table slot update”. You say inconsistency could result if the instance crashed after the log writer has written the commit record to disk and before the session executing the transaction is informed about that write. However, if the instance crashes, I don’t see what difference it makes if the session managed to update the transaction table slot before the crash or not. The timing of this updates is not determinitstic anyway (even without a crash it may be delayed arbitrarily as a result of scheduling). Moreover the update affects volatile memory and is lost after the crash anyway – it will however be recovered from the redo that was written to disk upon restart of the database.

I see an issue, if the session terminates after the redo has been written to disk and before the session managed to update the transaction table slot. In this case PMON will roll back the open transaction and it is therefore necessary to prevent the transaction from being recovered from the redo logs after a subsequent instance crash. However, couldn’t this be solved by extending the cleanup work performed by PMON upon termination of the session, e.g. by having the PMON modify (delete) the commit record that was previously written to disk? I am aware that this would contradict the current philosophy of writing sequentially to the redo log and not having to modify redo data once it has been written in the past, however I can’t see a fundamental reason against it.

Thanks for any elucidation on this matter
kind regards
Martin

Comment by Martin Maletinsky — April 11, 2012 @ 8:12 pm BST Apr 11,2012 | Reply
- Martin,
  
  I know it is a bit unusual for somebody other than the blog owner to reply to comments but I am pretty sure I understand your confusion so with apologies to Jonathan….
  
  The key point is your statement “…after receiving confirmation from the log writer that the redo has been written to disk”
  
  What redo are we talking about? The redo associated with the changes to the permanent table blocks or the redo associated with the change to the transaction slot in the undo table space? If you mean the former then that doesn’t work because the changes will be rolled back after crash recovery. The thing that makes a transaction durable after a crash is the logging of the update to the transaction slot.
  
  So the way to implement what you are suggesting is:
  
  – Pin the undo segment header in cache
  – Log a change to the transaction slot without actually making the change to the transaction slot in memory
  – Get a confirmation from LGWR that the change is on disk
  – Actually make the change to the transaction slot in memory.
  – Unpin the undo segment header
  
  Of course, there are variations and there are complications should the client process crash or hang but you get the basic idea.
  
  This approach would work but violates a couple of fundamental aspects of Oracle architecture. As Jonathan explains in his book the process of generating redo is fundamentally tied into the process of making block changes.
  
  Despite the open heart surgery on the database engine that your suggestion requires, I believe that something like this would need to be done should Oracle ever be forced to fix this problem by somebody such as a major client or the powers that be at TPC.
  
  Comment by tonyhasler — April 11, 2012 @ 9:20 pm BST Apr 11,2012 | Reply
  - Hello Tony,
    
    Thanks a lot for your detailed reply (I didn’t expect a reply that quickly). Yes, it is the the redo associated with the change to the transaction slot in the undo table space I was referring to when I used the term “commit record” and it was this change I wanted to wait for to be written to disk before updating the transaction table slot. And yes again, the steps you list are exactly what I had in mind – only that I didn’t think about pinning the undo segment header. Is this pinning strictly necessary – wouldn’t this inhibit concurrency significantly (if you pin it in exclusive mode) or did you just intend to pin it in shared mode to avoid it from being evicted from the cache before it’s updated?
    
    However, I understand that Jonathan was saying this approach might lead to inconsistencies as well:
    “[…] and only lets them see the change after your session has been posted by lgwr (and even that leaves a window for inconsistent recovery if the instance crashes after the write has happened but before your session has been posted – see my reply to Martin Berger […]” in Jonathan’s update to his original post
    “[…] Even if we stopped session 3 from seeing the data until session 1 agreed that it the transasction had been made durable we could be in a position where the write to disc had occurred but instance failure meant that session 1 had not seen the acknowledgement; […]” in Jonathan’s reply to Martin Berger.
    
    I am aware that the approach is an open heart surgery, however it is my understanding that the problem to be fixed comes close to a serious heart attack (see my comment 27 above). I don’t know the Oracle source code but it seems hard to imagine that the problem can be fixed without significant modifications in central areas of the code.
    
    kind regards
    Martin
    
    Comment by Martin Maletinsky — April 11, 2012 @ 9:42 pm BST Apr 11,2012 | Reply
Hello

I have two questions related to the solution you suggest to the problem (in the update you added to the initial post). As in my previous post, let me use “commit record” as a shorthand for “the change vector describing the transaction table slot update”.

1) How do you deal with the scenario where the session owning the transaction terminates after it posted the LGWR but before it could update the transaction table slot from ‘commit waiting on lgwr’ to ‘committed’? Will PMON have to wait for the LGWR to write the corresponding commit record to disk and then do the update (or roll back the transaction in case the log writer crashes)?

—

2) In the fourth (last) bullet point you mention a consistency problem. If I understand it correctly, the problem results from multiple sessions committing transactions which might update the transaction table slots from ‘commit waiting on lgwr’ to ‘committed’ in a different (temporal) order than the (physical) order of the corresponding commit records in the redo log file.

a) Is this my understanding of the consistency problem correct?

b) (if yes to a) Why does this different ordering cause a consistency problem – could you please sketch a scenario how this can lead to inconsistent data?

c) If my understanding is correct and the different ordering does cause a consistency problem, am I correct in the assumption that by “waiting (or watching) for it to acknowledge” you mean the background process you mention waits for each process being woken up to update it’s transaction table slot before waking up the next process on the list (otherwise OS scheduling could mess up the order of the transaction table updates even though the processes are woken up in the right order).

d) Point c) brought me to another thought regarding the current COMMIT mechanism. Based on my understanding there is the potential that multiple sessions update the transaction table slots (from ACTIVE to COMMITTED) in a different (temporal) order than the (physical) order of the corresponding commit record in the redo log file. This could happen as a result of OS scheduling, let me sketch the scenario:
t1: Session 1 is about to commit Transaction TA1 and copies the commit record CR1 into the log buffer
t2: Session 1 is taken off the CPU by the OS scheduler
t3: Session 2 is about to commit Transaction TA2 and copies the commit record CR2 into the log buffer
t4: Session 2 updates the transaction table slot TTS2 from ACTIVE to COMMITTED
t5: Session 2 posts the log writer
t6: the log writer writes the redo log buffer to the redo log file (and thus the commit records CR1, CR2 in this order)
t7: Session 1 is put on the CPU again by the OS scheduler
t8: Session 1 updates the transaction table slot TTS1 from ACTIVE to COMMITTED (and doesn’t post the log writer because the record has already been written)
Now you have CR2 after CR1 in the redo log file but TTS1 was updated after TTS2 – can this lead to a consistency problem as well?

Thanks for your help to understand this matter
kind regards
Martin

Comment by Martin Maletinsky — April 11, 2012 @ 9:18 pm BST Apr 11,2012 | Reply
[…] included a demonstration of how Oracle actually fails the ACID test, which he blogged about afterward, definitely give it a […]

Pingback by A NoCOUG to Remember — March 1, 2013 @ 12:54 pm GMT Mar 1,2013 | Reply
[…] after some research this friend did, he found this post from Jonathan Lewis explaining a very rare but possible situation when these DIRTY READS could take […]

Pingback by How PARANOID are you with data? Or are there DIRTY READs in Oracle??? | EDUARDO CLARO — March 25, 2024 @ 8:48 pm GMT Mar 25,2024 | Reply