Oracle Scratchpad

February 17, 2011

Philosophy – 14

Filed under: Philosophy — Jonathan Lewis @ 6:48 pm BST Feb 17,2011

Paraphrasing Yogi Berra:

    “It ain’t committed until it’s committed.”

If you’re wondering why it’s worth remembering this odd comment – it addresses the (commonly asked) question:

    “does the redo log contain uncommitted data as well as committed data?”

The answer is: yes.

When a session is creating redo change vectors it doesn’t know whether it is going to commit or rollback. But a session has to be able to store an arbitrarily large list of change vectors somewhere, and that list has to appear in the redo log (ideally “instantly”) if the session commits – so Oracle avoids delays on commit by putting the change vectors into the redo log as they are created***.
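As a toy illustration of that append-as-you-go behaviour (every name here is invented for the sketch – this is nothing like Oracle’s actual redo record format):

```python
# Toy model of an append-only redo stream: change vectors are written as
# they are generated, and COMMIT is just one more record appended to the
# same stream. All names here are illustrative, not Oracle internals.

redo_log = []  # shared, append-only redo stream

def log_change(txn_id, change):
    # goes into the log immediately, whether or not txn_id ever commits
    redo_log.append((txn_id, "CHANGE", change))

def log_commit(txn_id):
    # commit is cheap: no rewriting, just an extra marker record
    redo_log.append((txn_id, "COMMIT", None))

log_change(1, "update emp")    # session 1, uncommitted
log_change(2, "delete order")  # session 2, uncommitted
log_commit(1)                  # session 1 commits; session 2's change
                               # is already in the log, uncommitted
```

The point of the sketch: at the moment `log_change` runs, nobody knows whether that transaction will ever commit – the uncommitted vector is in the log regardless.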

If you view the question from the opposite extreme, the recovery mechanism has to be able to deal with uncommitted data anyway, because there are, after all, several scenarios where data that definitely was committed cannot be recovered. For example: recovery runs to the end of log file 9998 because log file 9999 was destroyed and simply doesn’t exist – how could the code handle transactions that were not committed until part way through file 9999 if it only knew how to handle committed transactions?

*** Not strictly true from 10g onwards where Oracle introduced a delaying effect aimed at reducing competition for the redo allocation and redo copy latches for “small” transactions.

[The Philosophy Series]

13 Comments »

  1. Hi Jonathan

    Could you please clarify the 9998/9999 scenario.

    Whether the entire transaction was in 9999, or half/half in 9998/9999, it would still be lost.

    But that’s because file 9999 is lost, what does it have to do with uncommitted data in redo logs?

    Alex

    Comment by Alex Nedoboi — February 17, 2011 @ 9:24 pm BST Feb 17,2011 | Reply

  2. I don’t get the following statement: “the recovery mechanism has to be able to deal with uncommitted data anyway because there are, after all, several scenarios where data that definitely was committed cannot be recovered”.
    Why does the fact that sometimes committed data cannot be recovered imply that the recovery mechanism has to handle uncommitted data?

    Comment by Patrick — February 18, 2011 @ 9:23 am BST Feb 18,2011 | Reply

  3. I’ll take a stab:

    Let’s say you have an order header and an order line added in one transaction. The header is in 9998, the line is in 9999, and the commit is in 10000. db writer has optimistically written the blocks out. System crashes, 9999 is lost. Recovery rollforward happens, 9998 is replayed, header is never committed and so rolls back… what about line? The recovery mechanism has to handle it somehow.

    Comment by joel garry — February 18, 2011 @ 6:41 pm BST Feb 18,2011 | Reply

  4. It’s clear that my comments about a transaction spanning log files didn’t supply the flash of insight I was hoping for (at least, not to everyone – but then, different people need different mechanisms to aid their understanding.)

    My point was this: if you think that uncommitted data in the redo log stream might cause a problem during recovery then you can put that thought aside because ALL data in the recovery stream is uncommitted data until the recovery process reaches the commit record.

    To emphasise this point, I hypothesised a transaction that started in log file 9998 and had its commit record in 9999. If we then find we have to do a database recovery, we know that eventually we ought to see the effects of the transaction that committed in file 9999. We, the human agent, know that it was committed.

    On the other hand, if file 9999 has been destroyed by the crash that made us start the recovery, then Oracle will get to the end of file 9998 and have no more redo to apply. At that point, Oracle does not know whether the redo it has been applying belonged to a transaction that was originally committed, rolled back, or still in-flight. From Oracle’s perspective, every redo entry comes from an uncommitted transaction until the transaction commit record appears. “It ain’t committed until it’s committed.”
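    A minimal sketch of that recovery principle, with entirely made-up structures (not Oracle’s code path): every change is replayed blindly, only the appearance of a commit record promotes a transaction to “committed”, and anything still open when the redo runs out is rolled back.

```python
# Sketch of the recovery principle: replay the redo stream front to back,
# applying every change; only when a COMMIT record appears does a
# transaction become "committed". Anything still open at the end of the
# available redo is rolled back. Illustrative only, not Oracle's code path.

def recover(redo_stream):
    applied = {}        # txn_id -> changes applied so far (blindly)
    committed = set()
    for txn_id, kind, change in redo_stream:
        if kind == "CHANGE":
            applied.setdefault(txn_id, []).append(change)
        elif kind == "COMMIT":
            committed.add(txn_id)
    # end of redo reached: roll back every transaction with no commit record
    rolled_back = {t for t in applied if t not in committed}
    return committed, rolled_back

# File 9999 lost: txn 42's commit record never shows up in the stream
stream = [(42, "CHANGE", "insert header"),
          (7, "CHANGE", "update stock"),
          (7, "COMMIT", None)]
committed, rolled_back = recover(stream)
```

    Here txn 42 “was” committed as far as the human agent knows, but since its commit record is in the lost file, recovery treats it exactly like any in-flight transaction and rolls it back.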

    Comment by Jonathan Lewis — February 18, 2011 @ 7:21 pm BST Feb 18,2011 | Reply

  5. Jonathan,

    “ALL data in the recovery stream is uncommitted data until the recovery process reaches the commit record” – very well said and I completely agree.

    However, my question still stands.

    Scenario (1). Imagine Oracle wrote committed data only. A transaction spans 9998 and 9999. File 9999 is lost. The transaction is lost.

    Scenario (2). Oracle writes uncommitted data. A transaction spans 9998 and 9999. File 9999 is lost. The transaction is lost.

    The two scenarios only differ in one thing – in (1) the commit record comes before the transaction (actually it doesn’t even have to be in the file, but is implied), and in (2) it comes after the transaction.

    But you see, the transaction is lost regardless. Not because uncommitted data had been written, but because file 9999 is lost. Hence the original question.

    PS There is an exception to this, of course: when someone runs some ad hoc statement, goes to lunch, and then commits it after getting back.

    Alex.

    Comment by Alex Nedoboi — February 19, 2011 @ 6:13 am BST Feb 19,2011 | Reply

    • Alex,

      If you wrote only committed data to the log file, I think you would still have to have a “transaction complete” marker as part of the stream, and logically I think it would have to come at the end of the stream – otherwise you run into the problem of how the software could determine that there hadn’t been some data loss or corruption on the way through the transaction.

      Having said that, the important difference in the two scenarios is that in the case where the transaction uses file 9999 only, the recovery process doesn’t have to deal with any data for that transaction, whereas when the data is split across files the recovery process starts to process a “known committed” transaction but has to have a code path to deal with the case of “transaction not completed, get rid of the changes that shouldn’t have been made”. So even if you think you can make the code “simpler” by only writing committed transactions, you find that you still have to have the code that can handle “uncommitted” transactions.

      Comment by Jonathan Lewis — February 19, 2011 @ 8:58 am BST Feb 19,2011 | Reply

  6. Jonathan,

    To the first paragraph.

    Not sure how it’s relevant to our discussion, but I’d still love to comment. It’s that classic “C strings vs Pascal strings” argument. Well, there are no strings in C per se, so a string is basically represented by a pointer to however many characters, with a 0x00 character at the end.

    In Pascal though, the first byte/2xbytes/4xbytes would indicate the length and the rest would be the actual string.

    Oracle, as we know, follows Pascal’s way of representing varchars (and even chars).

    Therefore, in the log file, it would be sufficient to say – alright, I’m going to write these three change vectors, MD5s/lengths are respectively blah/blah, blah/blah, blah/blah.

    So the recovery process wouldn’t initiate the rolling forward of that particular transaction until it read the three vectors and made sure the MD5s matched.
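    As a sketch of that hypothetical scheme (purely illustrative – Oracle’s redo format works nothing like this): announce the checksums up front, then roll forward only if every vector verifies.

```python
# Sketch of the hypothetical scheme: declare the change vectors' MD5s up
# front in a header record, then roll forward only after verifying each
# vector against its announced digest. Pure illustration; this is nothing
# like Oracle's real redo format.
import hashlib

def write_batch(log, vectors):
    digests = [hashlib.md5(v.encode()).hexdigest() for v in vectors]
    log.append(("HEADER", digests))           # MD5s announced first
    log.extend(("VECTOR", v) for v in vectors)

def replay_batch(log):
    _, digests = log[0]
    vectors = [v for _, v in log[1:]]
    # roll forward only if every vector matches its announced MD5
    ok = (len(vectors) == len(digests) and
          all(hashlib.md5(v.encode()).hexdigest() == d
              for v, d in zip(vectors, digests)))
    return vectors if ok else None            # None: refuse to roll forward

log = []
write_batch(log, ["change 1", "change 2", "change 3"])
result = replay_batch(log)   # all three MD5s match, so roll forward
```

    The drawback Jonathan points out below the quote applies here too: the whole batch has to be buffered “somewhere” before the header can be written.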

    To the second paragraph.

    Guess we are finally getting to the point, and just as you thought I didn’t get what you had said in your original post, I’m thinking you’re not getting what I said in my reply.

    My point was: it is very unlikely for the entire transaction (but not the commit record) to be in file 9998, with the commit record in 9999.

    Being a developer (DBA-ing (DBA-tion?) is just a hobby of mine), I’m not sure how to prove it (log mining, but how exactly), however I have a feeling that’s the case.

    The reasoning is that most transactions would commit immediately after the last DML. Therefore if the transaction starts in 9998 and commits in 9999, it is very likely that some part of the transaction is in 9999. Hence the PS about adhocs.

    So if you can’t recover the transaction, it’s not because Oracle writes uncommitted data, but because file 9999 is lost.

    Comment by Alex Nedoboi — February 20, 2011 @ 8:06 am BST Feb 20,2011 | Reply

    • Alex,

      It’s quite interesting to discuss alternative strategies that Oracle could have used for such things as redo logging – and to spot flaws in their strategy and in your replacement strategy. In your example:

        So the recovery process wouldn’t initiate the rolling forward of that particular transaction until it read the three vectors and made sure the MD5s matched.

      The drawback to this approach is that Oracle would have to record “somewhere” the entire stream of redo log for any long-running transaction until it had acquired the entire transaction stream – that’s the equivalent flaw to the problem I highlighted with “WRITING only committed data”.

      Regarding the discussion at cross-purposes:

        So if you can’t recover the transaction, it’s not because Oracle writes uncommitted data, but because file 9999 is lost.

      Yes, obviously, but the point wasn’t about whether or not you could recover the transaction, it was about how Oracle doesn’t need to be able to tell the difference between “uncommitted” and “committed” change vectors when it’s writing or reading the redo log, and how “writing only committed change vectors” doesn’t reduce the complexity of recovery.

      Comment by Jonathan Lewis — February 22, 2011 @ 11:58 am BST Feb 22,2011 | Reply

  7. Alex,

    Let me add to Jonathan’s clarification :

    a. A “transaction” may consist of:
    i. 1 million individual DELETE statements of 1 row each, followed by a single COMMIT, or
    ii. one single DELETE statement that deletes 1 million rows, followed by a single COMMIT.
    So your assertion that “it is very unlikely for the entire transaction (but not the commit record) to be in file 9998, with the commit record in 9999” doesn’t hold.

    b. While that 1-million-row DELETE is running, there are 15 other transactions (from 15 other sessions) occurring concurrently in the database. Quite obviously, the redo for all 16 transactions cannot fit into the redo buffer, so it flows to disk. It will also be interleaved (the other 15 transactions will not wait for this million-row DELETE to complete before entering their redo!).
    So the end-of-transaction marker for our million-row DELETE is far separated from its beginning.
    Oracle has to be able to match the COMMIT for this transaction with all the redo entries for the transaction.

    c. Undo has to be captured into redo. Else, Oracle cannot rollback a failed/incomplete transaction on Instance/Media Recovery. Therefore, the redo for that transaction also includes the undo — i.e. all the changes that have to be applied to undo blocks.

    As Oracle replays log 9998 to the database during a recovery, it replays all 16 transactions including their undo. It cannot choose to “skip” the other 15 transactions until it completes the first one and then go back to the beginning of each of the other transactions – it would spend far too much time looping through the redo.
    (Imagine 250 transactions spanning 200 redo logs!)
    Only when it finds the commit marker for our 1-million-row DELETE does it know that the delete has committed. It has no way of knowing about the commit till then. The commit is always “in the future” as it applies redo during a recovery.

    If sequence 9999 is missing, it doesn’t find the commit marker. Rollback of the failed transaction is later done from the undo image that has been rolled forward.

    Imagine 200 transactions of 1 row to 10 million rows, of 30 bytes to 10GB each, spanning the redo trail. Oracle has to find the commit marker for each of these 200 transactions in the redo trail. It cannot know in advance that a particular transaction’s commit is missing/unavailable, because it doesn’t even know which redo log file that commit marker will appear in.
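    The interleaving described in point (b) can be sketched like this (all names invented; nothing here resembles Oracle’s real redo format):

```python
# Sketch of interleaved redo: change vectors from concurrent sessions land
# in the stream in arrival order, so one transaction's entries are scattered
# and its COMMIT marker sits far from its first change. Illustrative only.

stream = []

def redo(txn, rec):
    stream.append((txn, rec))

# two sessions writing concurrently: the big delete's entries interleave
# with another transaction's
redo("big_delete", "delete row 1")
redo("txn_a", "update x")
redo("big_delete", "delete row 2")
redo("txn_a", "COMMIT")
redo("big_delete", "delete row 3")
redo("big_delete", "COMMIT")

# a single forward pass must match each COMMIT with all of that
# transaction's earlier entries, however far back in the stream they sit
positions = [i for i, (t, _) in enumerate(stream) if t == "big_delete"]
gap = positions[-1] - positions[0]   # distance from first change to COMMIT
```

    Scale the gap up to 16 sessions and 200 redo logs and the single-pass constraint becomes obvious: recovery cannot afford to loop back for each transaction.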

    Hemant

    Comment by Hemant K Chitale — February 22, 2011 @ 7:07 am BST Feb 22,2011 | Reply

  8. Hemant,

    Your transaction can take seconds, or minutes, or hours. It can affect one row, or one million rows, or one billion rows.

    It all doesn’t matter, because the COMMIT will always take a fraction of a second.

    Even if we, for simplicity’s sake, say that a COMMIT normally takes one second and log switches occur every hour, then the probability of the last change and the commit record being separated is 1/3600 ≈ 0.00028, or 0.028%.

    Alex.

    Comment by Alex Nedoboi — February 22, 2011 @ 10:49 am BST Feb 22,2011 | Reply

    • Alex,

      But why is it only the time between the LAST change and the commit that matters ?

      Comment by Jonathan Lewis — February 22, 2011 @ 11:59 am BST Feb 22,2011 | Reply

      • Jonathan,

        You catch a bus to get from A to B. There can be any number of stops in between or no stops at all. The trip is finished when the driver opens the doors at stop B. Opening the doors is the commit.

        The doors are normally opened as soon as the bus stops at stop B. The driver doesn’t stop at stop B and then wait for an hour before opening the doors.

        I agree with you that if the bus does get to B in time, but the doors are not opened for an hour, and as a result you’re late, that would be a shame. If you know that the transaction had finished but you can’t recover it because it wasn’t committed until after an hour, yes, that would be a shame.

        What I’m saying is that it is very unlikely for the doors to not open for an hour, and for the transaction to not be committed for an hour. The commit normally happens immediately after the last bit of the transaction (adhocs aside), however many stops (steps) the transaction consists of.

        Alex.

        Comment by Alex Nedoboi — May 29, 2011 @ 3:02 am BST May 29,2011 | Reply

