Oracle Scratchpad

OC 6 Writing and Recovery

Addenda and Errata for Oracle Core Chapter 6 Writing and Recovery

Back to Index

Addenda

p.123 In the Note on this page I mention the option for setting the redo log block size in 11.2; Charles Hooper has written a related note that you might want to read.
p.126 In the side-bar with the title “Messages” I mention the log file sync timeout, and how it was fixed at one second up to 11.1 but became configurable in 11.2.0.1, with a new default value of 0.1 second (10 centiseconds). This point came up in a recent discussion about a RAC system that was showing a lot of “Block Lost” errors on just one of its two nodes; the link being that the other node was showing a very large number of log file sync timeouts. When a block is sent from one RAC node to another the sending node first has to flush the log buffer to disc, and if this takes too long the receiver will time out at 0.5 seconds and report a lost block.
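As a quick health-check on a system like that, the wait histogram shows whether sessions are reaching long log file sync waits; a minimal sketch (v$event_histogram reports its buckets in milliseconds):

        select  wait_time_milli, wait_count
        from    v$event_histogram
        where   event = 'log file sync'
        order by
                wait_time_milli
        ;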
p.143 In the Note on this page, I make some comments about the absolute and relative file numbers for a file. I failed to point out that the relative file number is “relative” to the tablespace. The numbering system is a bit strange, however, for reasons of backward compatibility with Oracle version 7, which allowed only 1023 (or possibly 1022) datafiles in the database compared to the limit of 65,533 that appeared in version 8. In fact the limit for version 6 was only 63 (plus or minus 1), which is why it’s a good idea to use the functions supplied in the dbms_rowid package to translate data block addresses into (file number, block number) pairs rather than try to work out the convoluted way that Oracle mangled the bits to make version 7 backwards compatible with version 6.
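As an example of the safe approach, here’s a minimal sketch using dbms_rowid against a hypothetical table t1 (the table name is invented purely for illustration):

        select
                dbms_rowid.rowid_to_absolute_fno(rowid, user, 'T1')    abs_file_no,
                dbms_rowid.rowid_relative_fno(rowid)                    rel_file_no,
                dbms_rowid.rowid_block_number(rowid)                    block_no
        from    t1
        where   rownum <= 3
        ;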

Errata

p.126 In the side-bar with the title “Messages” last para: “… where the start of current free space is …” should be “… where the end of current free space is …”.
p.129 The Note on this page references a couple of new instance statistics, but calls one of them “redo synch write time (usec)” – this should be “redo synch time (usec)”, without the “write”.
p.140 Figure 6-5, right hand portion: there are four arrows pointing at the top buffer header. The two shorter arrows (i.e. the ones attached to the lower part of the buffer header) should be pointing away from the buffer header.
p.145 Second paragraph, last three lines (just above the heading “Checkpoints and Queues”) the text reads “…it can unlink them from WRITE_AUX, relink them to REPL_MAIN, and…”; the relink goes to REPL_AUX, not REPL_MAIN.

Typos/Grammar/Style/Punctuation

Back to Index

17 Comments »

  1. Jonathan,

    in the section Log Writer Writes, at the end of the note “MESSAGE”

    It’s easy for the session to check, of course, because it knows where lgwr had to get to satisfy its write (the session’s buffer#) and it can see where “the current start of free space” is <– it should be "the current end of free space"

    After flushing the redo entries from the log buffer to the redo log files, lgwr advances the end of free space, so the foreground session can compare its buffer# with that pointer.

    Comment by Sid — December 18, 2011 @ 7:57 am BST

  2. Jonathan,

    In the Note section on page 143, the book states the following, “… For testing purposes you can set event 10120 before adding a data file to the database—this makes Oracle create a file with different absolute and relative numbers.”

    There really is not much information on the Internet about this event number, although I did find a small number of websites (for example http://www.adp-gmbh.ch/ora/tuning/diagnostic_events/list.html ) that suggested that the event number has an entirely different purpose:
    “10120 CBO Disable index fast full scan”

    A small test case:

    ALTER SYSTEM SET EVENTS '10120 TRACE NAME CONTEXT FOREVER, LEVEL 1';
     
    CREATE SMALLFILE TABLESPACE "TEST1" LOGGING DATAFILE 'C:\Oracle\OraData\OR1122P\TEST1.dbf' SIZE 5M EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
     
    SELECT
      FILE#,
      RFILE#
    FROM
      V$DATAFILE
    WHERE
      NAME LIKE '%TEST1%';
     
    FILE#     RFILE#
    ----- ----------
        9         10
    

    The above test case suggests that the Note section in the book is correct. I just thought that I would mention the potential risk of confusion due to conflicting information found on the Internet.

    Comment by Charles Hooper — December 24, 2011 @ 1:02 am BST

    • Continuing the previous test case, turn off event 10120, and create another tablespace with a single datafile:

      ALTER SYSTEM SET EVENTS '10120 TRACE NAME CONTEXT OFF';
       
      CREATE SMALLFILE TABLESPACE "TEST2" LOGGING DATAFILE 'C:\Oracle\OraData\OR1122P\TEST2.dbf' SIZE 5M EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
       
      SELECT
        FILE#,
        RFILE#
      FROM
        V$DATAFILE
      WHERE
        NAME LIKE '%TEST2%';
       
      FILE#     RFILE#
      ----- ----------
         10         10
      

      Event 10120 clearly behaves as described in the book. Now my database has two datafiles with the same relative file number 10. Related to the Note’s comments, how would the user-facing code in Oracle Database that expects the relative file number react to the above situation? It seems as though one other key piece of information must be requested whenever the relative file number is required by user-facing code.

      Referencing the documentation:

      http://docs.oracle.com/cd/E14072_01/server.112/e10595/dfiles001.htm

      “Absolute: Uniquely identifies a datafile in the database.
      Relative: Uniquely identifies a datafile within a tablespace.”
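      For what it’s worth, v$datafile reports the tablespace number (TS#) alongside the two file numbers, which looks like the extra piece of information needed to make a relative file number unambiguous. A quick check, following the style of the earlier test case:

      SELECT
        FILE#,
        RFILE#,
        TS#
      FROM
        V$DATAFILE
      ORDER BY
        FILE#;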

      Comment by Charles Hooper — December 24, 2011 @ 1:24 am BST

    • Charles,

      Thanks for that. It prompted me to add a little note to the Addenda for the Appendix showing how to generate an up-to-date list of events in the 10,000 range, since that’s one of the things I used from time to time when investigating problems.

      I think event 10120 changed its meaning in the 8.1 timescale; the “CBO Index fast full scan” may have been the v7 and 8.0 effect, but I no longer have copies of those versions around to check. It’s a useful reminder, of course, that the events are (a) undocumented and (b) subject to change with no notice.

      I’ve also pointed out in the addenda for this chapter your follow-up detail that the “relative” file number is nominally “relative to tablespace” – although when you look closely the numbering isn’t anything that would match what most people would think of as “relative”, since Oracle doesn’t start each tablespace at (relative) file number 1.
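      A simple dictionary query makes the point (nothing here is version-specific):

      select  tablespace_name, file_id, relative_fno
      from    dba_data_files
      order by
              tablespace_name, file_id
      ;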

      There’s plenty of scope for error when working with file numbers – for example, rowids show relative file numbers, not absolute file numbers:

      SQL> select rowid from t1;
      
      ROWID
      ------------------
      AAAYHMAAGAAAACKAAA
      
      1 row selected.
      
      SQL> select rowid from t2;
      
      ROWID
      ------------------
      AAAYHLAAGAAAAAZAAA
      
      1 row selected.
      

      The format of the rowid is: {data_object_id[6]} {relative_file_number[3]} {block_number[6]} {row_number[3]} so, in this case, both rowids come from relative file AAG (which translates into relative file 6 – easy to count to, since AAA corresponds to zero).

      Since I happen to know that the two tables are in different tablespaces Oracle has to have some way of working out that there are two different absolute file numbers involved – and it can do this through the data object id, which can be used to determine the tablespace number for the segment.
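      As a sketch of that dictionary path (t1 and t2 are the demonstration tables from above):

      select
              o.object_name,
              o.data_object_id,
              s.tablespace_name
      from
              dba_objects     o,
              dba_segments    s
      where
              o.object_name in ('T1','T2')
      and     o.object_type  = 'TABLE'
      and     s.owner        = o.owner
      and     s.segment_name = o.object_name
      ;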

      Comment by Jonathan Lewis — December 28, 2011 @ 7:10 pm BST

  3. Here’s a question from Tony Hasler, originally posted under the main index page for Oracle Core – but copied and answered here.

    I am not sure if this question belongs here or in chapter 6 but here goes.

    I am thoroughly confused by the concept of multiple public redo strands. Suppose three dependent transactions occur in the order T1, T2, T3. Suppose the redo data for T1 and T3 is placed in one public redo strand and the redo data for transaction T2 gets placed in another. It seems that the redo log will eventually contain the data in the order T1, T3, T2 or T2, T1, T3, depending on which strand’s buffer is written out by LGWR first. This would seem to suggest that any recovery process would need some mechanism for looking ahead and reordering redo before applying it, but I have never heard of such a thing. I am similarly confused about how dependent transactions in RAC are recovered.

    Can you help me out?

    Tony,
    Your assumption that recovery has to “look ahead” and reorder redo before applying it is, I believe, correct; and your mention of RAC is extremely pertinent. When the question originally came up about how Oracle could handle recovery from parallel threads, the first answer I heard was “it’s been doing it for years with RAC”.

    There are a couple of public clues about the mechanism – there is a statistic called “redo ordering marks” which is a little suggestive, and there has been a KK lock type for ages that is used as the “Kick” lock when one RAC instance forces another instance to switch log files in order to keep the log file sequences close to each other. Beyond that I don’t have any details – I could do some hand-waving and produce a hypothesis, but with your previous experience in designing logging mechanisms you could probably do at least as well as I could.
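    For reference, the statistic is easy to check with a simple lookup:

    select  name, value
    from    v$sysstat
    where   name = 'redo ordering marks'
    ;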

    The patents supplied by Timur may contain the information you need – but I haven’t read them yet.

    Reply made by Timur Akhmadeev to Tony’s question at its original location:

    Take a look at patents http://www.google.com/patents/US5974425 and http://www.google.com/patents/US7039773.

    Comment by Jonathan Lewis — March 12, 2012 @ 8:10 am BST

    • Jonathan,

      Timur’s references do, in fact, explain everything. I had found the patent for the previous OPS mechanism (with the default few-second delay) but had searched in vain for a description of anything new. The records are indeed cached and sorted before application. Not only does this sort out the issues I raised, it also allows (a) records confirming the writing of a data block to disk to avoid any attempt to apply redo generated earlier, and (b) blocks to be updated in an order that optimises head movements. This algorithm isn’t rocket science but obviously only becomes practical with modern high-memory systems. I haven’t read the patents in detail yet but do intend to. Seems like the contents would make for a good user group masterclass one day!

      Comment by tonyhasler — March 12, 2012 @ 10:41 am BST

    • Hello,

      If I understand the term “dependent transactions” correctly, it implies that T1 commits before T2 makes any changes which depend on changes made by T1, and T2 commits before T3 makes any changes depending on changes made by T2. If this is correct, then I believe the scenario described by Tony Hasler would not be possible.
      If T1 commits before T2 makes any modification depending on T1, then all redo records generated by T1 will be written to the redo log files before any redo records are generated for the dependent modifications made by T2 (and likewise for T2/T3), no matter which redo strands are used by the transactions.
      Did I misunderstand something?

      Example:
      Assume transactions T1, T2, T3 which consecutively increment the value of one column for one row in a table, i.e. “UPDATE T SET VAL=VAL+1 WHERE ID=:X;”. It is my understanding that this is what is meant by “dependent transactions” in the scenario described by Tony Hasler.
      In this example T2 cannot lock the row in table T until T1 has committed. I would assume that T2 will not generate any redo for the increment of T.VAL until it has locked the corresponding row. Once T2 can lock the row, this implies that T1 has committed (or rolled back), and thus all redo records generated by T1 are in the redo log file, i.e. the redo records generated by T2 for the modification (increment) of the column value will be placed after the redo records generated by T1 in the redo log file.

      Thank you for clarification
      kind regards
      Martin

      Comment by Martin Maletinsky — March 14, 2013 @ 9:18 am BST

      • Martin,

        At the hypothetical level given in the manuals you’re correct. In fact Tony’s starting scenario is possible because of a time-lag that he and I have both blogged about elsewhere.

        When a session issues a commit (on a small transaction), the sequence of events is this:

      • copy private redo buffer to public log buffer
      • apply changes to data buffers
      • post message to lgwr to write
      • wait for callback

        Note particularly that the transaction IS visibly committed (the commit change is applied to the undo segment header) before the log writer has been called.

        Because of this it would be possible for a second transaction to see the results of the first committed (but unwritten) transaction, and so on.

        It would, of course, be hard to prove that this had happened – but it would be easy to make it happen by mixing some very rapid, high-volume updates in one session with some highly concurrent, very small transactions from another. At some point lgwr would be busy doing a large write to the log file, giving the high-frequency sessions time to go through Tony’s scenario.
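        A hedged sketch of the sort of workload mix that should produce the effect; table names, sizes and loop counts are invented purely for illustration:

        -- Session 1: generate a large volume of redo per commit so that
        -- lgwr is regularly busy with big writes (big_t is hypothetical)
        begin
                for i in 1..1000 loop
                        update big_t set padding = rpad('x',1000,'x');
                        commit;
                end loop;
        end;
        /

        -- Session 2 (run concurrently): very small, very frequent
        -- transactions that commit while lgwr is mid-write
        begin
                for i in 1..100000 loop
                        update small_t set n1 = n1 + 1 where id = 1;
                        commit;
                end loop;
        end;
        /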

        Comment by Jonathan Lewis — March 14, 2013 @ 2:26 pm BST

  • Hello Jonathan,

    On page 141 you describe the scenario where LGWR is stuck because it cannot overwrite a redo log that contains changes that were not yet written to the data files by DBWR. You mention “… occasionally it means you need a faster device for your redo logs”. Shouldn’t that say “… you need a faster device for your data files”? As I understand it, the redo logs are already being written too fast in this scenario (relatively speaking), so I would not expect a faster device for the redo logs to relieve the problem.

    Thank you for clarification
    kind regards
    Martin

    Comment by Martin — March 18, 2013 @ 9:20 pm BST

  • Hello Jonathan,

    On page 139 you describe how a buffer is getting on the checkpoint queue when it becomes dirty. You also write that the buffer’s LRBA is set – I understand this as being the Redo Block Address of the change record describing the change that made the buffer dirty. However, at the moment the buffer becomes dirty, this change record has not necessarily been written to the redo logs – it may still be in one of the (potentially multiple) log buffers. How can Oracle at this moment determine the Redo Block Address this record is going to be written to? This becomes even more difficult when taking into account the possibility that redo from private redo buffers may yet have to be flushed into one of the public redo log buffers and may thus end up in the redo log files before or after the change record mentioned before depending on the public redo log buffer it is written to and the order in which LGWR will write the public redo logs to disk.
    Could it be (this is just a guess) that for the purpose of checkpointing it is sufficient to have a lower bound of what I understood to be the LRBA, in which case it might be the Redo Block Address last written by LGWR at the moment the buffer became dirty?

    thank you
    kind regards
    Martin

    Comment by Martin — March 18, 2013 @ 9:37 pm BST

  • Hello Jonathan,

    Starting on page 151 you describe the scenario that leads to the “log file switch (private strand flush incomplete)” message in the alert log. I think I got the big picture, but there are a couple of things I didn’t understand.

    1) What does “triggers a log file switch prematurely” exactly mean in the context it appears on page 151 – does LGWR start writing into the new log file at this very moment or does it mean the processing you describe subsequently (DBWR flushing private redo …) is initiated?

    2) After the log file switch is triggered prematurely, DBWR flushes private redo into the public buffers as you describe on page 152. How does the redo finally get into the (old) redo log file? Is it DBWR that (exceptionally) writes this data into the redo log files (as you write “the database writer gets it [the redo data] there [into the log file]“) or is it LGWR? However, if it was LGWR, wouldn’t the data end up in the new log file, as the log file switch has already happened (see question 1)?

    3) When Oracle decides at moment t to switch from log file L1 to log file L2, I believe it must ensure that (i) everything that was in the private or public redo buffers at time t will go into log file L1 and that (ii) no redo generated after time t will go into log file L1. (i) is to ensure that all the redo records in each log file are between this file’s starting SCN and the following file’s starting SCN as you describe on page 151 (I don’t understand exactly why this is necessary, but it seems plausible that e.g. in some scenarios the code needs to be sure it got all redo records up to a certain SCN). (ii) I believe is necessary to ensure that L1 can be safely overwritten once the corresponding media recovery checkpoint completes (as DBWR will have written to disk the data blocks that were modified up to time t, but blocks modified more recently may still not be up to date in the datafiles).
    Does that imply that Oracle will stop redo generation for all sessions (not only those using private redo) until the private strands are flushed into the public redo buffers? This might be necessary to prevent some redo generated before time t from ending up in log file L2 despite the space that was reserved in L1 at the premature log file switch (which would compromise (i)) and also to prevent new redo generated after time t to end up in L1 (which would compromise (ii)). Does Oracle even stop such redo generation until the LGWR has written the entire public redo buffers (including the flushed content from the private redo buffers) into log file L1 or are there other ways (i.e. telling the LGWR after which redo buffer entry it should start writing into the new log file L2)?

    Thank you
    kind regards
    Martin

    Comment by Martin — March 18, 2013 @ 10:44 pm BST

  • Jonathan,
    (purely a nitpick here)
    Error in Errata above:

    p.126 In the side-bar with the title “Messages” last para: “… where the start of current free space is …” should be “… where the end of current free space is …”.
    should be:
    p.126 In the side-bar with the title “Messages” last para: “… where the current start of free space is …” should be “… where the current end of free space is …”.

    This is one of the least expensive books I have ever purchased, given what rich insights it provides!
    Thanks, Merrill

    Comment by Merrill B Lamont Jr. — December 15, 2013 @ 7:54 pm BST

  • Hi Jonathan,

    Since there is a series of changes to undo blocks and data blocks during any transaction, at what point in time are the change vectors for the undo blocks and data blocks written to the log buffer?

    Regards, Vipan

    Comment by vipankumarsharma — August 30, 2014 @ 12:29 pm BST

    To further elaborate on my question: do all the changes follow the write-ahead policy? I mean the changes to the undo segment header at the start of the transaction and the changes related to locking rows in the table.

    Regards,
    Vipan

    Comment by vipankumarsharma — August 30, 2014 @ 1:06 pm BST

