Oracle Scratchpad

September 23, 2007

Faster backup

Filed under: Infrastructure, Performance — Jonathan Lewis @ 7:40 pm GMT Sep 23, 2007

Someone sent me the following question recently:

I am the DBA for a 4 TB data warehouse. It has a thousand-plus datafiles. I use the EMC BCV establish process for backing up the database. When splitting the database I place all the tablespaces in begin backup mode, suspend (alter system), split, resume (alter system) and end backup on all tablespaces.

In 9i and earlier it used to take about 30 minutes to place all the tablespaces in begin backup. Ever since I upgraded to 10gR2 it takes only one minute to place all these tablespaces in begin backup. Could you please shed some light on this?
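For reference, the split sequence described in the question corresponds roughly to the following sketch (the tablespace name is illustrative; a real script would loop over every tablespace in dba_tablespaces):

    alter tablespace users begin backup;   -- repeated for every tablespace
    alter system suspend;                  -- quiesce all I/O for the BCV split
    -- ... perform the EMC BCV split at this point ...
    alter system resume;
    alter tablespace users end backup;     -- repeated for every tablespace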

The answer is, as so often, I don’t know.  (Partly because I don’t own an EMC, and partly because no-one has asked me before). I am, however, prepared to float a hypothesis.

All other things being equal – and they rarely are if you’re comparing 9i with 10g – the change may relate to tablespace checkpoints and write optimisations.

  • When you issue “alter tablespace xxx begin backup” Oracle issues a tablespace checkpoint, which means it writes to disc all dirty blocks in the db cache that belong to that tablespace – and those writes must complete before the command returns. (That’s not a new feature, by the way – it’s been like that for years).
  • In 10g, when dbwr finds several adjacent blocks to write it can “coalesce” them into a single large write – which can make a difference to the way the I/O subsystem has to work. This feature is controlled by the hidden parameter _db_writer_coalesce_area_size and becomes visible in v$sgastat under the shared pool names: dbwriter coalesce bitmap, dbwriter coalesce struct, dbwriter coalesce buffer (a query sketch follows this list).
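A quick way to check whether those structures exist in your version (a minimal sketch; the names are exactly as listed above):

    select pool, name, bytes
    from   v$sgastat
    where  name like 'dbwriter coalesce%';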

In normal checkpointing activity dbwr walks the checkpoint queue in order (see this link for an old, but reasonably accurate, note on the checkpoint queue) and writes a few blocks from the tail end of the queue. For a tablespace checkpoint, however, there may be a lot of blocks from one tablespace to write, and if you take them out of queue order you may be able to use the coalesce trick frequently enough to make a significant dent in the I/O overhead.

I’m not going to claim that this is the correct answer to the original question – but it does seem a reasonable hypothesis. As a test, the person asking the question could check v$filestat just before and just after doing the ‘begin backup’ to see if the statistics for the relevant files suggest that a number of multiblock writes have taken place.
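As a sketch of that check: snapshot the write statistics immediately before and after the command and compare the deltas. In v$filestat, phywrts counts write calls and phyblkwrt counts blocks written, so phyblkwrt growing much faster than phywrts would indicate multiblock writes.

    select file#, phywrts, phyblkwrt
    from   v$filestat
    order by file#;

    alter tablespace users begin backup;   -- tablespace name illustrative

    select file#, phywrts, phyblkwrt
    from   v$filestat
    order by file#;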

Historically, a multiblock write was a sign of a direct write (usually on the temporary tablespace, but also visible for index rebuilds, direct loads etc.)  From 10g, a multiblock write could appear any place, any time, thanks to the dbwriter coalesce feature.

Footnote: on a more mundane note, it is possible that the 9i system had an explicit setting for log_checkpoint_timeout (typically 1800 seconds) whilst the 10g system has been set to use the self-tuning fast_start_mttr_target – which could have reduced the notional log_checkpoint_timeout to a much smaller value, thus keeping the number of dirty blocks in the buffer much lower. A check on v$instance_recovery (particularly target_redo_blks and log_chkpt_interval_redo_blks) would confirm this hypothesis.
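The check, as a sketch (column names as in 9i/10g):

    select target_mttr, estimated_mttr,
           target_redo_blks,
           log_chkpt_timeout_redo_blks,
           log_chkpt_interval_redo_blks
    from   v$instance_recovery;

Since target_redo_blks is the binding minimum of the individual limits, comparing it with the timeout and interval figures shows which setting is actually driving the checkpointing.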

21 Comments »

  1. Kevin Closson has already investigated the “dbwriter coalesce” feature in 10gR2 – not in the context of a tablespace backup, but possibly during the checkpointing caused by log switches, a similar operation applied across all tablespaces:

    http://kevinclosson.wordpress.com/2006/12/01/dbwr-multiblock-writes-yes-indeed/

    What I’ve found intriguing is that (click on the second image) blocks are written (almost) in order, first by datafile, then by block id [Kevin has filtered out single block writes in the grep, but probably they are sequential as well].

    If confirmed, I think this is even more important than the coalescing of adjacent blocks: writing in order means less disk-head movement, hence higher throughput from the disk array cache to the physical disks (so the cache fills completely less often, and even when it does fill, the physical bottleneck is less severe). Coalescing would then be just an additional improvement made possible by DBWR walking the blocks in order.

    Comment by Alberto Dell'Era — September 23, 2007 @ 11:32 pm GMT Sep 23,2007 | Reply

  2. Alberto,

    That’s a very good point; especially when you consider the “sector” sizing of typical SAN caches. I believe EMC uses a 32KB memory chunk as the unit of cache, and I’ve just heard that HP’s XP12000 uses 256KB.

    If blocks arrive sufficiently out of order, it seems possible that you could invalidate the same 256KB “sector” several times in the course of a checkpoint because of simple timing; whereas if the blocks arrived in an ordered fashion you might use a sector for several near-neighbours just once.

    We’ll have to ask Kevin to comment on that.

    Comment by Jonathan Lewis — September 24, 2007 @ 6:33 am GMT Sep 24,2007 | Reply

  3. The mundane note in the initial part of the blog (not in the comments section) is very true in my case. In 9i checkpointing was done through the setting of log_checkpoint_timeout to 1800 (the default). In 10g that was changed to fast_start_mttr_target.

    Comment by Venky Devanarayanan — September 25, 2007 @ 1:16 pm GMT Sep 25,2007 | Reply

  4. We’ll have to ask Kevin to comment on that.

    …I don’t have much to say, other than that this stuff is entirely too complicated (storage processors, that is). Just give me a few hundred simple drives directly attached:

    http://kevinclosson.wordpress.com/2007/08/13/hard-drives-are-archane-technology-why-cant-i-realize-their-full-bandwidth-potential/

    Comment by kevinclosson — September 25, 2007 @ 9:23 pm GMT Sep 25,2007 | Reply

  5. Another note here. I am wondering about that step to suspend and then resume I/O.

    If the tablespaces are already in backup mode, chances are that their SCNs are going to be frozen at slightly different points in time (unless you use alter database begin backup, in which case I haven’t confirmed whether it’s the same SCN for all files), so recovery is going to be needed anyway (it always is – we keep making the datafiles dirty after putting them in hot backup mode), so why bother quiescing the system?

    Just apply redo until you’re out of backup status (i.e. until the alter tablespace/database end backup).
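    For reference, the 10g whole-database variant mentioned above replaces the per-tablespace commands with a single pair:

        alter database begin backup;
        -- ... perform the split at this point ...
        alter database end backup;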

    Comment by Unai Basterretxea — September 28, 2007 @ 11:03 am GMT Sep 28,2007 | Reply

  6. Unai,

    The “suspend/resume” may be a directive from the hardware supplier; or a suggestion from Oracle for dealing with a specific hardware supplier.

    As you say, the suspend/resume does not appear to be necessary from Oracle’s perspective – but it may be a feature of the layer below Oracle that all Oracle I/O has to be stopped briefly so that the snapshot can be started.

    Comment by Jonathan Lewis — October 1, 2007 @ 10:45 am GMT Oct 1,2007 | Reply

  7. Oh!!! So informative.

    Comment by Asif Momen — October 2, 2007 @ 11:34 am GMT Oct 2,2007 | Reply

  8. Here’s an example of certain EMC products for which SUSPEND/RESUME is necessary.
    Unfortunately, the White Paper is no longer available on EMC’s site (probably all their White Papers now need an account that is available only to customers paying for support).

    See http://www.freelists.org/archives/oracle-l/08-2004/msg01472.html

    Comment by Hemant K Chitale — October 31, 2007 @ 3:34 pm GMT Oct 31,2007 | Reply

  9. On the matter of DBWR and the coalescing of buffers.

    In my testing I’ve noticed that when you explicitly issue a checkpoint, or a fast object checkpoint (through a drop table or truncate), dbwriter *does not* do buffer coalescing.

    However, on normal ageing-out it does.

    This is a bit unfortunate, as it means that on a DBWR-busy system, issuing “alter system checkpoint” will actually slow the system down.

    P.S.
    That is with async io enabled and ASM.
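    For reference, the two operations compared above were of the form (table name hypothetical; the truncate is what triggers the fast object checkpoint):

        alter system checkpoint;   -- full checkpoint: no coalescing observed
        truncate table t;          -- fast object checkpoint: no coalescing observed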

    Comment by Christo Kutrovsky — November 26, 2007 @ 5:02 pm GMT Nov 26,2007 | Reply

  10. Dear Jonathan,

    With a database using RMAN for backup – and also using “Block Change Tracking” – how do we know how many blocks will need to be backed up at a given point after a level 1 incremental backup?

    Thank you for your time.
    K.P

    Comment by K.P — July 3, 2013 @ 6:45 am GMT Jul 3,2013 | Reply

  11. Hello Jonathan,

    A question about DBWR write operations: when a DBWR process is writing blocks from the SGA (regardless of coalescing), are the blocks copied to another area (such as the PGA) when making the OS call to write? If so, why do we see write complete waits? Once the blocks had been copied they could be made writable again, without having to wait for the I/O to complete.

    Thanks,
    Naresh

    Comment by Naresh — November 21, 2013 @ 1:15 am GMT Nov 21,2013 | Reply

    • Naresh,

      There’s always scope for change, but basically the answer to your question is no (don’t create work that can be avoided). The database writer pins the block exclusively in the buffer cache to write it.

      Having said that, there may be a relatively new mechanism that allows multiple block writes to be coalesced by Oracle, so perhaps there are cases where the database writer sees blocks in a write batch that are consecutive blocks from the same file and copies them into a separate area in the SGA before writing them. I’m basing this guess on the observation of an entry in v$sgastat for a shared pool structure named “dbwriter coalesce buffer” – but my guess may be wrong.

      Comment by Jonathan Lewis — November 29, 2013 @ 9:44 am GMT Nov 29,2013 | Reply

      • Thank You for your response Jonathan.

        I have seen the write complete waits get quite severe on databases at times when multiple parallel queries are started (say at the start of an “end of day” batch job schedule), due to the DBWRs getting busy with all the object checkpoints. Depending on how much DML the SGA has absorbed in the recent past, the DBWR writes can exceed 1,000 blocks per “db file parallel write” call (judging from the p1 or p2 values at these times). The write complete waits in turn can lead to “buffer busy waits” on the index blocks of a table with a sequence-based key (say a “transaction” table that is populated by all OLTP activities).
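        A sketch of the kind of check involved – p1 and p2 are the event-specific parameters of the wait interface, visible (for example) in v$session_wait:

            select sid, event, p1, p2, p3
            from   v$session_wait
            where  event = 'db file parallel write';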

        It seems that copying into a PGA area in all cases – not just the coalesce buffer scenario – would be a comparatively simple operation (compared to all the latch/queue-related logic that is otherwise part of accessing any SGA block) and the benefit would be considerable. Even at 1,000 8KB blocks that’s only about 8 MB of memory (plus some housekeeping overhead) – so not that much.

        Just a thought of course.

        Comment by Naresh — November 29, 2013 @ 12:56 pm GMT Nov 29,2013 | Reply

        • Naresh,

          It might be worth reviewing exactly why the “write complete wait” waits are happening. If your solution is to copy the block into the PGA (of DBWR, presumably) before writing it then you’re just making another copy of the block – and Oracle already has an optional action that reports statistics on “write clones created in background” / “write clones created in foreground”, so perhaps there’s a limiting reason why copying is not considered possible in all cases. (Of course these stats might only be relevant to “write clones created for recovery” – I haven’t checked.)

          It can’t be a PGA copy, by the way, because you can have multiple DB Writers, so the copy would have to be to a shared memory area.
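          A quick way to watch those statistics, for reference (the names are exactly as quoted above):

              select name, value
              from   v$sysstat
              where  name like 'write clones created%';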

          Comment by Jonathan Lewis — December 10, 2013 @ 9:53 am GMT Dec 10,2013

        • Interesting about the “write clones %” statistics – I’d never paid attention to them. I see they are zero on most databases I work on, but they do show numbers on some. I will pay attention to these the next time the write complete waits occur. Given the write clone mechanism there should potentially be no write complete waits at all, if I understood it right (though I notice you said it is an “optional action” in your comment).

          Comment by Naresh — December 13, 2013 @ 5:16 pm GMT Dec 13,2013

        • Naresh,

          Just one other thought – even if a process were to copy blocks to a different memory location before writing them it would still have to pin them in exclusive mode – which would normally cause a conflict called “buffer busy waits”, but which in the case of DBWR could still be called “write complete waits”.

          Comment by Jonathan Lewis — December 13, 2013 @ 5:24 pm GMT Dec 13,2013

        • Jonathan,

          Why would the block need to be pinned once it is copied? The process would have all the information it needed to complete the write, independent of the SGA buffer. Even if the SGA buffer were updated after that (and prior to completion of the write), what harm would it cause?

          Comment by Naresh — December 14, 2013 @ 2:18 pm GMT Dec 14,2013

        • Naresh,

          The process doing the copy would have to pin the original exclusively to make sure it wasn’t changed while the copy was taking place. Copying to memory or writing to disk are (very loosely) the same action, just operating at different speeds.

          Bear in mind that if you did implement copying to memory you’d have to implement a mechanism that allowed another process to modify a copied block and then forced a LOCAL write – which would mean the write process would have to pin it while checking that there wasn’t a copy in the write batch waiting to be written.

          Comment by Jonathan Lewis — December 14, 2013 @ 2:51 pm GMT Dec 14,2013

        • OK – I get that. Thanks Jonathan for your thoughts.

          Comment by Naresh — December 14, 2013 @ 6:33 pm GMT Dec 14,2013

