Oracle Scratchpad

November 21, 2006

Free Buffer Waits

Filed under: Execution plans,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 8:53 am GMT Nov 21,2006

I think that many DBAs are likely to be familiar with the event “free buffer waits” and recognise it as an indication that DBWR (the database writer) is not clearing the buffer cache of dirty blocks fast enough. But that’s not the only cause of free buffer waits.


If a session needs to load a block from disk or make a consistent read clone of an existing buffer, it needs to be able to empty another buffer and re-use it. But if all the buffers are either dirty or pinned the session has to wait for a buffer to become available – either when the database writer cleans some buffers (by copying them to disc) or when some pinned buffers are finally released. (See, for example, this note from Steve Adams – written in Feb 2000, before 9i went on general release – for further comment.) Given the size of buffer caches these days, and the limit on how many buffers a single session may pin, the “free buffer waits” event is more likely to indicate a database writer problem than anything else.
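The scale of the problem is easy to check before digging any deeper, because the summary figures for the event sit in v$system_event. A minimal sketch (time_waited is reported in centiseconds):

select  event, total_waits, time_waited
from    v$system_event
where   event = 'free buffer waits';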

However, I came across a surprise variation on this problem a few weeks ago (on a 9i system). On a totally repeatable test run of a (very busy) web-based application, my client was reporting excessive time lost on “free buffer waits” – even though he could see from a simple query against the view v$bh (the buffer headers) that the db cache was half empty before, during, and after the test.

So for the next test run I ran my snapshot code against x$kcbwds (the “working data sets” object) to see how localised the waits were. The snapshot is (roughly) the following query:

select set_id, fbwait, wcwait, bbwait from x$kcbwds;

The free buffer waits (fbwait) were spread evenly across just 24 adjacent sets out of the entire list of working data sets – and a quick check of v$buffer_pool showed that those 24 working data sets made up the recycle cache.
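If you want to repeat that check, the mapping from working data sets to buffer pools can be read straight from v$buffer_pool – a minimal sketch:

select  name, block_size, set_count, lo_setid, hi_setid
from    v$buffer_pool;

Any set_id between the lo_setid and hi_setid of the RECYCLE row belongs to the recycle cache.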

So, one step closer to understanding the problem, I asked the client which objects had storage clauses that assigned them to the recycle cache. The answer was none. Naturally I queried dba_segments to check this – since something clearly was getting into the recycle cache – and sure enough, there really were no objects defined to use the recycle cache.
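For reference, the check itself is trivial – something like:

select  owner, segment_name, segment_type
from    dba_segments
where   buffer_pool = 'RECYCLE';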

So I whipped up a little query to find out what was in the recycle cache – the SQL, execution plan, and a couple of comments follow. This isn’t a script that I would advise you to use on a production system because it’s just a quick hack I put together for a test system – and I haven’t worked out whether (a) it’s actually completely correct and (b) there are any consequential side-effects.

rem     What objects and blocks are in the recycle pool

select
        /*+ ordered use_hash(xbh) cardinality(wds 24) */
        wds.set_id, xbh.dbablk, xbh.obj, xbh.state
from
        x$kcbwds        wds,
        x$bh            xbh
where
        wds.set_id between (
                        select min(bp_lo_sid) from x$kcbwbpd where bp_name = 'RECYCLE'
                )
        and     (
                        select min(bp_hi_sid) from x$kcbwbpd where bp_name = 'RECYCLE'
                )
and     xbh.set_ds = wds.addr
;

set doc off
doc
        Expected execution plan - the hash join with x$bh as
        the probe table is probably the only sensible option.
        Note - the calls to check x$kcbwbpd are actually running
        as child operations to a "hidden" FILTER on x$kcbwds
#

Execution Plan
0      SELECT STATEMENT Optimizer=CHOOSE (Cost=44 Card=24 Bytes=1848)
1    0   HASH JOIN (Cost=44 Card=24 Bytes=1848)
2    1     FIXED TABLE (FULL) OF 'X$KCBWDS' (Cost=22 Card=24 Bytes=456)
3    2       SORT (AGGREGATE)
4    3         FIXED TABLE (FULL) OF 'X$KCBWBPD' (Cost=22 Card=1 Bytes=25)
5    2       SORT (AGGREGATE)
6    5         FIXED TABLE (FULL) OF 'X$KCBWBPD' (Cost=22 Card=1 Bytes=25)
7    1     FIXED TABLE (FULL) OF 'X$BH' (Cost=22 Card=100 Bytes=5800)


The results of this query were very interesting – and I’ve reproduced a few lines from the output below. For reference, state=3 decodes to ‘CR’ (consistent read).

    SET_ID     DBABLK        OBJ      STATE
---------- ---------- ---------- ----------
        49      52674    1020720          3
        49      52674    1020720          3
        49      52674    1020720          3
        49      52675    1020720          3
        49      52675    1020720          3
        49      52675    1020720          3
        49      52676    1020720          3
        49      52676    1020720          3
        49      52677    1020720          3
        49      52680    1020720          3
        49      52680    1020720          3
        14          9 4294967295          3
        14          9 4294967295          3
        14          9 4294967295          3
        14          9 4294967295          3
        14          9 4294967295          3
        14       1289 4294967295          3
        14       1289 4294967295          3
        14       1289 4294967295          3

Object 1020720 was the most popular data segment in the entire system – which just happened to be allocated to the keep cache, not the recycle cache. As you may have guessed from the block numbers, the blocks from the undo segments (identified by obj = 4294967295) were all segment header blocks. You’ll also notice that there are multiple CR blocks for nearly every block I’ve reported.
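If you want to translate an obj value from x$bh back to a named object, you can look it up against the data_object_id in dba_objects – a sketch using the value from my output (the id you see will, of course, be system-specific):

select  owner, object_name, object_type
from    dba_objects
where   data_object_id = 1020720;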

Somewhere I once wrote a note about Oracle’s intention to create CR copies in the recycle cache for objects assigned to the keep cache in order to protect the keep cache from excessive garbage. But at the time I pointed out that I had never seen this feature working. Finally, I have an example of this mechanism in action; and the trick even includes CR clones of undo segment header blocks (the blocks you have been generating whenever you see the statistic “transaction tables consistent read rollbacks” climbing in v$sysstat).

My interpretation of the waits, then, was this: because the recycle cache was there it was being used as a dumping ground for CR blocks of a few special objects. But because the system was so busy (and the recycle cache so small) the system regularly got into a state where a session needed to create a CR block for one of the special objects at a moment when all the buffers in the recycle cache were still pinned (they’re CR blocks, so never dirty, hence never written). So the session waited for a buffer to be unpinned rather than creating the CR copy in the keep or default caches.

It would have been nice to run further checks on this hypothesis – were there any CR copies of the critical objects in the keep or default caches, for example, and if so did their flags differ at all from those in the recycle cache. Unfortunately there was no time for further investigation. (Oracle can copy the current block to a new buffer and mark the older block as the CR copy, for example – an event recorded under statistic “switch current to new buffer” and potential source of extra redo log writes and “gc cr block flush time” in RAC.)

The client had created a small recycle cache “just in case” and didn’t really need it, so we got rid of it and re-ran the test suite. Inevitably (well, almost inevitably) there were no free buffer waits recorded.

So, in a busy 9i environment, free buffer waits might be a sign of DBWR failing to keep up – but it’s just possible that they might be an indication of some sort of conflict between the recycle cache and volatile objects in the keep cache. It’s worth a quick check of the buffer pool statistics (v$buffer_pool_statistics for the ‘official’ view) to see which caches the waits appear in.
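The ‘official’ check is a query along these lines, which breaks the waits down by pool:

select  name, block_size, free_buffer_wait, write_complete_wait, buffer_busy_wait
from    v$buffer_pool_statistics;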

Update Mar 2011:

Following on from a question about the KEEP pool that appeared on OTN, I’ve repeated a few of my tests on more recent versions – and there have been some significant changes. In the version I happened to test (though the change might have come in much earlier, maybe even 10.1) Oracle is able to use the DEFAULT cache to create CR copies of blocks in the KEEP cache if you haven’t created a RECYCLE cache. In later versions the number of cases where CR blocks go into the DEFAULT cache has increased.

However, when you update a block, Oracle may clone the block to create a new CURrent copy and re-label the previous copy as a CR copy – in this case you will still end up with CR blocks in the KEEP cache.


  1. Interesting indeed. Thanks for posting it. It’s really not a trivial case.

    Comment by Alex Gorbachev — November 21, 2006 @ 4:08 pm GMT Nov 21,2006

  2. This is *exactly* the same situation we ran into recently (I think I may even have emailed you early on in the investigation).

    Do you see a lot of dirty blocks (possibly excessive to the # of total blocks of the object)?

    Your blog is excellent. It’s like a daily DBA class and I read it religiously.

    Comment by RobH — November 21, 2006 @ 4:17 pm GMT Nov 21,2006

  3. Fantastic spelunking!

    Comment by Jim — November 21, 2006 @ 4:18 pm GMT Nov 21,2006

  4. RobH, I’ve just dug out your original email. There is an overlap in the phenomena, but your initial question was about the appearance of 7,000 “dirty” CR buffers for an object that consisted of only 64 blocks.

    Although Oracle has had a notional limit of 6 copies of a CR block set by the _db_block_max_cr_dba parameter for many years, there have always been cases where this limit gets ignored – sometimes by an extreme margin – especially when the buffer cache is large, and often in highly concurrent systems that make use of the ‘select for update’ syntax.
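    If you want to look for blocks that have blown through that limit, a rough sketch against x$bh follows (state = 3 decodes to ‘CR’) – again, not something to run casually on a big production system:

    select   obj, file#, dbablk, count(*) cr_copies
    from     x$bh
    where    state = 3
    group by obj, file#, dbablk
    having   count(*) > 6
    order by count(*) desc;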

    I don’t think I heard the resolution of your problem, but my impression was that you were seeing a very large trail of CR copies because the limit wasn’t being enforced – possibly due to the action that’s described in v$statname as “switch current to new buffer”.

    The fact that the CR copies left behind on these switches did not get their dirty bits cleared was the really buggy feature; but unless this error stopped the buffers from being re-used, it shouldn’t have been an issue (except for the misleading impression it may have given to any attempts to analyse the buffer cache).

    Comment by Jonathan Lewis — November 21, 2006 @ 4:57 pm GMT Nov 21,2006

  5. especially when the buffer cache is large, and often in highly concurrent systems that make use of the ’select for update’ syntax.

    This describes the situation to a “T”. I don’t pretend to “get” all of Oracle’s internals, but I am trying (I often have to read, re-read, look up docs, and read your articles).

    In the end, there is nothing wrong with the db other than a very large buffer maintaining a large # of copies of data blocks.

    It was more of an anomaly that I was trying to understand (trying to get to know the db).

    Comment by RobH — November 21, 2006 @ 7:33 pm GMT Nov 21,2006

  6. Superb catch, Jonathan. Thanks a lot: it just solved a problem that has been baffling me for months!

    Comment by Noons — November 21, 2006 @ 9:57 pm GMT Nov 21,2006

  7. Nice one. I’m about ready to update my CMG multiple buffer pools paper for 10g. I’ll be looking for this one since the workloads I use wreak havoc with CR cloning.

    Comment by kevinclosson — November 22, 2006 @ 4:30 pm GMT Nov 22,2006

  8. […] If you’ve read my note on free buffer waits you will have seen a comment that, in x$bh, objects with obj (data object id) values in the region of 4 billion were undo segments. There is another numbering anomaly you need to be aware of.  […]

    Pingback by Object IDs « Oracle Scratchpad — November 27, 2006 @ 4:16 pm GMT Nov 27,2006

  9. Hi Jonathan,

    Does a mismatch between the file system block size and the db block size cause free buffer waits? We all know that the db_block_size should be a multiple of the os block size.

    A typical scenario is that the file system block size is 8KB and the db block size is 4KB. It is on HP-UX. The database suffers with huge free buffer waits. The HP engineer suggests that the db block size should be either equal to or a multiple of the file system block size, and this means recreation of the database.

    My query is “can this cause free buffer waits” ?

    Would appreciate your reply on this.

    Comment by Viveshar — August 2, 2007 @ 8:53 am BST Aug 2,2007

  10. Viveshar,

    traditionally a high value for “free buffer waits” is an indication that the database writer is unable to clear dirty blocks fast enough. In the case of a 4KB db block size with an 8KB filesystem block size (and assuming you really are using the file system cache), an Oracle block write has to turn into a read/fill/write on an 8KB file system page. This will slow down your writes and could result in your “free buffer waits”.

    Before rebuilding your system, however, I would check the following:
    * is there any excess I/O (read or write) that could be eliminated (e.g. by tuning SQL)
    * could you switch critical file systems to using direct I/O (which may have a different name on HP-UX).
    * would introducing multiple (or extra) database writers help – which could be relevant if the I/O subsystem is not overloaded, and the problem is simply one of Oracle I/Os getting delayed in their queue.

    If you are running 9i, and felt you really had to go to an 8KB block size, you could try to work out where most of your I/O was taking place and move the critical data into tablespaces with an 8KB block size, rather than rebuilding the whole database. On the plus side, you could do the change over a period of time; on the minus side you have to think carefully about how you split the memory between the default (4KB) cache and the 8KB cache.
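    In outline the 9i mechanics are simple: create a cache for the second block size, then create tablespaces of that size and move the critical segments. A sketch with purely hypothetical names and sizes:

    alter system set db_8k_cache_size = 128m;

    create tablespace hot_data_8k
            datafile '/u01/oradata/hot_data_8k_01.dbf' size 512m
            blocksize 8k;

    alter table hot_table move tablespace hot_data_8k;
    alter index hot_table_pk rebuild tablespace hot_data_8k;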

    Comment by Jonathan Lewis — August 2, 2007 @ 2:44 pm BST Aug 2,2007

  11. Thanks Jonathan for your reply.

    Just got a link of a posting by Steve Adams on this same subject. As you mentioned, even Steve says the same that this mismatch in block sizes can cause “Free Buffer Waits”. In his posting, he says that this is caused as DBWR writes Partial Blocks. I am of the opinion that since DBWR writes in Batch and not single blocks, how can this cause a performance issue.

    In your posting above, you said that “Oracle block write has to turn into a read/fill/write on a 8kb file system page”. Would appreciate, if you can elaborate on this read/fill/write concept.


    Comment by Viveshar — August 3, 2007 @ 7:29 am BST Aug 3,2007

  12. Viveshar,
    If I write a 4KB block, but the file system uses a unit size of 8KB, the write operation has to read the correct 8KB off disc into memory, write my 4KB into the right place in that 8KB, then write the 8KB block back to disc.

    Hence the read/fill/write.

    Even if DBWR happened to write two Oracle blocks that fitted into one 8KB file system block, the file system code would probably still see this as two separate events, and go through the cycle twice (although it would find the file system block cached when it started to write the second Oracle block).

    Comment by Jonathan Lewis — August 11, 2007 @ 7:37 am BST Aug 11,2007

  13. Thanks Jonathan.

    Nice explanation.

    Comment by Viveshar — August 11, 2007 @ 5:37 pm BST Aug 11,2007

  14. Excellent sir! Really a good learning!

    Comment by Aman Sharma — December 27, 2007 @ 5:12 pm GMT Dec 27,2007

  15. Thanks for the nice explanation.

    Please help to elaborate more on CR Cloning, Thanks.

    Comment by Sanjeev Chawla — February 17, 2008 @ 3:40 pm GMT Feb 17,2008

  16. Sanjeev,

    “CR Cloning” – to see the correct version of a block (as at a point in time), Oracle may have to copy it then apply undo records to take the copy back in time.

    When you see the statistic “consistent gets”, some will be “no work – consistent read gets” which means that this action was not necessary, but if you see some “data blocks consistent reads – undo records applied” you know that some blocks have been cloned to create the appropriate version. (The statistic “transaction tables consistent reads – undo records applied” is a more exotic example of the same thing).
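    The relative frequencies are easy to check – a minimal sketch:

    select  name, value
    from    v$sysstat
    where   name in (
                'consistent gets',
                'no work - consistent read gets',
                'data blocks consistent reads - undo records applied',
                'transaction tables consistent reads - undo records applied'
            );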

    Comment by Jonathan Lewis — February 17, 2008 @ 8:32 pm GMT Feb 17,2008
