Oracle Scratchpad

OC 5 Caches and Copies

Addenda and Errata for Oracle Core Chapter 5 Caches and Copies

Back to Index

Addenda

p.94  Second paragraph, I make the comment: “…it’s possible that there are other variations on limits and granule sizes that I haven’t yet come across.” Charles Hooper has added a few more variations in a comment he made on the Glossary, which links to a note he published in May 2011.

Errata

p.94 first para, first line: “automatic system memory management” should be “automatic shared memory management”.
p.101 The Note: “set parameter _disable_cpu_check to false” should be “set parameter _disable_cpu_check to true.
p.103 Figure 5.4: In the first of the two boxes labelled x$kcbwds, the far right-hand entry should be prv_replax rather than prv_repl.
p.106 Penultimate paragraph: I say that the touch count is halved as a buffer moves from the cold end of the LRU to the hot end. This is incorrect, the touch count drops to zero. (I have some notes from the very early days of the LRU/TCH (i.e. 8.1) which say I saw the halving effect, but even in 8.1.7 the parameter _db_aging_stay_count exists, is set to zero, and has the description “Touch count set when buffer moved to head of replacement list”. Of course, there are many cases where zero means “use a default internal value or action”, but it seems unlikely in this case.
p.107 Note at top of page: “changes the touch count on that buffer to zero”, should be “changes the touch count on that buffer to 1″.
p.110 Second para: “(a) every buffer header would be on one of the (repl) lines”. This is a simplification (i.e. it’s not true) aimed at highlighting the basic function of the LRU lists and latches. I should have included some sort of warning or forward reference to the discussion of the other linked lists such as the write lists that are described in ch. 6. See Martin’s comment below
p.114 Section “Logical I/O” 3rd Para (At this point…), second line: “consistent get – examination” should be “consistent gets - examination”
p.115 Section “Updates”, 2nd para, second line: buffer busy wait should be buffer busy waits wait.Section “Updates”, last para, penultimate line: “… older clones on to the replacement list if you …” should be “… older clones on to the auxiliary replacement list if you …”
p.115 Section “Loading a Hash Chain” says: “As I commented in Chapter 4, a process is not allowed to request a latch in willing-to-wait mode if it is already holding a lower-level latch.” This statement is the wrong way round (see note below from Sid)- you cannot request a latch in willing-to-wait mode if you are already holding a higher level latch; this error makes the subsequent comments about the complexity involved in dropping and re-acquiring cache buffers chains latches irrelevant – you don’t have to drop the latch.There is another odd error in the same sentence – I don’t say anything about the latch level and it’s use in controlling the order in which willing-to-wait gets can be made.
p.118 2nd para after the Note: “If the optimizer thought …” should be “If the runtime engine thought …”

Typos/Grammar/Style/Punctuation

Back to Index

63 Comments »

  1. Hi Jonathan,

    On page 3 of Chapter 5,you had mentioned the following:
    “Note: Buffer headers are tied very closely to buffers – there is a permanent one-to-one link between a row in the array of buffers and the corresponding row in the array of buffer headers”.

    does it mean that the total number of rows in x$bh is same as the total number of blocks cached in memory (_db_block_buffers)?
    In my 11gr2 database on windows 32-bit OS with the sga_target=252M,I am getting the following result.

    SQL> select count(*) from x$bh;

    COUNT(*)
    ———-
    13218

    NAME VALUE DESCRIPTION
    ——————————————— —————————— —————————————————
    _db_block_buffers 15376 Number of database blocks cached in memory: hidden parameter

    Thank you.
    Antony

    Comment by orapsdbaAntony — September 29, 2011 @ 6:42 pm BST Sep 29,2011 | Reply

    • Antony,

      In general (at steady state) I would expect the number of rows in x$bh to match the number of blocks that could be cached – but (a) I wouldn’t trust _db_block_buffers to be telling the truth, and (b) there are transitional states where I’m not 100% sure how Oracle would behave. One obvious transition is in the interval after startup (which I do know about), another is during a deferred resize operation when (for example) you might see data buffer space allocated in the shared pool as “KGH: No Access” entries.

      To give you some ideas of cross-checking: shortly after starting up an 11.2.0.2 instance, when I can see that _db_block_buffers = 30256, I get the following results:

      SQL> select count(*) from x$bh;
      
        COUNT(*)
      ----------
            7449
      
      
      SQL> select set_msize from v$Buffer_pool_statistics;
      
      SET_MSIZE
      ---------
          28768
      
      SQL> select cnum_repl, anum_repl from x$kcbwds where cnum_repl != 0;
      
       CNUM_REPL  ANUM_REPL
      ---------- ----------
           14384      10633
           14384      10644
      
      SQL> select current_size, granule_size, current_size/granule_size
        2  from v$memory_dynamic_components where component = 'DEFAULT buffer cache'
      ;
      
      CURRENT_SIZE GRANULE_SIZE CURRENT_SIZE/GRANULE_SIZE
      ------------ ------------ -------------------------
         243269632      4194304                        58
      
      

      You can see that x$bh currently doesn’t seem to have an entry for every buffer in the buffer pools, and v$buffer_pool_statistics does agree with the sum of cnum_repl from the working data set, and the working data set shows you that the value in x$bh matches the sum of cnum_repl – anum_repl (i.e x$bh is currently not showing any rows for buffers which are in the auxiliary replacement list, which – at this moment in time – is the buffers that have not yet been used at all).

      As a follow-on, the granule information shows that there are 58 granules in the default buffer cache – and 30,256 does not divide exactly by 58, so there is clearly something wrong with the way _db_block_buffers is derived.

      Comment by Jonathan Lewis — September 29, 2011 @ 9:22 pm BST Sep 29,2011 | Reply

      • Hi Jonathan,

        Thank you very much!
        I am amazed by the quick response considering your very tight schedule.
        I am eagerly waiting (of course with lots of doubts) to meet you in person on 6th of October when you will be presenting “Indexing Strategies” at NYOUG

        Antony

        Comment by Antony — September 29, 2011 @ 11:44 pm BST Sep 29,2011 | Reply

  2. Hi Jonathan,

    In the 5th chapter – 23 page , in the figure -5.6 (Oracle buffer Cache)
    How could I find the Granule and working list ?
    What is Granule and working list in the figure ?

    Many thanks

    Comment by berusadla — October 10, 2011 @ 1:25 pm BST Oct 10,2011 | Reply

    • Berusadla,

      Remember that a “working data set” is a linked list that cuts across the set of granules that makes up a buffer pool (see figure 5-3), and note that I made the point just below figure 5-6:

      “If I wanted to superimpose the replacement lists on Figure 5-6, they would appear as two tangled lines weaving their way randomly through the buffer headers …”

      It would be virtually impossible to draw a readable picture of a granule or a working data set on top of figure 5-6. In effect a granule would be a random collection of buffers (or buffer headers) from that figure, and a working data set would be part of that random set, and part of another random set representing a different granule, and so on.

      The key point to remember is that ANY buffer (header) from ANY working data set, granule or buffer pool could at some time be attached to a given cache buffers chains – and just to make the point explcitly, a single cache buffers chain could hold buffers for blocks of different block sizes.

      Comment by Jonathan Lewis — November 13, 2011 @ 11:40 am BST Nov 13,2011 | Reply

  3. Hi Jonathan,

    Granules - a physical chunk of memory – which has an array of buffer headers and buffer pointed by the buffer headers.
    Working list (logical) -> is a set of buffer headers from multiple Granules.

    This working list (set of buffer headers) is ORGANISED as 2 lists
    1.Replacement list
    2.Auxiliary replacement list

    Up to this clear –
    You said in the ‘summary’ —
    “There is a second structure imposed on the content of the data cache, again employing
    linked lists (but very short ones), that uses the data block address of the buffered blocks to
    scatter the buffers across a very large hash table”

    1.How HASH table and Working Set are related ??
    2.what are these linked lists ..?? Are they nothing but ‘Replacement /Auxiliary Replacement)?

    Many thanks for your time

    Comment by berusadla — October 10, 2011 @ 4:31 pm BST Oct 10,2011 | Reply

    • Sorry Jonathan, I went through your Chapters 4 and 5 completely couple of times and understood now , Please ignore the above questions and sorry for the troubles. But may I request to you to correct if my understanding (from your book) is wrong : abrief summary :

      1.SGA is a collection of GRANULES (a chunk of memory) – this GRANULES were introduced for the efficient move of memory among all the areas of SGA.
      2.DB cache is also made up of these GRANULEs which are linked through POINTERS .. each GRANULE contains aan array of BUFFER HEADERS and also array of Buffers separately

      3.To avoid the contention, heavy traffic .. to make things FASTER – a concept of Working Data set is introduced – this lies across the Granules.. basically this is a Linked list of Buffer Headers from multiple Granules and one DBWriter process is allocated to this WDS.

      4.Copying into memory(buffer), Writing buffer to Disc are done efficiently with help of working Dataset /DWriter process but How to find a block is already in memory – any technique to FIND ??? here it comes HASHing technique.. HASH algorithm , a tiny data structure attached to the buffer linked lists which helps us to TRACE the block if it is present in memory very QUICKLY ..

      Many thanks Jonathan for EXCELLENT BOOK ( I don’t say it is ONE OF the EXCELLENT BOOKS – but I say No 1-EXCELLENT book – that’s it)

      I already recommended this book for 5 of my friends and they bought it already 

      Comment by berusadla — October 12, 2011 @ 3:36 pm BST Oct 12,2011 | Reply

      • Berusadla,

        I’m busy trying to catch up with all the comment from the last 6 weeks or so and didn’t realise that you had answered your own questions.

        Your understanding is basically correct, but for point 4, I wouldn’t call the hash table a “tiny” data structure. The hash table has (roughly) twice as many buckets are there are buffers, and every buffer that is in use is associated with a hash bucket; if there are several buffers associated with the same hash bucket they are connected together through a (short) linked list.

        Comment by Jonathan Lewis — November 13, 2011 @ 11:51 am BST Nov 13,2011 | Reply

    • Berusadla,

      The hash table and the working data sets are not related – see my answer to your previous comment.
      The hash table is the core memory structure of the cache buffers chains – each chain is based on a single hash bucket in the hash table.
      The linked list are the very short cache buffers chains.

      Comment by Jonathan Lewis — November 13, 2011 @ 11:43 am BST Nov 13,2011 | Reply

  4. Jonathan,

    I might have found a couple of additional items for the errata list.

    * The first sentence on page 94 states, “Oracle Corp. introduced automatic system memory management (ASMM) in 10g…” The Oracle documentation (http://docs.oracle.com/cd/E11882_01/server.112/e16638/memory.htm#i58728) and seemingly most other resources that expand the ASMM acronym indicate that the ASMM acronym is short for Automatic Shared Memory Management. Is this a typo in the book? You stated “system” rather than “shared” in an OTN thread on Sep 19, 2011, so I might be mistaken https://forums.oracle.com/forums/thread.jspa?threadID=2285548 ).

    * In the Note section on page 101, the book describes how the CPU_COUNT parameter can be changed to determine the effect on memory allocations as the number of CPUs in the server changes. The note in the book states, “… however, in 11.2 you also need to set parameter _disable_cpu_check to false.” In 11.2.0.2 it appears that the _DISABLE_CPU_CHECK parameter defaults to FALSE, and (because the parameter is prefixed with DISABLE) that the note in the book should state, “… however, in 11.2 you also need to set parameter _disable_cpu_check to TRUE.”

    Comment by Charles Hooper — December 18, 2011 @ 5:43 pm BST Dec 18,2011 | Reply

    • Charles,

      Thanks, correct on both counts. I’ve added both to the errata.
      The system/shared switch is one of those little brain glitches that I manage to repeat fairly frequently. Luckily I did get it right in the glossary.

      Comment by Jonathan Lewis — December 19, 2011 @ 12:46 pm BST Dec 19,2011 | Reply

      • The majority of the book is a bit beyond my knowledge of Oracle Database internals – if I stand on a 30 feet (9 meter) tall step ladder, some of the material still manages to go over my head. The book will require a second (or third) read through for full comprehension. In short – thanks for pushing the boundaries for “how it works”.

        0 detected logic errors with a couple of accidental word substitutions in the first 130 pages is significantly better than most (maybe all) of the other books that I have read in the last couple of years.

        Comment by Charles Hooper — December 19, 2011 @ 1:15 pm BST Dec 19,2011 | Reply

  5. A couple of other very minor potential issues:

    * On page 114, the third paragraph under the heading Logical I/O states: “At this point we might be able to examine the block and drop the latch immediately. This action is reported in the instance statistics as a consistent get – examination;…”. It appears that the letter S was accidentally dropped from the word get in the statistic name (pages 44 and 51 also mention the consistent gets – examination statistic).

    * On page 115, the second paragraph under the heading Updates states: “… pin it and we might have to attach our pin to the waiters’ list and post a buffer busy wait.” This sentence is understandable, but might have been intended to state, “and post a buffer busy waits wait event.”

    * On page 118, the sentence in the middle of the page begins, “If the optimizer thought the table was larger than the 2 percent limit, then the buffers used to read the table were immediately dropped to the end of the LRU list as the blocks were read…” The previous paragraph describes how the runtime engine behaves when a table smaller than 2 percent of the buffer cache is full scanned. It appears that the word optimizer should be replaced with runtime engine.

    Comment by Charles Hooper — December 18, 2011 @ 9:18 pm BST Dec 18,2011 | Reply

    • Charles,

      Thanks again – correct on all three counts (and I found two more occurrences of the missing S in chapter 3).
      Now added to the errata.

      Comment by Jonathan Lewis — December 19, 2011 @ 12:44 pm BST Dec 19,2011 | Reply

  6. Jonathan,

    At the end of the section “Updates”
    —quote—
    So, to avoid getting too much garbage into the cache, and to keep the length of the hash chain short, Oracle will start switching the older clones on to the replacement list if you continue updating the block.
    —end of quote—

    In my understanding, the exceeding CR blocks should be moved from the main replacement list to auxiliary replacement list, to mark it free for reuse, right?

    I do a little test on my 10.2.0.4 on OS X to verify: the default value of _db_block_max_cr_dba is 6. I keep update the block 10 times, and max cr versions on the hash chain is 5, just as you mentioned. when the current version is switched to CR versions, the TCH is increased from 1 to 2.

    drop table t purge;
    
    create table t
    as select object_id, object_name from all_objects where rownum <= 10;
    
    exec dbms_stats.gather_table_stats(user,'T');
    
    exec snap_stats.start_snap;
    exec snap_stats.start_snap;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    set serveroutput on
    exec snap_stats.end_snap;
    
    sys@CS10G> @rowid oe.t rownum=1;
        RFILE#     BLOCK#
    ---------- ----------
             6     141450
    
    sys@CS10G> @dba3 6 141450
    
    STATE      BLOCK_CLASS        OBJECT_TYPE         object          TCH  MODE_HELD D T P S D FLG_LRUFLG
    ---------- ------------------ ------------------- -------- ---------- ---------- - - - - - ----------
    xcur       data block         TABLE               OE.T              1          0 Y N N N N 2002001:0
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2 
    
    6 rows selected.
    
    sys@CS10G> select n.ksppinm p_name, c.ksppstvl p_value
      2  from sys.x$ksppi n, sys.x$ksppcv c
      3  where n.indx=c.indx
      4  and lower(n.ksppinm) like lower('%&1%');
    
    NAME                                     VALUE
    ---------------------------------------- ----------------------------------------
    _db_block_max_cr_dba                     6
    
    

    Comment by Sid — December 31, 2011 @ 2:40 am BST Dec 31,2011 | Reply

    • Sid,

      Thanks for that. Yes, the text should have said: “… on to the auxiliary replacement list …”.
      As you’ve shown in your results, although the parameter suggests you can get 6 CR copies, I tend to see five CR and one current.

      Comment by Jonathan Lewis — January 3, 2012 @ 7:35 pm BST Jan 3,2012 | Reply

  7. Hi Jonathan,

    Here is some thoughts and testing about the Tablescan section.

    1.
    The threshhold is 10 percent, when table scan changes from (short tables) to (long tables), it’s not 2 percent, I guess the _small_table_threshhold just control whether you can do direct read on the table, it may have nothing to do with how to cache the data.

    2. When the data blocks less than 10 percent of the buffer pool, the blocks are cached and subsequent tablescan doesn’t need to do physical read. The TCH is incremented per access on 3 seconds interval. It’s expected.

    3. When the data blocks between 10 and 25 percent of the buffer pool, the blocks are cached and subsequent tablescan doesn’t need to do physical read, or just do several physical reads. But the subsequent tablescan is not marked as short tablescan and the TCH is not increases as you described. “however if you repeat the tablescan while the blocks are still in the data cache, the touch count on the buffers will be incremented at that point and the tablescan will be reported as a short tablescan”.

    4. When the data blocks slightly excced 25 percent of the buffer pool, the blocks are not cached and subsequent tablescan needs to do physical read. The TCH is not incremented as well, just as mentioned. The result and 25 threshhold match the testing by Dion Cho

    5. I try to cheat the CBO by hacking the numblks to number exceeding the buffer pool size, or to every small number, using dbms_stats.set_table_stats, but the runtime engine choose the right scan method according to the real data blocks. I guess the runtime engine will check the extent mapping table and high water mark on the segment header block, to get the exact data blocks under HWM.

    6. The test on tablespace ASSM show the same result, except the threshhold need to increased a bit,to include the blocks comes from 1st level bmp ans 2st level bmp. Automatic Shared Memory Management(ASMM) is disabled on my testing, the result with ASMM, or on different versions/platforms may vary.

    Here comes the test, 10.2.0.4.0 on OS X 10.6, db_cache_size around 96M.

    sys@CS10G> select * from v$version;
    
    BANNER
    ----------------------------------------------------------------
    Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
    PL/SQL Release 10.2.0.4.0 - Production
    CORE	10.2.0.4.0	Production
    TNS for MacOS X Server: Version 10.2.0.4.0 - Production
    NLSRTL Version 10.2.0.4.0 - Production
    
    
    sys@CS10G> select buffers, buffers/50 "2_percent", buffers/10 "10_percent", buffers/4 "25_percent" from v$buffer_pool;
    
       BUFFERS  2_percent 10_percent 25_percent
    ---------- ---------- ---------- ----------
         12350        247       1235     3087.5
    

    The code to do test, create table big_table and T. The big table is to fill up the buffer_cache; capture the statisitcs for tablescan on T, sleep 3 seconds between table scan on table T, see if the TCH incremented. Finally connect to sys, check the X$bh and dba_objects to check the TCH on data blocks.

    conn oe/oe
    set serveroutput on
    drop table T purge;
    drop table big_table purge;
    
    -- test on tablespace MSSM, without ASMM
    create table T
    pctfree 90
    pctused 10 
    tablespace test_8k
    as select 
    rownum n1,
    lpad(rownum, 1000,'0') padding
    from dual
    connect by level <= 247;
    
    --connect by level <= 247;	--2%,  table scan (small tables), blocks are cached,    TCH is incremented
    --connect by level <= 1235;	--10%, table scan (long tables),  blocks are cached,    TCH is not incremented
    --connect by level <= 3190;	--25%, table scan (long tables),  blocks are not cache, TCH is not incremented
    
    create table big_table
    pctfree 90
    pctused 10 
    tablespace test_8k
    as select 
    rownum n1,
    lpad(rownum, 1000,'0') padding
    from dual
    connect by level <= 12350;
    
    exec dbms_stats.gather_table_stats(user,'T');
    exec dbms_stats.gather_table_stats(user,'big_table');
    
    --flush the buffer_cache
    alter system flush buffer_cache;
    
    --fill up the buffer_cache
    select count(*) from big_table;
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    exec dbms_lock.sleep(3);
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    exec dbms_lock.sleep(3);
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    conn / as sysdba
    col dba_object head object for a5 truncate
    
    select block_class, object_type, dba_object, tch, count(*)
    from
    (select  /*+ ORDERED */
            decode(bh.class,1,'data block',2,'sort block',3,'save undo block',
                   4,'segment header',5,'save undo header',6,'free list',7,'extent map',
                   8,'1st level bmb',9,'2nd level bmb',10,'3rd level bmb', 11,'bitmap block',
                   12,'bitmap index block',13,'file header block',14,'unused',
                   15,'system undo header',16,'system undo block', 17,'undo header',
                   18,'undo block'
            ) block_class,
            o.object_type,
            o.owner||'.'||o.object_name             dba_object,
            bh.tch
    from
            x$bh            bh,
            dba_objects     o
    where
            bh.obj = o.data_object_id
    and     o.data_object_id in ( select data_object_id from all_objects where object_name='T' and owner='OE')
    )
    group by block_class, object_type, dba_object, tch
    /
    

    data blocks = 247; 2%, table scan (small tables), blocks are cached, TCH is incremented

      COUNT(*)
    ----------
           247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:52
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                         256
    physical reads                                          248
    physical reads cache                                    248
    table scans (short tables)                                1
    table scan blocks gotten                                247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:55
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                        256
    table scans (short tables)                               1
    table scan blocks gotten                               247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:58
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                        256
    table scans (short tables)                               1
    table scan blocks gotten                               247
    
    BLOCK_CLASS            OBJECT_TYPE objec        TCH   COUNT(*)
    ------------------ --------------- ----- ---------- ----------
    data block                   TABLE OE.T           3        247
    segment header               TABLE OE.T           3          1
    
    
    

    data blocks = 1235; 10%, table scan (long tables), blocks are cached, TCH is not incremented

      COUNT(*)
    ----------
          1235
    
    ---------------------------------
    System stats:-        31-Dec 15:43:56
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    physical reads                                                           1,236
    physical reads cache                                                     1,236
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    ---------------------------------
    System stats:-        31-Dec 15:43:59
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    ---------------------------------
    System stats:-        31-Dec 15:44:02
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    BLOCK_CLASS           OBJECT_TYPE objec        TCH   COUNT(*)
    ------------------ -------------- ----- ---------- ----------
    data block         TABLE          OE.T           0       1235
    segment header     TABLE          OE.T           3          1
    
    

    data blocks = 3190; 25%, table scan (long tables), blocks are not cache, TCH is not incremented

      COUNT(*)
    ----------
          3190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:41
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,201
    physical reads                                                           3,191
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:45
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,201
    physical reads                                                           3,190
    physical reads cache                                                     3,190
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:49
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,204
    physical reads                                                           3,190
    physical reads cache                                                     3,190
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    
    BLOCK_CLASS        OBJECT_TYPE         objec        TCH   COUNT(*)
    ------------------ ------------------- ----- ---------- ----------
    data block         TABLE               OE.T           0       3084
    segment header     TABLE               OE.T           4          1
    
    

    #

    Comment by Sid — December 31, 2011 @ 8:28 am BST Dec 31,2011 | Reply

    • Sid,

      I’ve done a little cosmetic work on your output.
      Your results don’t match the ones I published a few months ago in a note on short tables: http://jonathanlewis.wordpress.com/2011/03/24/small-tables/ but mine were from 10.2.0.3 and yours from 10.2.0.4, also I had ensured that the cache was fully populated with other objects before I started the test and this can make a difference (it’s possible you did the same, but you didn’t mention it), there’s also the possibility of internal “rounding errors” – the boundaries may be a little way off our expectations.

      Comment by Jonathan Lewis — January 3, 2012 @ 6:50 pm BST Jan 3,2012 | Reply

      • You are correct, I did not actually fill up the buffer pool in advance. A big fulll table scan on the *clean* buffer pool does not filled up the buffer pool as expected. I like you test case and repeat on 11.2.0.2.0, the result matches. I’ve Post the result at my blog: http://sid.gd/small-tables/

        Comment by Sid — February 17, 2012 @ 12:59 am BST Feb 17,2012 | Reply

  8. Jonathan,

    In the section Relinking a Buffer, the TCH is 1, so is in the Figure 5-5
    As the buffer cross the midpoint and becomes the target of the cold_hd pointer.

    in Note below above statement
    changes the touch count on that buffer to zero, and then modifies cold_hd to point to that buffer. << touch count should be 1 here?

    This is my third round on the book, and I still enjoy it :-)

    Comment by Sid — January 7, 2012 @ 2:36 pm BST Jan 7,2012 | Reply

    • Sid,

      I really ought to have had about 4 separate copies of that diagram so that I could show before and after images for the load and the promotion/push down. As it is, the buffer header you’re talking about needs a tch of 1 when it’s demonstrating the loading mechanism, but a touch count of zero when it’s demonstrating the promotion/pushdown effect.

      Comment by Jonathan Lewis — February 15, 2012 @ 2:17 am BST Feb 15,2012 | Reply

      • Hi. Jonathan.
        i don’t know, why reset to zero.
        if buffer will be forced to cross over the threshold from the hot region to cold region,
        then oracle resets the buffer tch=1 (reset value is controlled by _db_aging_cool_count and default value 1) ?

        Comment by sean_kim — October 2, 2012 @ 12:58 am BST Oct 2,2012 | Reply

    • Sid,

      I’ve just been re-reading these comments and realised I misunderstood what you were saying. It is just a simple error (which I’ve now added to the Errata) in the Note at the top of page 107. It should say “changes the touch count on that buffer to one.”

      Comment by Jonathan Lewis — March 10, 2013 @ 1:32 pm BST Mar 10,2013 | Reply

  9. Hi Jonathan,

    In page 116 of the section “Loading a Hash Chain” in Chapter 5.

    —quote—

    SQL> select     name, level#
      2  from       v$latch
      3  where      name in ('cache buffers lru chain','cache buffers chains')
      4  /
     
    NAME                               LEVEL#
    -------------------------------    ------
    cache buffers lru chain                 2
    cache buffers chains                    1
    

    The cache buffers chains latch has a lower level than the cache buffers lru chain latch, so we can’t request the cache buffers lru chain latch in willing-to-wait mode if we’re already holding the cache buffers chains latch. Think about what this means: we’re holding the cache buffers chains latch (which I will call the hash latch for the rest of this subsection) because we’ve just searched the hash chain for a buffer and discovered that, for whatever reason, we need to add another buffer to the chain. So we have to acquire the cache buffers lru chain latch (which I will call the lru latch for the rest of this subsection) to move a buffer from the REPL_AUX list to the midpoint of the REPL_MAIN list; but we can’t request it in willing-to-wait mode because we’re already holding a lower-level latch.


    But if you can’t get the lru latch with an immediate get, you have to drop the hash latch, get the lru latch, and then get the hash latch again.

    —quote—

    The test from me and other folks show that Oracle will get the lru latch in willing-to-wait mode while holding the hash latch, opposite to what you suggested. I have summary the result at my blog. It’s a long post that i’ll not paste here. You may want to check it out: http://sid.gd/latch-level/

    Comment by Sid — February 14, 2012 @ 1:12 pm BST Feb 14,2012 | Reply

    • Sid,

      Thanks for that – you’re absoutely correct, I got the significance of the latch levels the wrong way round.
      If you try to get a LOWER latch level in willing to wait mode while holding a higher latch level, then your session raises an ORA-00600 error, with parameter 1 set to 504, e.g.:

      ORA-00600: internal error code, arguments: [504], [0x2D5703C4], [32], [4], [business card], [2], [0], [0x0], [], [], [], []

      The other parameters are:
      [child latch addr] [???] [level#] [name] [child#] [??] [??]

      I like the way you decided to use kslgetl, rather than poke – there was always the possibility that hacking directly into memory would miss some important operation.

      I used the “flashback copy” latch (level 5) and the “business card” latch (level 4) on a small 11.1.0.7 system to set up the crash that produced the error above, by the way.

      Comment by Jonathan Lewis — February 15, 2012 @ 1:44 am BST Feb 15,2012 | Reply

      • Jonathan,

        I also encountered the ORA-00600 504 if I issue query on the session holding the lru latch.

        ORA-00600: internal error code, arguments: [504], [0x3A76C274], [8196], [1], [session idle bit], [2], [0], [0x0], [], [], [], []

        If you attempt to release a latch you are not holding, 510 error will be triggered.

        sys@CS11GR2> oradebug peek 0x399DC524 16
        [399DC524, 399DC534) = 00000000 00002F90 00000096 00000002
        sys@CS11GR2> oradebug call kslfre 966640932
        ORA-00600: internal error code, arguments: [510], [0x399DC524], [cache buffers lru chain], [], [], [], [], [], [], [], [], []

        Comment by Sid — February 15, 2012 @ 5:22 am BST Feb 15,2012 | Reply

  10. Hi Jonathan
    I love this book – and I love this section the most.
    However, I was wondering what happens when the Database Smart Flash Cache features is enabled in 11g? There is now a “second tier” of buffer cache on (presumably flash) storage somewhere. Where does the COLD_HD pointer point to in this scenario? I presume that it would still point to somewhere in the “first tier” of buffer cache, i.e. somewhere in the middle of the portion that remains in memory?

    Comment by Dmitri — February 14, 2012 @ 2:08 pm BST Feb 14,2012 | Reply

    • Dmitri,

      Thanks for the comment, I’m glad you like it.

      I don’t know how Oracle is handling the database flash cache, but there is a tiny reference in chapter 6, page 147, where I have a description of x$kcbwds and the various linked list pointers that reference L2R, L2W and L2K, with a follow-up note on page 148 where I mention a new cache buffers lru chains latch and surmise that there may be a parallel set of structures in the SGA doing for the smart flash cache the things that we know about for the traditional buffer cache.

      Comment by Jonathan Lewis — February 15, 2012 @ 2:12 am BST Feb 15,2012 | Reply

  11. Hello Jonathan,

    Thanks for this great book. I read it with great pleasure and it clarifies many open questions I had concerning Oracle.

    I have a comment regarding the discussion in chapter 5, page 95 about the reasons why Oracle choose to use two separate arrays for the buffers and buffer headers. An additional reason I believe could be driving this decision is the efficient use of hardware caches.

    With all the linked lists linking the buffer headers, there will be many times when different elements from the buffer header array will be accessed consecutively. If the cache line size of any hardware cache is sufficiently large compared to the buffer header size, it will improve the cache efficiency of this cache if the buffer headers are stored in a physically contiguous memory area.

    Also on average less TLB entries will be required to access any subset of all the buffer headers (e.g. when following a linked list) if the buffer headers are stored in a physically contiguous area (as opposed to having them spread throughout the entire granule).

    I don’t know if a granule is physically contiguous memory (do you know?). However I believe the above reason holds even if the granules are not physically contiguous. By having the buffer headers stored in a contiguous address range (of virtual memory), they will at least physically end up in a smaller number of distinct physical memory pages than if they where scattered throughout the entire (virtual) address range covering the granule.

    kind regards
    Martin

    Comment by Martin Maletinsky — March 19, 2012 @ 10:22 pm BST Mar 19,2012 | Reply

    • Martin,

      Thanks for the observation.

      I have to say that I usually go to James Morle or Kevin Closson when I have questions that get that close to the hardware – but it certainly sounds like the sort of thing that the Oracle developers have done in other parts of the code to make extremely frequent operations a little bit faster.

      Comment by Jonathan Lewis — March 31, 2012 @ 6:39 am BST Mar 31,2012 | Reply

  12. Hello Jonathan,

    I tried to experimentally verify your claim (page 116) that consistent read copies may be created in another pool than the original block belongs to. I set up the experiment you find below, but I could not find any evidence of your claim (the consistent read copies I saw were always created in the same pool as the original block). According to your note on page 116 I would have expected the consistent read copy for table_k to be created in the recycle pool – it was however created in the keep pool (see below).

    Maybe I misunderstood something and there is a flaw in my experimental setup or in the query I use to verify the outcome – could you please tell me how I can modify my experiment to see current read copies being created in another pool than the orignal block belongs to?

    Thanks a lot for your help
    kind regards
    Martin

    Test Setup / participating sessions
    ———————————————-

    I use 3 sessions in my test:

    Session A: creates the tables, makes the blocks dirty and does some queries to show DB version and statistics of Session B

    Session B: executes the query that needs the consistent read copies
    SYSDBA SESSION: repeatedly executes the following query to view the blocks in the buffer cache

    -------------------------------------------------
    -- QUERY X$ VIEWS TO VIEW BUFFERED BLOCKS AS SYSDBA
    ----------------------------------------------------
    --
    COLUMN NAME      FORMAT a20 HEADING 'NAME'
    COLUMN BP_NAME   FORMAT a10 HEADING 'POOL_NAME'
    COLUMN STATE     FORMAT 999 HEADING 'STATE'
    --
    select count(*)
          ,substr(o.object_name, 1, 20) name
          ,pool.bp_name
          ,bh.state
          ,DECODE(bh.state,0,'FREE',1,'XCUR',2,'SCUR',
                           3,'CR',4,'READ',5,'MREC',6,'IREC')  state_n
      from x$bh          bh
          ,dba_objects   o
          ,x$kcbwds      wds
          ,x$kcbwbpd     pool
     where o.object_name in ('TEST_DF', 'TEST_K', 'TEST_R')
       and bh.obj=o.data_object_id
       and bh.set_ds = wds.addr
       and pool.indx = wds.pool_id
      group by o.object_name,pool.bp_name,bh.state
     order by name, state_n;
    
    
    Test Execution:
    ---------------------
    
    ----------------------------------------------------------------------
    -- 0. SESSION A - Show DB-Version and statistics for Session B (SID 6)
    ----------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select * from v$version;
    
    BANNER
    --------------------------------------------------------------------------------
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    PL/SQL Release 11.2.0.3.0 - Production
    CORE    11.2.0.3.0      Production
    TNS for Solaris: Version 11.2.0.3.0 - Production
    NLSRTL Version 11.2.0.3.0 - Production
    
    5 rows selected.
    
    K.AF234C.AVALOQ> 
    K.AF234C.AVALOQ> 
    K.AF234C.AVALOQ> select *
      2    from v$sesstat  stat
      3        ,v$statname name
      4   where stat.sid = 6
      5     and name.name in ('CR blocks created', 'switch current to new buffer')
      6     and stat.statistic# = name.statistic#;
    
           SID STATISTIC#      VALUE STATISTIC# NAME                                                                  CLASS    STAT_ID
    ---------- ---------- ---------- ---------- ---------------------------------------------------------------- ---------- ----------
             6        146          0        146 CR blocks created                                                         8  516801181
             6        148          0        148 switch current to new buffer                                              8 1312802324
    
    2 rows selected.
    
    
    ----------------------------------------
    -- 1. SESSION A - PREPARE TEST TABLES
    ----------------------------------------
    
    create table test_df(id number(9), txt varchar2(3000)) storage(buffer_pool default);
    
    create table test_k(id number(9), txt varchar2(3000)) storage(buffer_pool keep);
    
    create table test_r(id number(9), txt varchar2(3000)) storage(buffer_pool recycle);
    
    insert into test_df values(1, lpad('x', 2000, 'x'));
    
    insert into test_k values(1, lpad('x', 2000, 'x'));
    
    insert into test_r values(1, lpad('x', 2000, 'x'));
    
    commit;
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
            16 TEST_DF              DEFAULT        1 XCUR
            16 TEST_K               KEEP           1 XCUR
            16 TEST_R               RECYCLE        1 XCUR
    
    3 rows selected.
    
    -------------------------------------------
    -- 2. SESSION A - INSERT ROWS, DO NOT COMMIT;
    -------------------------------------------
    
    insert into test_df values(2,'abc');
    
    insert into test_k values(2,'abc');
    
    insert into test_r values(2,'abc');
    
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
            16 TEST_DF              DEFAULT        1 XCUR
            16 TEST_K               KEEP           1 XCUR
            16 TEST_R               RECYCLE        1 XCUR
    
    3 rows selected.
    
    
    -------------------------------------------------------------------------------------------
    -- 3. SESSION B - QUERY TABLES (NEEDS TO CREATE CONSISTENT COPIES TO HIDE UNCOMMITTED CHANGES)
    -------------------------------------------------------------------------------------------
    
    select * from test_df;
    
    select * from test_k;
    
    select * from test_r;
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
             1 TEST_DF              DEFAULT        3 CR
            16 TEST_DF              DEFAULT        1 XCUR
             1 TEST_K               KEEP           3 CR
            16 TEST_K               KEEP           1 XCUR
             1 TEST_R               RECYCLE        3 CR
            16 TEST_R               RECYCLE        1 XCUR
    
    6 rows selected.
    
    ----------------------------------------------------------------------------------------------
    -- SESSION A - QUERY STATISTICS OF SESSION B (SID 6) to confirm increase in CR blocks created
    ----------------------------------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select *
      2    from v$sesstat  stat
      3        ,v$statname name
      4   where stat.sid = 6
      5     and name.name in ('CR blocks created', 'switch current to new buffer')
      6     and stat.statistic# = name.statistic#;
    
           SID STATISTIC#      VALUE STATISTIC# NAME                                                                  CLASS    STAT_ID
    ---------- ---------- ---------- ---------- ---------------------------------------------------------------- ---------- ----------
             6        146          3        146 CR blocks created                                                         8  516801181
             6        148          0        148 switch current to new buffer                                              8 1312802324
    
    2 rows selected.
    
    K.AF234C.AVALOQ>    
    
    -------------------------------------------------------------------------
    -- SESSION B - SHOW SID
    -------------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select sys_context('userenv', 'sid') from dual;           
    
    SYS_CONTEXT('USERENV','SID')
    ----------------------------------------------------------------------------------------
    6
    
    1 row selected.
    

    Comment by Martin Maletinsky — May 14, 2012 @ 8:56 am BST May 14,2012 | Reply

    • I’ve run your query on a 11.2.0.3 database with configured keep/recycle buffer pools of 5/1 GB, several tables assigned to the KEEP pool and based on the output I see following: CR block copies of a segment from KEEP of DEFAULT pool may be created in KEEP/RECYCLE or DEFAULT/RECYCLE pools correspondingly.

      Comment by Timur Akhmadeev — May 14, 2012 @ 12:14 pm BST May 14,2012 | Reply

      • Hello Timur,

        Thank you for your quick reply.
        You write “may be created …” – what conditions determine if the CR copies are created in the original cache or in the recycle cache?

        As you see in the above output, I used 11.2.0.3 as well. Being a developer database, my database has however much smaller caches: 304 MegaByte default buffer cache and 48 MegaByte for keep and recycle cache each. Does the behaviour change with the pool size (and if so, what is the reason for that)?

        thanks a lot
        kind regards
        Martin

        Comment by Martin Maletinsky — May 14, 2012 @ 2:18 pm BST May 14,2012 | Reply

        • Hello Martin

          You write “may be created …” – what conditions determine if the CR copies are created in the original cache or in the recycle cache?

          This is beyond my knowledge. And I don’t know how to make a reliable test for such a research.

          Comment by Timur Akhmadeev — May 14, 2012 @ 2:37 pm BST May 14,2012

      • Martin,

        Thanks for doing this experiment and posting the results. I can’t say why your results show the CR clones being created in the same buffer pool – but a couple of thoughts come to mind:

        a) the mechanism that Oracle uses for updates by tablescan is not the same as the mechanism for updates by indexed access – and, of course, the mechanisms do vary with version. Perhaps you would see a different pattern of behaviour if you did your updates by indexed access.

        b) it’s possible (though I think unlikely) that Oracle treats blocks with committed transactions that have to be reversed out differently from blocks which have only uncommitted transactions. If you ran your experiment but made Session B issue “set transaction read only” before doing the updates from session A you might see a difference.

        There’s an example of stats from a production system – and the side effects – in this article: http://jonathanlewis.wordpress.com/2006/11/21/free-buffer-waits/

        Comment by Jonathan Lewis — May 24, 2012 @ 12:18 pm BST May 24,2012 | Reply

  13. Hi Jonathan,

    You didn’t mention anything about the CACHE property of a table in your section on full tablescans. After reading this part of your book, I did some experiments that suggest that this only results in the touch count being incremented for full tablescans on large tables when the table is in the keep buffer pool! This surprised me as I could have sworn that I have seen performance improvements setting the CACHE attribute on largish tables that are frequently accessed with FTS on systems with no keep pool. Maybe I was dreaming. Also, my experiments suggest that index fast full scans always increment TCH regardless of the buffer pool or the CACHE attribute on the underlying table.

    Comment by tonyhasler — August 9, 2012 @ 11:24 am BST Aug 9,2012 | Reply

    • Tony,

      Thanks for the note.

      I think you can probably trust your memory about the behaviour – there have been so many variations on the interaction of the CACHE parameter and the various buffer pools across the last few versions of Oracle that there’s probably a specific patch release where your desrciption is absolutely correct.

      Comment by Jonathan Lewis — August 9, 2012 @ 12:29 pm BST Aug 9,2012 | Reply

  14. I forgot to mention that my observations were on 11.2.0.2.

    Comment by tonyhasler — August 9, 2012 @ 2:21 pm BST Aug 9,2012 | Reply

  15. Hi. jonathan.
    i have question about TCH.
    basically, I think that buffer is moved to the MRU then its touch count is reset to 0.
    only if _db_aging_stay_count >= _db_aging_hot_crieria then touch count is set to the current touch count divied by two.

    so, ficture 5-5. tch 4-> tch 2 is right ?

    Comment by sean_kim — October 2, 2012 @ 12:42 am BST Oct 2,2012 | Reply

    • Sean Kim,

      Thanks for the comment.

      I know I’ve seen at least one article saying that the TCH drops to zero when a buffer is promoted to the MRU end of the list. I also know that the tests I did shortly after reading the article (playing very carefully with a small buffer cache and watching x$bh) showed the TCH halving. I haven’t repeated the tests on a very recent version of Oracle, so I don’t know the current state of play – and I could be wrong. I do know that I was puzzled by both possibilities given the way the TCH changed to 1 as the buffer hit the midpoint anyway.

      It’s possible that Oracle corp. had been using all those hidden parameters to adjust the rate at which blocks were allowed to drop off the cache, and there are version dependencies that make a difference. Your observation about the relationship between those two parameters is an interesting insight into how convoluted the mechanisms may be. One day I may find the time to work through the same tests on the latest versions.

      Comment by Jonathan Lewis — December 31, 2012 @ 11:50 am BST Dec 31,2012 | Reply

  16. Hi Jonathan,
    Many thanks for such an excellent book. As an Oracle newbie, I have difficulty understanding the bitmap segment space management of ASSM tablespace you explain under “Reducing Table Contention(ASSM)” in chapter5. Several searches I have made show the internal structure of ASSM blocks as organised in the form of a tree consisting of the segment header and bitmap block levels(you make mention of bitmap levels 1 and 2 in page99) which really baffles me.

    Doing it in your inimitable style, I will be grateful for explanations taking into consideration the following:
    1. the allocation and internal structure of ASSM blocks
    2. the levels of bitmap blocks
    3. the bits used in determining free space in a data block.

    I also appreciate any links to resources that can help my better understanding on the above issues. Many thanks in advance.

    Comment by Ernest — October 18, 2012 @ 3:25 am BST Oct 18,2012 | Reply

    • Ernest,

      Your question requires a very lengthy answer, which I don’t have time to supply here. But a short note on the “tree” structure might help.

      Level 1 bitmap blocks contains a list of consecutive blocks in the segment with an approximate measure of how used they are; usage is “unformatteed, 25% used, 50% used, 75% used, full”. (It’s possible that level 1 bitmap blocks also list themselves (and the other space management blocks) with a status like “system”, but I’ll have to check that). At the most extreme case I believe a single level 1 bitmap block can list 1024 consecutive blocks in the segment.

      Level 2 bitmap blocks contain a list of level 1 bitmap blocks with a rough indication of how much free space each level 1 bitmap block is (indirectly) reporting as available.

      Level 3 bitmap blocks contain a list of level 2 bitmap blocks with a rough indication of how much free space each level 2 bitmap block is (doubly indirectly) reporting as available. The segment header block has an alternative identity as a level 3 bitmap block, and can list multiple level 3 bitmap blocks, and pointers to the next “most appropriate” level 3, 2, or 1 bitmap block to check when a session needs to search for space.

      The way in which level 3 bitmaps point to level 2 bitmaps which point to level 1 bitmaps gives you the image of the tree-like structure that is occasionally used in explanations of these segment space management blocks.

      There are a few articles on the Oak Table feed about ASSM, some of them may be of interest (some are from my own blog, of course). Otherwise I think Christian Antognini has some good stuff in his book – but a quick search on Google suggests that most of the information available is either trivial, out of date, copied from Oracle publications, wrong, or descriptions of exotic bugs.

      Here’s something I wrote on the OTN forums a few years ago that might also be of interest: https://forums.oracle.com/forums/message.jspa?messageID=3151619#3151619 though I’m not certain about the comment about linked lists of L3 blocks, and I’ve been told of an anomaly where parallel direct inserts can result in lots of L3 blocks appearing in a single (fairly small) segment.

      Comment by Jonathan Lewis — October 18, 2012 @ 9:48 am BST Oct 18,2012 | Reply

      • Hi Jonathan,
        Thanks for the excellent response. That together with the sources you recommended has helped a lot. However, I still have some dim areas I feel you can only make clear.
        1. About the Segment header block. I understand in MSSM it contains information about the pointers to the various freelists. If that is right what other information does it contain? And what is its role in ASSM? Also some sources show it as the first block of the first extent when an object is created which is subject to change as DML activities occur on the object. In ASSM it is often depicted as the 3rd block with the first 2blocks as L1 and L2 bitmap blocks respectively. I greatly appreciate your explanation on its location in both MSSM and ASSM blocks.
        2. What is transaction freelist? I can get any clear-cut explanation on that. If it is the same as Interested Transaction List then your response to only question 1 is appreciated.

        I grateful for your time in spite of your tight schedule. Thanks in advance.

        Comment by Ernest — November 8, 2012 @ 10:02 am BST Nov 8,2012 | Reply

        • Ernest,

          1) In both ASSM and freelist management (MSSM) the segment header holds (amongst other bits) the extent map for the segment – this is the list of extents that belongs to the segment and, for locally managed tablespaces, is what allows Oracle to avoid recording extents in the dictionary tables fet$ (free extents) and uet$ (used extents). The ASSM equivalent of the freelist pointers would (very roughly) be a map of the various space management bitmap blocks, with hints about which is the next most desirable 1st, 2nd, or 3rd level bitmap to examine for free space.

          2) Freelists come in three flavours: master freelist (one per freelist group block), segment freelists (as many as specified in the object storage parameter freelists and transaction freelists (up to a fairly large number, constrainted by the block size, number of segment freelists, and whether or not you’re using freelist group blocks: it could be as high as 307 for an 8KB block size).

          When a session makes enough free space in a block for it to be reusable (i.e. it crosses the pctused boundary) the session picks one of the transaction freelists and attaches the block to it. (I won’t go into the mess of what happens if the session rolls back.)

          When a session needs a free block, it uses a hashing function based on the process id to select which segment free list it should be using, and takes a block from the top of that list. If there are no blocks on that segment freelist the session takes 5 blocks from the master freelist and attaches them to the segment free list; if there are no blocks on the master freelist the session sweeps the transaction freelists (for committed transactions) and moves their blocks to the master freelist; if this step doesn’t give it any free blocks the session bumps the segment high water mark (which may mean adding an extent) and formats 5 new blocks that can then be attached to its segment freelist.

          Comment by Jonathan Lewis — December 31, 2012 @ 11:37 am BST Dec 31,2012

  17. Hello Jonathan,

    To the end of page 108, when discussing the mechanisms to move buffers from the REPL_MAIN to the REPL_AUX list, you write “I would assume that a buffer will only be transferred if the reader can get the relevant cache buffer chains latch in immediate mode…”.
    Why is it necessary to get hold of the cache buffer chains latch in order to transfer a buffer from REPL_MAIN to REPL_AUX? As far as I understood, the buffer content is not changed by this operation (i.e. the buffer still holds the same block after it has been moved to REPL_AUX). I would therefore expect the buffer to remain attached to the same hash bucket and I thought that the cache buffer chain latches only purpose was to protect the hash chains against concurrent addition/removal of buffers.

    thanks a lot
    kind regards
    Martin

    Comment by Martin Maletinsky — January 8, 2013 @ 10:49 pm BST Jan 8,2013 | Reply

    • Martin,

      Your comment makes sense, and right now I can’t think of the reason that prompted me to make that comment – I should have explained it when I wrote it.

      Possibly it was the thought that the transfer shouldn’t take place if the block was pinned (because it’s in use) or – and this is the point of the immediate bit of the latch get – if another session is doing a “consistent get – examination”.

      Update (March): It’s amazing how easy it is to head off in the wrong direction because of the way a question is asked. I fooled myself into a state of confusion because Martin pointed out that “the buffer was not changed” – but the cache buffers chains latch doesn’t act to protect the BUFFER, it acts to protect the buffer HEADER. So any change to the buffer header (including the change to the repl_main / repl_aux linked list pointers) has to be protected by the cache buffers chains latch.

      Comment by Jonathan Lewis — January 12, 2013 @ 12:27 pm BST Jan 12,2013 | Reply

  18. Hello Jonathan,

    On page 110 you write:
    “If I wanted to superimpose the replacement lists on Figure 5-6, they would appear as two tangled lines weaving their way randomly through the buffer headers, obeying only the rules that (a) every buffer header would be on one of the lines …”.

    Is that really correct? You may have buffers that are on neither of the REPL_MAIN, REPL_AUX chains (otherwise the fileds CNUM_SET, CNUM_REPL and CNUM_AUX in x$kcbwds would be redundant). This can happen e.g. in case buffers are moved from the REPL_ lists to the WRITE_MAIN list as described in chapter 6, page 144ff – If I understand the mechanisms correctly, those buffers might (will?) remain on their hash chain (?). This would however violate the claim “every buffer header would be on one of the lines”.

    thank you
    kind regards
    Martin

    Comment by Martin Maletinsky — January 9, 2013 @ 10:06 pm BST Jan 9,2013 | Reply

    • Martin,

      You’re correct, the statement is fundamentally wrong. It’s one of the “simplifications” I present all the way through the book, but it’s one that I should have stated was a simplification and supplied with a forward reference to the write lists et. al.

      Comment by Jonathan Lewis — January 12, 2013 @ 12:54 pm BST Jan 12,2013 | Reply

  19. Hello Jonathan,

    When reading your comment on consistent gets – examination on page 114, two questions came into my mind. I have a hypothesis on the answer and I’d appreciate if you could confirm my hypothesis or tell me why I am wrong.
    i) What qualifies a logical read to be done at “half-price”, i.e. with only one latch get and without pinning the buffer?
    ii) How does Oracle ensure there is no concurrent modification of the buffer while it is being read with a “half-price” logical IO, i.e. not protected by a pin? Concurrent modification of the buffer while it is being read might provide the reader with an inconsistent view of the buffer content (it might see only part of the modifications made to the buffer).

    I believe the answer to the two questions is as follows, but I haven’t verified it with any tests:
    When Oracle expects that inspection of the buffer content can be completed within a short time, it does this inspection under protection of the cache buffers chains latch protecting the hash chain the buffer is currently linked on. I.e. after finding the buffer and while holding the cache buffers chains latch, Oracle checks that no exclusive pin is attached to the buffer. If this check succeeds, Oracle will inspect the content of the buffer (while still holding the cache buffers chains latch) and release the cache buffers chains latch, once the inspection of the buffer content is completed.
    A separate code path must exist to handle the case where the buffer is pinned exclusively and thus may be subject to ongoing modifications – it doesn’t seem unreasonable to me to assume that Oracle will fall back to the ordinary logical IO in this case, ie. attach a shared pin to the waiters list of the buffer and wait for the exclusive pin to be released.

    thank you for reviewing this statement
    kind regards
    Martin

    Comment by Martin Maletinsky — January 22, 2013 @ 8:58 pm BST Jan 22,2013 | Reply

    • Martin,
      I think you’ve covered most of the points.

      There is also the detail that the examination is attempted only if Oracle thinks that it is likely to be completed in a short time. For example when accessing an undo block to find an undo record, evey detail about where that record will be is fixed and there’s no chance Oracle will have to do the equivalent of something like a block cleanout to find the right version of the record.

      Essentially, it probably comes down to a trade-off between the time it takes to create, attach and subsequently release a pin, compared to the time it will probably take to locate and extract the correct piece of information from the block.

      Comment by Jonathan Lewis — January 28, 2013 @ 6:21 pm BST Jan 28,2013 | Reply

  20. Hello Jonathan,

    On page 115 you mention that Oracle does “switch current to new buffer” and you mention this mechanism again on page 116. Could you please briefly sketch what this mechanism is about or give a link where it is documented?

    thanks a lot
    kind regards
    Martin

    Comment by Martin Maletinsky — January 22, 2013 @ 9:04 pm BST Jan 22,2013 | Reply

    • Martin,

      I don’t think there’s a lot to be said about this mechanism.

      Basically a session may clone a CUR buffer before modifying it. The previous CUR copy is switched to CR as this takes place. The easiest way to see this is create a table with a few hundred block then update a few rows in the table by tablescan; each time you do the update you’ll see that the blocks you’ve updated are cloned before the update is applied.

      Comment by Jonathan Lewis — January 28, 2013 @ 6:33 pm BST Jan 28,2013 | Reply

      • Hello Jonathan,

        Thank you for your reply. What is the benefit of this processing?
        You need to find an available buffer, link it into the appropriate hash buffer chain, do the actual cloning (memory copy), and potentially exchange the old buffer on the checkpoint queue with the new current buffer (as described in chapter 6, page 140) – i.e. quite some additional work. On top of it this makes the hash buffer chains longer (because there is an additional copy of the same data block, which might never be used again) and it requires additional space in the buffer cache.
        On the plus side I only see that other sessions that need a CR copy with a target SCN prior to the modification can access the old buffer, rather than having to create a CR copy by themselves. Is this the entire justification for the “switch current to new buffer” mechanism or are there any additional benefits?

        thank you
        kind regards
        Martin

        Comment by Martin — March 16, 2013 @ 10:49 am BST Mar 16,2013 | Reply

  21. Hello,

    I found the note on top of page 115 a bit short and didn’t fully understand what it was saying. When browsing Internet to find some clarification I came across the following blog that Jonathan has written and which helped me to fully understand the note. I just post the link here in the hope it may be useful to other readers.
    http://www.jlcomp.demon.co.uk/buffer_handles.html

    kind regards
    Martin

    Comment by Martin Maletinsky — January 23, 2013 @ 7:16 pm BST Jan 23,2013 | Reply

  22. Hello Jonathan,

    I have some questions regarding the “Tablescans” section at the end of chapter 5.

    When you describe the 8i and 9i strategy for “small tables”, you say the reads during scans of small tables were treated just as other reads but the touch count was not incremented when reading the blocks (nor during subsequent visits). This leads me to the following questions:
    i) Is a newly read buffer linked at the midpoint into REPL_MAIN (i.e. at the location pointed to by cold_hd) as with other reads?
    ii) In the “LRU/TCH” section starting on page 104 you don’t mention that the touch count is incremented for blocks that are (normally) read into the cache and linked at the midpoint into REPL_MAIN. Rather the touch count is set to 1 for these blocks. So what does it mean when you say that the touch count is not incremented for blocks read during a full scan of a small table – is the touch count set to 0?
    iii) How does Oracle ensure that the touch count of blocks read during a full scan of a small table is not incremented during subsequent visits? Is there a flag on the buffer preventing such increments or is it maybe the touch count value 0 that indicates the block was read in by a full scan of a small table (see my guess in question ii).

    vi) When you write “…the table was larger than the 2 percent limit, then the buffers used to read the table were immediately dropped to the end of the LRU list..”, does “end of the LRU list” mean REPL_AUX in this context, or the LRU end of REPL_MAIN?

    thank you
    kind regards
    Martin

    Comment by Martin Maletinsky — January 23, 2013 @ 9:50 pm BST Jan 23,2013 | Reply

    • Martin,

      i) & ii) I had to go back and read some old test scripts to check – and they say that the tch is not set (i.e. starts at zero) in 8i and 9i for tables less than 2% of the cache size, and that the blocks were loaded to the midpoint.

      iiI) There is a bitmapped FLAGS column on the buffer header, and bit 19 is named “only_sequential_read”. This is set if the table was subject to a full tablescan. I have always assumed that this is what stopped subsequent buffer accesses from incrementing the tch.

      iv) good question – it got a lot more complicated in recent versions – I suspect in the older versions when we said “dropped to the end of the LRU” we should have been saying “transferred to the repl_aux”. Have you read the investigation into “small” tables I did a couple of years ago ? http://jonathanlewis.wordpress.com/2011/03/24/small-tables/

      Comment by Jonathan Lewis — January 28, 2013 @ 7:54 pm BST Jan 28,2013 | Reply

  23. Jonathan,

    Chapter 5 is one of my favorite from your book (with 2 and 3), there is couple of very interesting useful (often even amazing) details.
    Please let me ask one additional question regarding Shared vs. Exclusive “cache buffers chain” latches. Surprisingly, oracle seems to grab exclusive latch when buffer is pinned (and Shared CBC latch for single-get visits), and still don’t have satisfactory explanation for that. Why is not adding shared pin enough?

    Comment by Pavol Babel — October 8, 2013 @ 12:16 am BST Oct 8,2013 | Reply

    • … I mean when buffer is pinned, exclusive CBC latch is grabbed. Can’t find reason why it is not enough to get shared CBC latch + add shared pin to buffer header…

      Comment by Pavol Babel — October 8, 2013 @ 12:37 am BST Oct 8,2013 | Reply

    • Pavol,

      If you want to pin a buffer you’re changing the content of the chain – so you have to take it in exclusive rather than shared mode otherwise someone else could take it in shared mode and attempt to pin the same buffer at the same time, and that would allow them to mess up the doubly-linked list of buffer pins.

      Comment by Jonathan Lewis — October 11, 2013 @ 8:05 pm BST Oct 11,2013 | Reply

      • Jonathan,

        could you go into little deeper details please? When pinning, we need to grab buffer handle (hopefully we have some reserved), and change pointers US_NXT and US_PRV inside buffer header to point to new pin.
        So it is true we are changing content of the buffer chain (changing content of buffer header), but we are not relinking (deleting / addding buffer from/to chain) it.

        Comment by Pavol Babel — October 11, 2013 @ 9:31 pm BST Oct 11,2013 | Reply

      • Jonathan,

        of course you can ignore my previous post ;) I think I get the point now. Although it is not possible to get two shared latches on same buffer chain at the same time (one latch will be get always first, since latch waiter list has to be maintained for future possible exclusive waiters), both processes could have easily met on buffer pins list of the same buffer header.
        In other words, the buffer pin list of each buffer header is protected only by CBC latch, so exclusive CBC is definitely needed.

        Comment by Pavol Babel — October 11, 2013 @ 9:42 pm BST Oct 11,2013 | Reply


RSS feed for comments on this post.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,910 other followers