Oracle Scratchpad

OC 5 Caches and Copies

Addenda and Errata for Oracle Core Chapter 5 Caches and Copies

Back to Index

Addenda

p.94  Second paragraph, I make the comment: “…it’s possible that there are other variations on limits and granule sizes that I haven’t yet come across.” Charles Hooper has added a few more variations in a comment he made on the Glossary, which links to a note he published in May 2011.

Errata

p.94 first para, first line: “automatic system memory management” should be “automatic shared memory management”.
p.101 The Note: “set parameter _disable_cpu_check to false” should be “set parameter _disable_cpu_check to true”.
p.103 Figure 5.4: In the first of the two boxes labelled x$kcbwds, the far right-hand entry should be prv_replax rather than prv_repl.
p.106 Penultimate paragraph: I say that the touch count is halved as a buffer moves from the cold end of the LRU to the hot end. This is incorrect: the touch count drops to zero. (I have some notes from the very early days of the LRU/TCH (i.e. 8.1) which say I saw the halving effect, but even in 8.1.7 the parameter _db_aging_stay_count exists, is set to zero, and has the description “Touch count set when buffer moved to head of replacement list”. Of course, there are many cases where zero means “use a default internal value or action”, but it seems unlikely in this case.)
p.107 Note at top of page: “changes the touch count on that buffer to zero”, should be “changes the touch count on that buffer to 1”.
p.110 Second para: “(a) every buffer header would be on one of the (repl) lines”. This is a simplification (i.e. it’s not true) aimed at highlighting the basic function of the LRU lists and latches. I should have included some sort of warning or forward reference to the discussion of the other linked lists, such as the write lists described in Chapter 6. See Martin’s comment below.
p.114 Section “Logical I/O” 3rd Para (At this point…), second line: “consistent get – examination” should be “consistent gets – examination”
p.115 Section “Updates”, 2nd para, second line: “buffer busy wait” should be “buffer busy waits wait”.
p.115 Section “Updates”, last para, penultimate line: “… older clones on to the replacement list if you …” should be “… older clones on to the auxiliary replacement list if you …”
p.115 Section “Loading a Hash Chain” says: “As I commented in Chapter 4, a process is not allowed to request a latch in willing-to-wait mode if it is already holding a lower-level latch.” This statement is the wrong way round (see note below from Sid) – you cannot request a latch in willing-to-wait mode if you are already holding a higher-level latch; this error makes the subsequent comments about the complexity involved in dropping and re-acquiring cache buffers chains latches irrelevant – you don’t have to drop the latch. There is another odd error in the same sentence – I don’t say anything about the latch level and its use in controlling the order in which willing-to-wait gets can be made.
p.118 2nd para after the Note: “If the optimizer thought …” should be “If the runtime engine thought …”
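To make the corrected p.101 Note concrete, here is a minimal sketch of the 11.2 experiment. This is an illustration only – underscore parameters should not be set on production systems without Oracle Support’s guidance, and the cpu_count value shown is just an example:

```sql
-- p.101 (corrected): on 11.2, _disable_cpu_check must be set to TRUE
-- before a manually set CPU_COUNT takes effect for this experiment.
alter system set "_disable_cpu_check" = true scope = spfile;
alter system set cpu_count = 4 scope = spfile;
-- Restart the instance, then check the resulting memory allocations,
-- e.g. in v$sgainfo or v$memory_dynamic_components.
```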

Typos/Grammar/Style/Punctuation


72 Comments

  1. Hi Jonathan,

    On page 3 of Chapter 5, you mentioned the following:
    “Note: Buffer headers are tied very closely to buffers – there is a permanent one-to-one link between a row in the array of buffers and the corresponding row in the array of buffer headers”.

    Does it mean that the total number of rows in x$bh is the same as the total number of blocks cached in memory (_db_block_buffers)?
    In my 11gR2 database on a Windows 32-bit OS with sga_target=252M, I am getting the following result.

    SQL> select count(*) from x$bh;

      COUNT(*)
    ----------
         13218

    NAME               VALUE   DESCRIPTION
    ------------------ ------- -------------------------------------------------------------
    _db_block_buffers  15376   Number of database blocks cached in memory: hidden parameter

    Thank you.
    Antony

    Comment by orapsdbaAntony — September 29, 2011 @ 6:42 pm BST

    • Antony,

      In general (at steady state) I would expect the number of rows in x$bh to match the number of blocks that could be cached – but (a) I wouldn’t trust _db_block_buffers to be telling the truth, and (b) there are transitional states where I’m not 100% sure how Oracle would behave. One obvious transition is in the interval after startup (which I do know about), another is during a deferred resize operation when (for example) you might see data buffer space allocated in the shared pool as “KGH: No Access” entries.

      To give you some ideas of cross-checking: shortly after starting up an 11.2.0.2 instance, when I can see that _db_block_buffers = 30256, I get the following results:

      SQL> select count(*) from x$bh;
      
        COUNT(*)
      ----------
            7449
      
      
      SQL> select set_msize from v$Buffer_pool_statistics;
      
      SET_MSIZE
      ---------
          28768
      
      SQL> select cnum_repl, anum_repl from x$kcbwds where cnum_repl != 0;
      
       CNUM_REPL  ANUM_REPL
      ---------- ----------
           14384      10633
           14384      10644
      
      SQL> select current_size, granule_size, current_size/granule_size
        2  from v$memory_dynamic_components where component = 'DEFAULT buffer cache'
      ;
      
      CURRENT_SIZE GRANULE_SIZE CURRENT_SIZE/GRANULE_SIZE
      ------------ ------------ -------------------------
         243269632      4194304                        58
      
      

      You can see that x$bh currently doesn’t seem to have an entry for every buffer in the buffer pools; v$buffer_pool_statistics does agree with the sum of cnum_repl from the working data sets, and the working data sets show that the value in x$bh matches the sum of cnum_repl – anum_repl (i.e. x$bh is currently not showing any rows for buffers on the auxiliary replacement list, which, at this moment in time, means the buffers that have not yet been used at all).

      As a follow-on, the granule information shows that there are 58 granules in the default buffer cache – and 30,256 does not divide exactly by 58, so there is clearly something wrong with the way _db_block_buffers is derived.
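      If you want to repeat that cross-check in a single query, a sketch along the following lines (run as SYS, since it references x$ structures; column meanings as described above, and exact behaviour may vary by version) pulls the figures together:

```sql
-- Compare the x$bh row count with the replacement-list arithmetic from the
-- working data sets: cnum_repl = buffers in the set, anum_repl = buffers
-- still on the auxiliary list. The x$bh count should be close to the
-- difference shortly after startup.
select
        (select count(*) from x$bh)        bh_rows,
        sum(cnum_repl)                     total_buffers,
        sum(anum_repl)                     aux_buffers,
        sum(cnum_repl) - sum(anum_repl)    in_use_buffers
from    x$kcbwds
where   cnum_repl != 0;
```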

      Comment by Jonathan Lewis — September 29, 2011 @ 9:22 pm BST

      • Hi Jonathan,

        Thank you very much!
        I am amazed by the quick response considering your very tight schedule.
        I am eagerly waiting (of course with lots of doubts) to meet you in person on the 6th of October, when you will be presenting “Indexing Strategies” at NYOUG.

        Antony

        Comment by Antony — September 29, 2011 @ 11:44 pm BST

  2. Hi Jonathan,

    In Chapter 5, page 23, in Figure 5-6 (Oracle Buffer Cache):
    How could I find the granule and working list?
    What are the granule and working list in the figure?

    Many thanks

    Comment by berusadla — October 10, 2011 @ 1:25 pm BST

    • Berusadla,

      Remember that a “working data set” is a linked list that cuts across the set of granules that makes up a buffer pool (see figure 5-3), and note that I made the point just below figure 5-6:

      “If I wanted to superimpose the replacement lists on Figure 5-6, they would appear as two tangled lines weaving their way randomly through the buffer headers …”

      It would be virtually impossible to draw a readable picture of a granule or a working data set on top of figure 5-6. In effect a granule would be a random collection of buffers (or buffer headers) from that figure, and a working data set would be part of that random set, and part of another random set representing a different granule, and so on.

      The key point to remember is that ANY buffer (header) from ANY working data set, granule or buffer pool could at some time be attached to a given cache buffers chain – and just to make the point explicitly, a single cache buffers chain could hold buffers for blocks of different block sizes.
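      As a rough, hedged illustration of that scattering, you can count buffer headers per cache buffers chains latch through x$bh.hladdr (run as SYS; each latch covers many hash buckets, so these counts are only an upper bound on individual chain lengths):

```sql
-- Distribution of buffer headers per cache buffers chains latch address.
-- Long outliers in chain_buffers hint at skew across the hash table.
select  chain_buffers, count(*) latches
from    (
        select  hladdr, count(*) chain_buffers
        from    x$bh
        group by hladdr
        )
group by chain_buffers
order by chain_buffers;
```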

      Comment by Jonathan Lewis — November 13, 2011 @ 11:40 am GMT

  3. Hi Jonathan,

    Granules - a physical chunk of memory – which has an array of buffer headers and buffer pointed by the buffer headers.
    Working list (logical) -> is a set of buffer headers from multiple Granules.

    This working list (set of buffer headers) is ORGANISED as 2 lists
    1.Replacement list
    2.Auxiliary replacement list

    Up to this point it’s clear –
    You said in the ‘summary’ —
    “There is a second structure imposed on the content of the data cache, again employing
    linked lists (but very short ones), that uses the data block address of the buffered blocks to
    scatter the buffers across a very large hash table”

    1. How are the HASH table and the Working Set related?
    2. What are these linked lists? Are they nothing but the Replacement/Auxiliary Replacement lists?

    Many thanks for your time

    Comment by berusadla — October 10, 2011 @ 4:31 pm BST

    • Sorry Jonathan, I went through your Chapters 4 and 5 completely a couple of times and understand now. Please ignore the above questions, and sorry for the trouble. But may I ask you to correct my understanding (from your book) if it is wrong? A brief summary:

      1. The SGA is a collection of GRANULES (chunks of memory) – these GRANULES were introduced for the efficient movement of memory among all the areas of the SGA.
      2. The DB cache is also made up of these GRANULES, which are linked through POINTERS; each GRANULE contains an array of BUFFER HEADERS and also, separately, an array of buffers.

      3. To avoid contention and heavy traffic – to make things FASTER – the concept of the Working Data Set was introduced; this lies across the granules. Basically it is a linked list of buffer headers from multiple granules, and one DBWriter process is allocated to each WDS.

      4. Copying blocks into memory (buffers) and writing buffers to disk are done efficiently with the help of the Working Data Sets and DBWriter processes. But how do we find out whether a block is already in memory – is there a technique to FIND it? Here comes the HASHing technique: a hash algorithm and a data structure attached to the buffer linked lists, which lets us trace a block very QUICKLY if it is present in memory.

      Many thanks Jonathan for an EXCELLENT BOOK (I don’t say it is ONE OF the excellent books – I say it is the No. 1 EXCELLENT book – that’s it).

      I have already recommended this book to 5 of my friends, and they have already bought it.

      Comment by berusadla — October 12, 2011 @ 3:36 pm BST

      • Berusadla,

        I’m busy trying to catch up with all the comments from the last 6 weeks or so and didn’t realise that you had answered your own questions.

        Your understanding is basically correct, but for point 4, I wouldn’t call the hash table a “tiny” data structure. The hash table has (roughly) twice as many buckets as there are buffers, and every buffer that is in use is associated with a hash bucket; if there are several buffers associated with the same hash bucket they are connected together through a (short) linked list.
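        If you want to check the “roughly twice as many buckets as buffers” figure on your own version, here is a sketch using the same x$ksppi/x$ksppcv pattern that appears elsewhere in this thread. Hidden parameter names are version-dependent, so treat the names below as an assumption to verify:

```sql
-- Compare the number of hash buckets with the number of buffers; the
-- bucket count is typically rounded up, so expect "roughly 2x" rather
-- than an exact ratio.
select  n.ksppinm p_name, c.ksppstvl p_value
from    sys.x$ksppi n, sys.x$ksppcv c
where   n.indx = c.indx
and     n.ksppinm in ('_db_block_hash_buckets', '_db_block_buffers');
```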

        Comment by Jonathan Lewis — November 13, 2011 @ 11:51 am GMT

    • Berusadla,

      The hash table and the working data sets are not related – see my answer to your previous comment.
      The hash table is the core memory structure of the cache buffers chains – each chain is based on a single hash bucket in the hash table.
      The linked lists are the very short cache buffers chains.

      Comment by Jonathan Lewis — November 13, 2011 @ 11:43 am GMT

  4. Jonathan,

    I might have found a couple of additional items for the errata list.

    * The first sentence on page 94 states, “Oracle Corp. introduced automatic system memory management (ASMM) in 10g…” The Oracle documentation (http://docs.oracle.com/cd/E11882_01/server.112/e16638/memory.htm#i58728) and seemingly most other resources that expand the ASMM acronym indicate that ASMM is short for Automatic Shared Memory Management. Is this a typo in the book? You stated “system” rather than “shared” in an OTN thread on Sep 19, 2011, so I might be mistaken (https://forums.oracle.com/forums/thread.jspa?threadID=2285548).

    * In the Note section on page 101, the book describes how the CPU_COUNT parameter can be changed to determine the effect on memory allocations as the number of CPUs in the server changes. The note in the book states, “… however, in 11.2 you also need to set parameter _disable_cpu_check to false.” In 11.2.0.2 it appears that the _DISABLE_CPU_CHECK parameter defaults to FALSE, and (because the parameter is prefixed with DISABLE) that the note in the book should state, “… however, in 11.2 you also need to set parameter _disable_cpu_check to TRUE.”

    Comment by Charles Hooper — December 18, 2011 @ 5:43 pm GMT

    • Charles,

      Thanks, correct on both counts. I’ve added both to the errata.
      The system/shared switch is one of those little brain glitches that I manage to repeat fairly frequently. Luckily I did get it right in the glossary.

      Comment by Jonathan Lewis — December 19, 2011 @ 12:46 pm GMT

      • The majority of the book is a bit beyond my knowledge of Oracle Database internals – even if I stand on a 30-foot (9-meter) step ladder, some of the material still manages to go over my head. The book will require a second (or third) read-through for full comprehension. In short – thanks for pushing the boundaries for “how it works”.

        0 detected logic errors with a couple of accidental word substitutions in the first 130 pages is significantly better than most (maybe all) of the other books that I have read in the last couple of years.

        Comment by Charles Hooper — December 19, 2011 @ 1:15 pm GMT

  5. A couple of other very minor potential issues:

    * On page 114, the third paragraph under the heading Logical I/O states: “At this point we might be able to examine the block and drop the latch immediately. This action is reported in the instance statistics as a consistent get – examination;…”. It appears that the letter S was accidentally dropped from the word get in the statistic name (pages 44 and 51 also mention the consistent gets – examination statistic).

    * On page 115, the second paragraph under the heading Updates states: “… pin it and we might have to attach our pin to the waiters’ list and post a buffer busy wait.” This sentence is understandable, but might have been intended to state, “and post a buffer busy waits wait event.”

    * On page 118, the sentence in the middle of the page begins, “If the optimizer thought the table was larger than the 2 percent limit, then the buffers used to read the table were immediately dropped to the end of the LRU list as the blocks were read…” The previous paragraph describes how the runtime engine behaves when a table smaller than 2 percent of the buffer cache is full scanned. It appears that the word optimizer should be replaced with runtime engine.

    Comment by Charles Hooper — December 18, 2011 @ 9:18 pm GMT

    • Charles,

      Thanks again – correct on all three counts (and I found two more occurrences of the missing S in chapter 3).
      Now added to the errata.

      Comment by Jonathan Lewis — December 19, 2011 @ 12:44 pm GMT

  6. Jonathan,

    At the end of the section “Updates”
    —quote—
    So, to avoid getting too much garbage into the cache, and to keep the length of the hash chain short, Oracle will start switching the older clones on to the replacement list if you continue updating the block.
    —end of quote—

    In my understanding, the excess CR blocks should be moved from the main replacement list to the auxiliary replacement list, to mark them free for reuse, right?

    I did a little test on my 10.2.0.4 on OS X to verify: the default value of _db_block_max_cr_dba is 6. I kept updating the block 10 times, and the maximum number of CR versions on the hash chain was 5, just as you mentioned. When the current version is switched to a CR version, the TCH is increased from 1 to 2.

    drop table t purge;
    
    create table t
    as select object_id, object_name from all_objects where rownum <= 10;
    
    exec dbms_stats.gather_table_stats(user,'T');
    
    exec snap_stats.start_snap;
    exec snap_stats.start_snap;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    pause;
    update t set object_id = 65536;
    set serveroutput on
    exec snap_stats.end_snap;
    
    sys@CS10G> @rowid oe.t rownum=1;
        RFILE#     BLOCK#
    ---------- ----------
             6     141450
    
    sys@CS10G> @dba3 6 141450
    
    STATE      BLOCK_CLASS        OBJECT_TYPE         object          TCH  MODE_HELD D T P S D FLG_LRUFLG
    ---------- ------------------ ------------------- -------- ---------- ---------- - - - - - ----------
    xcur       data block         TABLE               OE.T              1          0 Y N N N N 2002001:0
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2
    cr         data block         TABLE               OE.T              2          0 N N N N N 2002000:2 
    
    6 rows selected.
    
    sys@CS10G> select n.ksppinm p_name, c.ksppstvl p_value
      2  from sys.x$ksppi n, sys.x$ksppcv c
      3  where n.indx=c.indx
      4  and lower(n.ksppinm) like lower('%&1%');
    
    NAME                                     VALUE
    ---------------------------------------- ----------------------------------------
    _db_block_max_cr_dba                     6
    
    

    Comment by Sid — December 31, 2011 @ 2:40 am GMT

    • Sid,

      Thanks for that. Yes, the text should have said: “… on to the auxiliary replacement list …”.
      As you’ve shown in your results, although the parameter suggests you can get 6 CR copies, I tend to see five CR and one current.

      Comment by Jonathan Lewis — January 3, 2012 @ 7:35 pm GMT

  7. Hi Jonathan,

    Here are some thoughts and tests about the Tablescan section.

    1. The threshold is 10 percent when a tablescan switches from (short tables) to (long tables); it’s not 2 percent. I guess _small_table_threshold just controls whether you can do a direct read on the table – it may have nothing to do with how the data is cached.

    2. When the data blocks are less than 10 percent of the buffer pool, the blocks are cached and subsequent tablescans don’t need to do physical reads. The TCH is incremented once per access at 3-second intervals. That’s as expected.

    3. When the data blocks are between 10 and 25 percent of the buffer pool, the blocks are cached and a subsequent tablescan doesn’t need to do physical reads (or does just a few). But the subsequent tablescan is not marked as a short tablescan, and the TCH is not increased as you described: “however if you repeat the tablescan while the blocks are still in the data cache, the touch count on the buffers will be incremented at that point and the tablescan will be reported as a short tablescan”.

    4. When the data blocks slightly exceed 25 percent of the buffer pool, the blocks are not cached and a subsequent tablescan needs to do physical reads. The TCH is not incremented either, just as mentioned. The result and the 25 percent threshold match the testing by Dion Cho.

    5. I tried to cheat the CBO by hacking numblks to a number exceeding the buffer pool size, or to a very small number, using dbms_stats.set_table_stats, but the runtime engine chose the right scan method according to the real number of data blocks. I guess the runtime engine checks the extent map and the high water mark in the segment header block to get the exact number of data blocks below the HWM.

    6. The test on an ASSM tablespace shows the same result, except that the threshold needs to be increased a bit to include the blocks that come from the 1st level bmb and 2nd level bmb. Automatic Shared Memory Management (ASMM) was disabled in my testing; the result with ASMM, or on different versions/platforms, may vary.

    Here comes the test: 10.2.0.4.0 on OS X 10.6, db_cache_size around 96M.

    sys@CS10G> select * from v$version;
    
    BANNER
    ----------------------------------------------------------------
    Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod
    PL/SQL Release 10.2.0.4.0 - Production
    CORE	10.2.0.4.0	Production
    TNS for MacOS X Server: Version 10.2.0.4.0 - Production
    NLSRTL Version 10.2.0.4.0 - Production
    
    
    sys@CS10G> select buffers, buffers/50 "2_percent", buffers/10 "10_percent", buffers/4 "25_percent" from v$buffer_pool;
    
       BUFFERS  2_percent 10_percent 25_percent
    ---------- ---------- ---------- ----------
         12350        247       1235     3087.5
    

    The test code creates tables big_table and T. The big table is used to fill up the buffer cache; we capture the statistics for the tablescan on T, sleep 3 seconds between tablescans on T, and see if the TCH is incremented. Finally we connect as sys and check x$bh and dba_objects for the TCH on the data blocks.

    conn oe/oe
    set serveroutput on
    drop table T purge;
    drop table big_table purge;
    
    -- test on tablespace MSSM, without ASMM
    create table T
    pctfree 90
    pctused 10 
    tablespace test_8k
    as select 
    rownum n1,
    lpad(rownum, 1000,'0') padding
    from dual
    connect by level <= 247;
    
    --connect by level <= 247;	--2%,  table scan (small tables), blocks are cached,    TCH is incremented
    --connect by level <= 1235;	--10%, table scan (long tables),  blocks are cached,    TCH is not incremented
    --connect by level <= 3190;	--25%, table scan (long tables),  blocks are not cached, TCH is not incremented
    
    create table big_table
    pctfree 90
    pctused 10 
    tablespace test_8k
    as select 
    rownum n1,
    lpad(rownum, 1000,'0') padding
    from dual
    connect by level <= 12350;
    
    exec dbms_stats.gather_table_stats(user,'T');
    exec dbms_stats.gather_table_stats(user,'big_table');
    
    --flush the buffer_cache
    alter system flush buffer_cache;
    
    --fill up the buffer_cache
    select count(*) from big_table;
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    exec dbms_lock.sleep(3);
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    exec dbms_lock.sleep(3);
    
    exec snap_stats.start_snap;
    select count(*) from T;
    exec snap_stats.end_snap;
    
    conn / as sysdba
    col dba_object head object for a5 truncate
    
    select block_class, object_type, dba_object, tch, count(*)
    from
    (select  /*+ ORDERED */
            decode(bh.class,1,'data block',2,'sort block',3,'save undo block',
                   4,'segment header',5,'save undo header',6,'free list',7,'extent map',
                   8,'1st level bmb',9,'2nd level bmb',10,'3rd level bmb', 11,'bitmap block',
                   12,'bitmap index block',13,'file header block',14,'unused',
                   15,'system undo header',16,'system undo block', 17,'undo header',
                   18,'undo block'
            ) block_class,
            o.object_type,
            o.owner||'.'||o.object_name             dba_object,
            bh.tch
    from
            x$bh            bh,
            dba_objects     o
    where
            bh.obj = o.data_object_id
    and     o.data_object_id in ( select data_object_id from all_objects where object_name='T' and owner='OE')
    )
    group by block_class, object_type, dba_object, tch
    /
    

    data blocks = 247; 2%, table scan (small tables), blocks are cached, TCH is incremented

      COUNT(*)
    ----------
           247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:52
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                         256
    physical reads                                          248
    physical reads cache                                    248
    table scans (short tables)                                1
    table scan blocks gotten                                247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:55
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                        256
    table scans (short tables)                               1
    table scan blocks gotten                               247
    
    ---------------------------------
    System stats:-        31-Dec 15:33:58
    Interval:-        0 seconds
    ---------------------------------
    Name                                                 Value
    ----                                                 -----
    consistent gets                                        256
    table scans (short tables)                               1
    table scan blocks gotten                               247
    
    BLOCK_CLASS            OBJECT_TYPE objec        TCH   COUNT(*)
    ------------------ --------------- ----- ---------- ----------
    data block                   TABLE OE.T           3        247
    segment header               TABLE OE.T           3          1
    
    
    

    data blocks = 1235; 10%, table scan (long tables), blocks are cached, TCH is not incremented

      COUNT(*)
    ----------
          1235
    
    ---------------------------------
    System stats:-        31-Dec 15:43:56
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    physical reads                                                           1,236
    physical reads cache                                                     1,236
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    ---------------------------------
    System stats:-        31-Dec 15:43:59
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    ---------------------------------
    System stats:-        31-Dec 15:44:02
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          1,244
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 1,235
    
    BLOCK_CLASS           OBJECT_TYPE objec        TCH   COUNT(*)
    ------------------ -------------- ----- ---------- ----------
    data block         TABLE          OE.T           0       1235
    segment header     TABLE          OE.T           3          1
    
    

    data blocks = 3190; 25%, table scan (long tables), blocks are not cached, TCH is not incremented

      COUNT(*)
    ----------
          3190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:41
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,201
    physical reads                                                           3,191
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:45
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,201
    physical reads                                                           3,190
    physical reads cache                                                     3,190
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    ---------------------------------
    System stats:-    31-Dec 15:48:49
    Interval:-        0 seconds
    ---------------------------------
    Name                                                                     Value
    ----                                                                     -----
    consistent gets                                                          3,204
    physical reads                                                           3,190
    physical reads cache                                                     3,190
    table scans (long tables)                                                    1
    table scan blocks gotten                                                 3,190
    
    
    BLOCK_CLASS        OBJECT_TYPE         objec        TCH   COUNT(*)
    ------------------ ------------------- ----- ---------- ----------
    data block         TABLE               OE.T           0       3084
    segment header     TABLE               OE.T           4          1
    
    

    #

    Comment by Sid — December 31, 2011 @ 8:28 am GMT

    • Sid,

      I’ve done a little cosmetic work on your output.
      Your results don’t match the ones I published a few months ago in a note on short tables: https://jonathanlewis.wordpress.com/2011/03/24/small-tables/ – but mine were from 10.2.0.3 and yours from 10.2.0.4. I had also ensured that the cache was fully populated with other objects before I started the test, and this can make a difference (it’s possible you did the same, but you didn’t mention it). There’s also the possibility of internal “rounding errors” – the boundaries may be a little way off our expectations.

      Comment by Jonathan Lewis — January 3, 2012 @ 6:50 pm GMT Jan 3,2012 | Reply

      • You are correct, I did not actually fill up the buffer pool in advance. A big full table scan on the *clean* buffer pool does not fill up the buffer pool as expected. I liked your test case and repeated it on 11.2.0.2.0; the results match. I’ve posted the results at my blog: http://sid.gd/small-tables/

        Comment by Sid — February 17, 2012 @ 12:59 am GMT Feb 17,2012 | Reply

  8. Jonathan,

    In the section Relinking a Buffer the TCH is 1, and so it is in Figure 5-5, as the buffer crosses the midpoint and becomes the target of the cold_hd pointer.

    The Note below that statement, however, says: “changes the touch count on that buffer to zero, and then modifies cold_hd to point to that buffer.” Shouldn’t the touch count be 1 here?

    This is my third time through the book, and I still enjoy it :-)

    Comment by Sid — January 7, 2012 @ 2:36 pm GMT Jan 7,2012 | Reply

    • Sid,

      I really ought to have had about 4 separate copies of that diagram so that I could show before and after images for the load and the promotion/push down. As it is, the buffer header you’re talking about needs a tch of 1 when it’s demonstrating the loading mechanism, but a touch count of zero when it’s demonstrating the promotion/pushdown effect.

      Comment by Jonathan Lewis — February 15, 2012 @ 2:17 am GMT Feb 15,2012 | Reply

      • Hi Jonathan,
        I don’t know why it would be reset to zero.
        If a buffer is forced to cross the threshold from the hot region to the cold region,
        doesn’t Oracle reset the buffer’s tch to 1 (the reset value is controlled by _db_aging_cool_count, with a default value of 1)?

        Comment by sean_kim — October 2, 2012 @ 12:58 am BST Oct 2,2012 | Reply

    • Sid,

      I’ve just been re-reading these comments and realised I misunderstood what you were saying. It is just a simple error (which I’ve now added to the Errata) in the Note at the top of page 107. It should say “changes the touch count on that buffer to one.”

      Comment by Jonathan Lewis — March 10, 2013 @ 1:32 pm GMT Mar 10,2013 | Reply

  9. Hi Jonathan,

    In page 116 of the section “Loading a Hash Chain” in Chapter 5.

    —quote—

    SQL> select     name, level#
      2  from       v$latch
      3  where      name in ('cache buffers lru chain','cache buffers chains')
      4  /
     
    NAME                               LEVEL#
    -------------------------------    ------
    cache buffers lru chain                 2
    cache buffers chains                    1
    

    The cache buffers chains latch has a lower level than the cache buffers lru chain latch, so we can’t request the cache buffers lru chain latch in willing-to-wait mode if we’re already holding the cache buffers chains latch. Think about what this means: we’re holding the cache buffers chains latch (which I will call the hash latch for the rest of this subsection) because we’ve just searched the hash chain for a buffer and discovered that, for whatever reason, we need to add another buffer to the chain. So we have to acquire the cache buffers lru chain latch (which I will call the lru latch for the rest of this subsection) to move a buffer from the REPL_AUX list to the midpoint of the REPL_MAIN list; but we can’t request it in willing-to-wait mode because we’re already holding a lower-level latch.


    But if you can’t get the lru latch with an immediate get, you have to drop the hash latch, get the lru latch, and then get the hash latch again.

    —quote—

    Tests by me and other folks show that Oracle will get the lru latch in willing-to-wait mode while holding the hash latch – the opposite of what you suggested. I have summarised the results at my blog; it’s a long post that I won’t paste here. You may want to check it out: http://sid.gd/latch-level/

    Comment by Sid — February 14, 2012 @ 1:12 pm GMT Feb 14,2012 | Reply

    • Sid,

      Thanks for that – you’re absolutely correct, I got the significance of the latch levels the wrong way round.
      If you try to get a LOWER latch level in willing to wait mode while holding a higher latch level, then your session raises an ORA-00600 error, with parameter 1 set to 504, e.g.:

      ORA-00600: internal error code, arguments: [504], [0x2D5703C4], [32], [4], [business card], [2], [0], [0x0], [], [], [], []

      The other parameters are:
      [child latch addr] [???] [level#] [name] [child#] [??] [??]

      I like the way you decided to use kslgetl, rather than poke – there was always the possibility that hacking directly into memory would miss some important operation.

      I used the “flashback copy” latch (level 5) and the “business card” latch (level 4) on a small 11.1.0.7 system to set up the crash that produced the error above, by the way.

      Comment by Jonathan Lewis — February 15, 2012 @ 1:44 am GMT Feb 15,2012 | Reply
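The corrected latch-level rule from this exchange can be sketched as a toy model (illustration only – the Session class, the LatchOrderError exception, and the check itself are my own invention, not Oracle code):

```python
# Toy model of latch-level ordering: a willing-to-wait get is legal only
# for a latch at a HIGHER level than every latch the session already
# holds; a willing-to-wait request for a lower (or equal) level raises
# the equivalent of ORA-00600 [504]. Immediate-mode gets skip the check.

class LatchOrderError(Exception):
    """Stands in for ORA-00600 [504]."""

class Session:
    def __init__(self):
        self.held = []  # (name, level) in acquisition order

    def get_latch(self, name, level, willing_to_wait=True):
        if willing_to_wait and any(level <= held_level
                                   for _, held_level in self.held):
            raise LatchOrderError(
                f"[504] willing-to-wait get of '{name}' (level {level}) "
                f"while holding a latch at the same or higher level")
        self.held.append((name, level))

    def free_latch(self, name):
        self.held = [(n, l) for n, l in self.held if n != name]

s = Session()
s.get_latch("cache buffers chains", 1)     # hash latch, level 1
s.get_latch("cache buffers lru chain", 2)  # higher level: w-t-w get is legal
s.free_latch("cache buffers lru chain")
s.free_latch("cache buffers chains")

s.get_latch("cache buffers lru chain", 2)
try:
    s.get_latch("cache buffers chains", 1)  # lower level, w-t-w: "crash"
except LatchOrderError:
    pass
```

This matches Sid’s observation: holding the level 1 hash latch does not prevent a willing-to-wait get of the level 2 lru latch; it’s the reverse order that is forbidden.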

      • Jonathan,

        I also encountered the ORA-00600 [504] when I issued a query in the session holding the lru latch.

        ORA-00600: internal error code, arguments: [504], [0x3A76C274], [8196], [1], [session idle bit], [2], [0], [0x0], [], [], [], []

        If you attempt to release a latch you are not holding, a [510] error will be triggered.

        sys@CS11GR2> oradebug peek 0x399DC524 16
        [399DC524, 399DC534) = 00000000 00002F90 00000096 00000002
        sys@CS11GR2> oradebug call kslfre 966640932
        ORA-00600: internal error code, arguments: [510], [0x399DC524], [cache buffers lru chain], [], [], [], [], [], [], [], [], []

        Comment by Sid — February 15, 2012 @ 5:22 am GMT Feb 15,2012 | Reply

  10. Hi Jonathan
    I love this book – and I love this section the most.
    However, I was wondering what happens when the Database Smart Flash Cache feature is enabled in 11g? There is now a “second tier” of buffer cache on (presumably flash) storage somewhere. Where does the COLD_HD pointer point in this scenario? I presume that it would still point to somewhere in the “first tier” of buffer cache, i.e. somewhere in the middle of the portion that remains in memory?

    Comment by Dmitri — February 14, 2012 @ 2:08 pm GMT Feb 14,2012 | Reply

    • Dmitri,

      Thanks for the comment, I’m glad you like it.

      I don’t know how Oracle is handling the database flash cache, but there is a tiny reference in chapter 6, page 147, where I have a description of x$kcbwds and the various linked list pointers that reference L2R, L2W and L2K, with a follow-up note on page 148 where I mention a new cache buffers lru chains latch and surmise that there may be a parallel set of structures in the SGA doing for the smart flash cache the things that we know about for the traditional buffer cache.

      Comment by Jonathan Lewis — February 15, 2012 @ 2:12 am GMT Feb 15,2012 | Reply

  11. Hello Jonathan,

    Thanks for this great book. I read it with great pleasure and it clarifies many open questions I had concerning Oracle.

    I have a comment regarding the discussion in chapter 5, page 95 about the reasons why Oracle choose to use two separate arrays for the buffers and buffer headers. An additional reason I believe could be driving this decision is the efficient use of hardware caches.

    With all the linked lists linking the buffer headers, there will be many times when different elements from the buffer header array will be accessed consecutively. If the cache line size of any hardware cache is sufficiently large compared to the buffer header size, it will improve the cache efficiency of this cache if the buffer headers are stored in a physically contiguous memory area.

    Also on average less TLB entries will be required to access any subset of all the buffer headers (e.g. when following a linked list) if the buffer headers are stored in a physically contiguous area (as opposed to having them spread throughout the entire granule).

    I don’t know if a granule is physically contiguous memory (do you know?). However I believe the above reason holds even if the granules are not physically contiguous. By having the buffer headers stored in a contiguous address range (of virtual memory), they will at least physically end up in a smaller number of distinct physical memory pages than if they were scattered throughout the entire (virtual) address range covering the granule.

    kind regards
    Martin

    Comment by Martin Maletinsky — March 19, 2012 @ 10:22 pm GMT Mar 19,2012 | Reply

    • Martin,

      Thanks for the observation.

      I have to say that I usually go to James Morle or Kevin Closson when I have questions that get that close to the hardware – but it certainly sounds like the sort of thing that the Oracle developers have done in other parts of the code to make extremely frequent operations a little bit faster.

      Comment by Jonathan Lewis — March 31, 2012 @ 6:39 am BST Mar 31,2012 | Reply

  12. Hello Jonathan,

    I tried to experimentally verify your claim (page 116) that consistent read copies may be created in another pool than the original block belongs to. I set up the experiment you find below, but I could not find any evidence of your claim (the consistent read copies I saw were always created in the same pool as the original block). According to your note on page 116 I would have expected the consistent read copy for table_k to be created in the recycle pool – it was however created in the keep pool (see below).

    Maybe I misunderstood something and there is a flaw in my experimental setup or in the query I use to verify the outcome – could you please tell me how I can modify my experiment to see consistent read copies being created in another pool than the original block belongs to?

    Thanks a lot for your help
    kind regards
    Martin

    Test Setup / participating sessions
    ———————————————-

    I use 3 sessions in my test:

    Session A: creates the tables, makes the blocks dirty and does some queries to show DB version and statistics of Session B

    Session B: executes the query that needs the consistent read copies
    SYSDBA SESSION: repeatedly executes the following query to view the blocks in the buffer cache

    -------------------------------------------------
    -- QUERY X$ VIEWS TO VIEW BUFFERED BLOCKS AS SYSDBA
    ----------------------------------------------------
    --
    COLUMN NAME      FORMAT a20 HEADING 'NAME'
    COLUMN BP_NAME   FORMAT a10 HEADING 'POOL_NAME'
    COLUMN STATE     FORMAT 999 HEADING 'STATE'
    --
    select count(*)
          ,substr(o.object_name, 1, 20) name
          ,pool.bp_name
          ,bh.state
          ,DECODE(bh.state,0,'FREE',1,'XCUR',2,'SCUR',
                           3,'CR',4,'READ',5,'MREC',6,'IREC')  state_n
      from x$bh          bh
          ,dba_objects   o
          ,x$kcbwds      wds
          ,x$kcbwbpd     pool
     where o.object_name in ('TEST_DF', 'TEST_K', 'TEST_R')
       and bh.obj=o.data_object_id
       and bh.set_ds = wds.addr
       and pool.indx = wds.pool_id
      group by o.object_name,pool.bp_name,bh.state
     order by name, state_n;
    
    
    Test Execution:
    ---------------------
    
    ----------------------------------------------------------------------
    -- 0. SESSION A - Show DB-Version and statistics for Session B (SID 6)
    ----------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select * from v$version;
    
    BANNER
    --------------------------------------------------------------------------------
    Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
    PL/SQL Release 11.2.0.3.0 - Production
    CORE    11.2.0.3.0      Production
    TNS for Solaris: Version 11.2.0.3.0 - Production
    NLSRTL Version 11.2.0.3.0 - Production
    
    5 rows selected.
    
    K.AF234C.AVALOQ> 
    K.AF234C.AVALOQ> 
    K.AF234C.AVALOQ> select *
      2    from v$sesstat  stat
      3        ,v$statname name
      4   where stat.sid = 6
      5     and name.name in ('CR blocks created', 'switch current to new buffer')
      6     and stat.statistic# = name.statistic#;
    
           SID STATISTIC#      VALUE STATISTIC# NAME                                                                  CLASS    STAT_ID
    ---------- ---------- ---------- ---------- ---------------------------------------------------------------- ---------- ----------
             6        146          0        146 CR blocks created                                                         8  516801181
             6        148          0        148 switch current to new buffer                                              8 1312802324
    
    2 rows selected.
    
    
    ----------------------------------------
    -- 1. SESSION A - PREPARE TEST TABLES
    ----------------------------------------
    
    create table test_df(id number(9), txt varchar2(3000)) storage(buffer_pool default);
    
    create table test_k(id number(9), txt varchar2(3000)) storage(buffer_pool keep);
    
    create table test_r(id number(9), txt varchar2(3000)) storage(buffer_pool recycle);
    
    insert into test_df values(1, lpad('x', 2000, 'x'));
    
    insert into test_k values(1, lpad('x', 2000, 'x'));
    
    insert into test_r values(1, lpad('x', 2000, 'x'));
    
    commit;
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
            16 TEST_DF              DEFAULT        1 XCUR
            16 TEST_K               KEEP           1 XCUR
            16 TEST_R               RECYCLE        1 XCUR
    
    3 rows selected.
    
    -------------------------------------------
    -- 2. SESSION A - INSERT ROWS, DO NOT COMMIT;
    -------------------------------------------
    
    insert into test_df values(2,'abc');
    
    insert into test_k values(2,'abc');
    
    insert into test_r values(2,'abc');
    
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
            16 TEST_DF              DEFAULT        1 XCUR
            16 TEST_K               KEEP           1 XCUR
            16 TEST_R               RECYCLE        1 XCUR
    
    3 rows selected.
    
    
    -------------------------------------------------------------------------------------------
    -- 3. SESSION B - QUERY TABLES (NEEDS TO CREATE CONSISTENT COPIES TO HIDE UNCOMMITTED CHANGES)
    -------------------------------------------------------------------------------------------
    
    select * from test_df;
    
    select * from test_k;
    
    select * from test_r;
    
    ----------------------------------------
    -- SYSDBA SESSION - view buffer cache
    ----------------------------------------
    
      COUNT(*) NAME                 POOL_NAME  STATE STAT
    ---------- -------------------- ---------- ----- ----
             1 TEST_DF              DEFAULT        3 CR
            16 TEST_DF              DEFAULT        1 XCUR
             1 TEST_K               KEEP           3 CR
            16 TEST_K               KEEP           1 XCUR
             1 TEST_R               RECYCLE        3 CR
            16 TEST_R               RECYCLE        1 XCUR
    
    6 rows selected.
    
    ----------------------------------------------------------------------------------------------
    -- SESSION A - QUERY STATISTICS OF SESSION B (SID 6) to confirm increase in CR blocks created
    ----------------------------------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select *
      2    from v$sesstat  stat
      3        ,v$statname name
      4   where stat.sid = 6
      5     and name.name in ('CR blocks created', 'switch current to new buffer')
      6     and stat.statistic# = name.statistic#;
    
           SID STATISTIC#      VALUE STATISTIC# NAME                                                                  CLASS    STAT_ID
    ---------- ---------- ---------- ---------- ---------------------------------------------------------------- ---------- ----------
             6        146          3        146 CR blocks created                                                         8  516801181
             6        148          0        148 switch current to new buffer                                              8 1312802324
    
    2 rows selected.
    
    K.AF234C.AVALOQ>    
    
    -------------------------------------------------------------------------
    -- SESSION B - SHOW SID
    -------------------------------------------------------------------------
    
    K.AF234C.AVALOQ> select sys_context('userenv', 'sid') from dual;           
    
    SYS_CONTEXT('USERENV','SID')
    ----------------------------------------------------------------------------------------
    6
    
    1 row selected.
    

    Comment by Martin Maletinsky — May 14, 2012 @ 8:56 am BST May 14,2012 | Reply

    • I’ve run your query on an 11.2.0.3 database with configured keep/recycle buffer pools of 5/1 GB and several tables assigned to the KEEP pool, and based on the output I see the following: CR block copies of a segment from the KEEP or DEFAULT pool may be created in the KEEP/RECYCLE or DEFAULT/RECYCLE pools respectively.

      Comment by Timur Akhmadeev — May 14, 2012 @ 12:14 pm BST May 14,2012 | Reply

      • Hello Timur,

        Thank you for your quick reply.
        You write “may be created …” – what conditions determine if the CR copies are created in the original cache or in the recycle cache?

        As you see in the above output, I used 11.2.0.3 as well. Being a developer database, mine has however much smaller caches: a 304 MB default buffer cache and 48 MB each for the keep and recycle caches. Does the behaviour change with the pool size (and if so, what is the reason for that)?

        thanks a lot
        kind regards
        Martin

        Comment by Martin Maletinsky — May 14, 2012 @ 2:18 pm BST May 14,2012 | Reply

        • Hello Martin

          You write “may be created …” – what conditions determine if the CR copies are created in the original cache or in the recycle cache?

          This is beyond my knowledge. And I don’t know how to make a reliable test for such a research.

          Comment by Timur Akhmadeev — May 14, 2012 @ 2:37 pm BST May 14,2012

      • Martin,

        Thanks for doing this experiment and posting the results. I can’t say why your results show the CR clones being created in the same buffer pool – but a couple of thoughts come to mind:

        a) the mechanism that Oracle uses for updates by tablescan is not the same as the mechanism for updates by indexed access – and, of course, the mechanisms do vary with version. Perhaps you would see a different pattern of behaviour if you did your updates by indexed access.

        b) it’s possible (though I think unlikely) that Oracle treats blocks with committed transactions that have to be reversed out differently from blocks which have only uncommitted transactions. If you ran your experiment but made Session B issue “set transaction read only” before doing the updates from session A you might see a difference.

        There’s an example of stats from a production system – and the side effects – in this article: https://jonathanlewis.wordpress.com/2006/11/21/free-buffer-waits/

        Comment by Jonathan Lewis — May 24, 2012 @ 12:18 pm BST May 24,2012 | Reply

  13. Hi Jonathan,

    You didn’t mention anything about the CACHE property of a table in your section on full tablescans. After reading this part of your book, I did some experiments that suggest that this only results in the touch count being incremented for full tablescans on large tables when the table is in the keep buffer pool! This surprised me as I could have sworn that I have seen performance improvements setting the CACHE attribute on largish tables that are frequently accessed with FTS on systems with no keep pool. Maybe I was dreaming. Also, my experiments suggest that index fast full scans always increment TCH regardless of the buffer pool or the CACHE attribute on the underlying table.

    Comment by tonyhasler — August 9, 2012 @ 11:24 am BST Aug 9,2012 | Reply

    • Tony,

      Thanks for the note.

      I think you can probably trust your memory about the behaviour – there have been so many variations on the interaction of the CACHE parameter and the various buffer pools across the last few versions of Oracle that there’s probably a specific patch release where your description is absolutely correct.

      Comment by Jonathan Lewis — August 9, 2012 @ 12:29 pm BST Aug 9,2012 | Reply

  14. I forgot to mention that my observations were on 11.2.0.2.

    Comment by tonyhasler — August 9, 2012 @ 2:21 pm BST Aug 9,2012 | Reply

  15. Hi Jonathan,
    I have a question about TCH.
    Basically, I think that when a buffer is moved to the MRU end its touch count is reset to 0;
    only if _db_aging_stay_count >= _db_aging_hot_criteria is the touch count set to the current touch count divided by two.

    So, in Figure 5-5, is tch 4 -> tch 2 right?

    Comment by sean_kim — October 2, 2012 @ 12:42 am BST Oct 2,2012 | Reply

    • Sean Kim,

      Thanks for the comment.

      I know I’ve seen at least one article saying that the TCH drops to zero when a buffer is promoted to the MRU end of the list. I also know that the tests I did shortly after reading the article (playing very carefully with a small buffer cache and watching x$bh) showed the TCH halving. I haven’t repeated the tests on a very recent version of Oracle, so I don’t know the current state of play – and I could be wrong. I do know that I was puzzled by both possibilities given the way the TCH changed to 1 as the buffer hit the midpoint anyway.

      It’s possible that Oracle corp. had been using all those hidden parameters to adjust the rate at which blocks were allowed to drop off the cache, and there are version dependencies that make a difference. Your observation about the relationship between those two parameters is an interesting insight into how convoluted the mechanisms may be. One day I may find the time to work through the same tests on the latest versions.

      Comment by Jonathan Lewis — December 31, 2012 @ 11:50 am GMT Dec 31,2012 | Reply
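Pulling together the errata for p.106/p.107 and the hidden parameters named in this thread, the various touch-count resets can be summarised in a toy model (illustration only; the function names are invented, the parameter values are the defaults quoted above, and which reset applies in which version is exactly the open question):

```python
# Toy model of the touch-count (TCH) resets discussed in this thread,
# using the default values of the hidden parameters mentioned above.

DB_AGING_STAY_COUNT = 0    # tch set when a buffer is promoted to the hot end
DB_AGING_COOL_COUNT = 1    # tch set when a buffer is demoted across the midpoint
DB_AGING_HOT_CRITERIA = 2  # tch a cold buffer needs to earn promotion

def on_load_at_midpoint(tch):
    # p.107 erratum: a buffer loaded at cold_hd gets a touch count of 1
    return 1

def on_promotion_to_hot(tch):
    # p.106 erratum: tch drops to _db_aging_stay_count (0), not tch // 2
    return DB_AGING_STAY_COUNT

def on_demotion_to_cold(tch):
    # a buffer pushed from the hot region restarts at _db_aging_cool_count
    return DB_AGING_COOL_COUNT

def eligible_for_promotion(tch):
    return tch >= DB_AGING_HOT_CRITERIA
```

The halving behaviour Sean describes would correspond to `on_promotion_to_hot` returning `tch // 2` instead, which may well be what some versions or parameter settings do.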

  16. Hi Jonathan,
    Many thanks for such an excellent book. As an Oracle newbie, I have difficulty understanding the bitmap segment space management of ASSM tablespaces you explain under “Reducing Table Contention (ASSM)” in chapter 5. Several searches I have made show the internal structure of ASSM blocks as organised in the form of a tree consisting of the segment header and bitmap block levels (you mention bitmap levels 1 and 2 on page 99), which really baffles me.

    Doing it in your inimitable style, I will be grateful for explanations taking into consideration the following:
    1. the allocation and internal structure of ASSM blocks
    2. the levels of bitmap blocks
    3. the bits used in determining free space in a data block.

    I also appreciate any links to resources that can help my better understanding on the above issues. Many thanks in advance.

    Comment by Ernest — October 18, 2012 @ 3:25 am BST Oct 18,2012 | Reply

    • Ernest,

      Your question requires a very lengthy answer, which I don’t have time to supply here. But a short note on the “tree” structure might help.

      Level 1 bitmap blocks contain a list of consecutive blocks in the segment with an approximate measure of how used they are; usage is “unformatted, 25% used, 50% used, 75% used, full”. (It’s possible that level 1 bitmap blocks also list themselves (and the other space management blocks) with a status like “system”, but I’ll have to check that.) In the most extreme case I believe a single level 1 bitmap block can list 1024 consecutive blocks in the segment.

      Level 2 bitmap blocks contain a list of level 1 bitmap blocks with a rough indication of how much free space each level 1 bitmap block is (indirectly) reporting as available.

      Level 3 bitmap blocks contain a list of level 2 bitmap blocks with a rough indication of how much free space each level 2 bitmap block is (doubly indirectly) reporting as available. The segment header block has an alternative identity as a level 3 bitmap block, and can list multiple level 3 bitmap blocks, and pointers to the next “most appropriate” level 3, 2, or 1 bitmap block to check when a session needs to search for space.

      The way in which level 3 bitmaps point to level 2 bitmaps which point to level 1 bitmaps gives you the image of the tree-like structure that is occasionally used in explanations of these segment space management blocks.

      There are a few articles on the Oak Table feed about ASSM, some of them may be of interest (some are from my own blog, of course). Otherwise I think Christian Antognini has some good stuff in his book – but a quick search on Google suggests that most of the information available is either trivial, out of date, copied from Oracle publications, wrong, or descriptions of exotic bugs.

      Here’s something I wrote on the OTN forums a few years ago that might also be of interest: https://forums.oracle.com/forums/message.jspa?messageID=3151619#3151619 though I’m not certain about the comment about linked lists of L3 blocks, and I’ve been told of an anomaly where parallel direct inserts can result in lots of L3 blocks appearing in a single (fairly small) segment.

      Comment by Jonathan Lewis — October 18, 2012 @ 9:48 am BST Oct 18,2012 | Reply
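The three-level lookup described above can be sketched as a toy model (illustration only – the data layout and the function are invented for the sketch, not Oracle’s actual block formats):

```python
# Toy model of the ASSM bitmap "tree": the segment header / L3 level
# lists L2 blocks, each L2 block lists L1 blocks, and each L1 block
# maps a run of data blocks to a coarse fullness state. A space search
# walks L3 -> L2 -> L1 -> data block.

FULLNESS = ["unformatted", "25% used", "50% used", "75% used", "full"]

def find_free_block(l3):
    """l3 is a list of L2 entries; each L2 entry is a list of L1
    entries; each L1 entry is a dict mapping data block address ->
    fullness state. Returns the first block with usable space."""
    for l2 in l3:
        for l1 in l2:
            for dba, state in l1.items():
                if state != "full":
                    return dba
    return None  # no space below the high water mark

# Two L1 blocks under one L2 block, under the (single) L3 level:
l1_a = {100: "full", 101: "full"}
l1_b = {102: "full", 103: "50% used"}
print(find_free_block([[l1_a, l1_b]]))  # -> 103
```

A real search starts from the “next most appropriate” hint in the segment header rather than always scanning from the front, which is how different sessions get spread across different L1 blocks to reduce contention.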

      • Hi Jonathan,
        Thanks for the excellent response. That, together with the sources you recommended, has helped a lot. However, I still have some dim areas I feel only you can make clear.
        1. About the segment header block: I understand that in MSSM it contains the pointers to the various freelists. If that is right, what other information does it contain? And what is its role in ASSM? Also, some sources show it as the first block of the first extent when an object is created, which is subject to change as DML activities occur on the object. In ASSM it is often depicted as the 3rd block, with the first 2 blocks as L1 and L2 bitmap blocks respectively. I would greatly appreciate your explanation of its location in both MSSM and ASSM.
        2. What is a transaction freelist? I can’t find any clear-cut explanation of it. If it is the same as the Interested Transaction List then a response to question 1 only is appreciated.

        I’m grateful for your time in spite of your tight schedule. Thanks in advance.

        Comment by Ernest — November 8, 2012 @ 10:02 am GMT Nov 8,2012 | Reply

        • Ernest,

          1) In both ASSM and freelist management (MSSM) the segment header holds (amongst other bits) the extent map for the segment – this is the list of extents that belongs to the segment and, for locally managed tablespaces, is what allows Oracle to avoid recording extents in the dictionary tables fet$ (free extents) and uet$ (used extents). The ASSM equivalent of the freelist pointers would (very roughly) be a map of the various space management bitmap blocks, with hints about which is the next most desirable 1st, 2nd, or 3rd level bitmap to examine for free space.

          2) Freelists come in three flavours: the master freelist (one per freelist group block), segment freelists (as many as specified in the object storage parameter freelists), and transaction freelists (up to a fairly large number, constrained by the block size, the number of segment freelists, and whether or not you’re using freelist group blocks: it could be as high as 307 for an 8KB block size).

          When a session makes enough free space in a block for it to be reusable (i.e. it crosses the pctused boundary) the session picks one of the transaction freelists and attaches the block to it. (I won’t go into the mess of what happens if the session rolls back.)

          When a session needs a free block, it uses a hashing function based on the process id to select which segment free list it should be using, and takes a block from the top of that list. If there are no blocks on that segment freelist the session takes 5 blocks from the master freelist and attaches them to the segment free list; if there are no blocks on the master freelist the session sweeps the transaction freelists (for committed transactions) and moves their blocks to the master freelist; if this step doesn’t give it any free blocks the session bumps the segment high water mark (which may mean adding an extent) and formats 5 new blocks that can then be attached to its segment freelist.

          Comment by Jonathan Lewis — December 31, 2012 @ 11:37 am GMT Dec 31,2012
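The search order described in that last paragraph can be sketched as a toy model (illustration only; the function and its arguments are invented for the sketch, and the hashing step that picks the segment freelist is left out):

```python
# Toy model of the free-block search order for freelist-managed (MSSM)
# segments: session's segment freelist first; if empty, take 5 blocks
# from the master freelist; if the master is empty, sweep the
# transaction freelists (committed transactions) onto the master; if
# that yields nothing, bump the high water mark and format 5 new blocks.

def get_free_block(segment_fl, master_fl, txn_fls, format_new_blocks):
    if not segment_fl:
        if not master_fl:
            # sweep committed transaction freelists onto the master
            for txn_fl in txn_fls:
                master_fl.extend(txn_fl)
                txn_fl.clear()
        if not master_fl:
            # bump the HWM (maybe adding an extent) and format 5 blocks
            master_fl.extend(format_new_blocks(5))
        # move 5 blocks from the master to the segment freelist
        segment_fl.extend(master_fl[:5])
        del master_fl[:5]
    return segment_fl.pop(0)  # take the block from the head of the list
```

For example, with an empty segment and master freelist but blocks 7 and 8 sitting on a transaction freelist, the sweep feeds the master, the segment freelist takes both, and block 7 comes off the head.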

  17. Hello Jonathan,

    To the end of page 108, when discussing the mechanisms to move buffers from the REPL_MAIN to the REPL_AUX list, you write “I would assume that a buffer will only be transferred if the reader can get the relevant cache buffer chains latch in immediate mode…”.
    Why is it necessary to get hold of the cache buffers chains latch in order to transfer a buffer from REPL_MAIN to REPL_AUX? As far as I understood, the buffer content is not changed by this operation (i.e. the buffer still holds the same block after it has been moved to REPL_AUX). I would therefore expect the buffer to remain attached to the same hash bucket, and I thought that the cache buffers chains latch’s only purpose was to protect the hash chains against concurrent addition/removal of buffers.

    thanks a lot
    kind regards
    Martin

    Comment by Martin Maletinsky — January 8, 2013 @ 10:49 pm GMT Jan 8,2013 | Reply

    • Martin,

      Your comment makes sense, and right now I can’t think of the reason that prompted me to make that comment – I should have explained it when I wrote it.

      Possibly it was the thought that the transfer shouldn’t take place if the block was pinned (because it’s in use) or – and this is the point of the immediate bit of the latch get – if another session is doing a “consistent get – examination”.

      Update (March): It’s amazing how easy it is to head off in the wrong direction because of the way a question is asked. I fooled myself into a state of confusion because Martin pointed out that “the buffer was not changed” – but the cache buffers chains latch doesn’t act to protect the BUFFER, it acts to protect the buffer HEADER. So any change to the buffer header (including the change to the repl_main / repl_aux linked list pointers) has to be protected by the cache buffers chains latch.

      Comment by Jonathan Lewis — January 12, 2013 @ 12:27 pm GMT Jan 12,2013 | Reply

  18. Hello Jonathan,

    On page 110 you write:
    “If I wanted to superimpose the replacement lists on Figure 5-6, they would appear as two tangled lines weaving their way randomly through the buffer headers, obeying only the rules that (a) every buffer header would be on one of the lines …”.

    Is that really correct? You may have buffers that are on neither of the REPL_MAIN, REPL_AUX chains (otherwise the fields CNUM_SET, CNUM_REPL and CNUM_AUX in x$kcbwds would be redundant). This can happen e.g. in case buffers are moved from the REPL_ lists to the WRITE_MAIN list as described in chapter 6, page 144ff – if I understand the mechanisms correctly, those buffers might (will?) remain on their hash chain. This would however violate the claim “every buffer header would be on one of the lines”.

    thank you
    kind regards
    Martin

    Comment by Martin Maletinsky — January 9, 2013 @ 10:06 pm GMT Jan 9,2013 | Reply

    • Martin,

      You’re correct, the statement is fundamentally wrong. It’s one of the “simplifications” I present all the way through the book, but it’s one that I should have flagged as a simplification and supplied with a forward reference to the write lists et al.

      Comment by Jonathan Lewis — January 12, 2013 @ 12:54 pm GMT Jan 12,2013 | Reply

  19. Hello Jonathan,

    When reading your comment on consistent gets – examination on page 114, two questions came into my mind. I have a hypothesis on the answer and I’d appreciate if you could confirm my hypothesis or tell me why I am wrong.
    i) What qualifies a logical read to be done at “half-price”, i.e. with only one latch get and without pinning the buffer?
    ii) How does Oracle ensure there is no concurrent modification of the buffer while it is being read with a “half-price” logical IO, i.e. not protected by a pin? Concurrent modification of the buffer while it is being read might provide the reader with an inconsistent view of the buffer content (it might see only part of the modifications made to the buffer).

    I believe the answer to the two questions is as follows, but I haven’t verified it with any tests:
    When Oracle expects that inspection of the buffer content can be completed within a short time, it does this inspection under protection of the cache buffers chains latch protecting the hash chain the buffer is currently linked on. I.e. after finding the buffer and while holding the cache buffers chains latch, Oracle checks that no exclusive pin is attached to the buffer. If this check succeeds, Oracle will inspect the content of the buffer (while still holding the cache buffers chains latch) and release the cache buffers chains latch, once the inspection of the buffer content is completed.
    A separate code path must exist to handle the case where the buffer is pinned exclusively and thus may be subject to ongoing modifications – it doesn’t seem unreasonable to me to assume that Oracle will fall back to the ordinary logical I/O in this case, i.e. attach a shared pin to the waiters list of the buffer and wait for the exclusive pin to be released.

    thank you for reviewing this statement
    kind regards
    Martin

    Comment by Martin Maletinsky — January 22, 2013 @ 8:58 pm GMT Jan 22,2013 | Reply

    • Martin,
      I think you’ve covered most of the points.

      There is also the detail that the examination is attempted only if Oracle thinks it is likely to be completed in a short time. For example, when accessing an undo block to find an undo record, every detail about where that record will be is fixed and there’s no chance Oracle will have to do the equivalent of something like a block cleanout to find the right version of the record.

      Essentially, it probably comes down to a trade-off between the time it takes to create, attach and subsequently release a pin, compared to the time it will probably take to locate and extract the correct piece of information from the block.
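      (Editorial sketch: the trade-off described above can be modelled in a few lines of illustrative Python – the class and function names are invented, and a threading.Lock merely stands in for the cache buffers chains latch; this is a conceptual model, not Oracle’s actual code.)

```python
import threading

class BufferHeader:
    def __init__(self, content):
        self.content = content
        self.pinned_exclusive = False   # True while a session holds an exclusive pin

class HashChain:
    def __init__(self):
        # a threading.Lock stands in for the cache buffers chains latch
        self.latch = threading.Lock()

def examine(chain, buf):
    """'Half-price' logical I/O: one latch get, no pin attached."""
    with chain.latch:                   # the single latch get
        if buf.pinned_exclusive:
            return None                 # must fall back to the pin protocol
        return buf.content              # inspect while still holding the latch

chain = HashChain()
buf = BufferHeader("undo record")
print(examine(chain, buf))              # examination succeeds under the latch
buf.pinned_exclusive = True
print(examine(chain, buf))              # None: fall back to an ordinary get
```

      In the model, the fallback path (returning None) is where the ordinary pin-based read – shared pin attached to the buffer, wait for the exclusive pin to be released – would take over.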

      Comment by Jonathan Lewis — January 28, 2013 @ 6:21 pm GMT Jan 28,2013 | Reply

  20. Hello Jonathan,

    On page 115 you mention that Oracle does “switch current to new buffer” and you mention this mechanism again on page 116. Could you please briefly sketch what this mechanism is about or give a link where it is documented?

    thanks a lot
    kind regards
    Martin

    Comment by Martin Maletinsky — January 22, 2013 @ 9:04 pm GMT Jan 22,2013 | Reply

    • Martin,

      I don’t think there’s a lot to be said about this mechanism.

      Basically a session may clone a CUR buffer before modifying it. The previous CUR copy is switched to CR as this takes place. The easiest way to see this is to create a table with a few hundred blocks, then update a few rows in the table by tablescan; each time you do the update you’ll see that the blocks you’ve updated are cloned before the update is applied.
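      As a purely illustrative sketch of that sequence (invented names, not Oracle code): the old CUR copy is switched to CR and a newly acquired buffer carries the change.

```python
class Buffer:
    def __init__(self, dba, state, data):
        # dba = data block address, state = CUR or CR
        self.dba, self.state, self.data = dba, state, data

# one current (CUR) copy of data block address 100 in the cache
cache = [Buffer(100, "CUR", "v1")]

def update_block(cache, dba, new_data):
    cur = next(b for b in cache if b.dba == dba and b.state == "CUR")
    cur.state = "CR"                            # old copy switched to CR
    cache.append(Buffer(dba, "CUR", new_data))  # clone becomes the new CUR

update_block(cache, 100, "v2")
print([(b.state, b.data) for b in cache])       # [('CR', 'v1'), ('CUR', 'v2')]
```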

      Comment by Jonathan Lewis — January 28, 2013 @ 6:33 pm GMT Jan 28,2013 | Reply

      • Hello Jonathan,

        Thank you for your reply. What is the benefit of this processing?
        You need to find an available buffer, link it into the appropriate buffer hash chain, do the actual cloning (memory copy), and potentially exchange the old buffer on the checkpoint queue with the new current buffer (as described in chapter 6, page 140) – i.e. quite a lot of additional work. On top of that, this makes the buffer hash chains longer (because there is an additional copy of the same data block, which might never be used again) and it requires additional space in the buffer cache.
        On the plus side I only see that other sessions that need a CR copy with a target SCN prior to the modification can access the old buffer, rather than having to create a CR copy by themselves. Is this the entire justification for the “switch current to new buffer” mechanism or are there any additional benefits?

        thank you
        kind regards
        Martin

        Comment by Martin — March 16, 2013 @ 10:49 am GMT Mar 16,2013 | Reply

  21. Hello,

    I found the note at the top of page 115 a bit short and didn’t fully understand what it was saying. When browsing the Internet to find some clarification I came across the following blog post that Jonathan has written, which helped me to fully understand the note. I just post the link here in the hope it may be useful to other readers.
    https://web.archive.org/web/20200509033755/www.jlcomp.demon.co.uk/buffer_handles.html

    kind regards
    Martin

    Comment by Martin Maletinsky — January 23, 2013 @ 7:16 pm GMT Jan 23,2013 | Reply

  22. Hello Jonathan,

    I have some questions regarding the “Tablescans” section at the end of chapter 5.

    When you describe the 8i and 9i strategy for “small tables”, you say the reads during scans of small tables were treated just as other reads but the touch count was not incremented when reading the blocks (nor during subsequent visits). This leads me to the following questions:
    i) Is a newly read buffer linked at the midpoint into REPL_MAIN (i.e. at the location pointed to by cold_hd) as with other reads?
    ii) In the “LRU/TCH” section starting on page 104 you don’t mention that the touch count is incremented for blocks that are (normally) read into the cache and linked at the midpoint into REPL_MAIN. Rather the touch count is set to 1 for these blocks. So what does it mean when you say that the touch count is not incremented for blocks read during a full scan of a small table – is the touch count set to 0?
    iii) How does Oracle ensure that the touch count of blocks read during a full scan of a small table is not incremented during subsequent visits? Is there a flag on the buffer preventing such increments or is it maybe the touch count value 0 that indicates the block was read in by a full scan of a small table (see my guess in question ii).

    iv) When you write “…the table was larger than the 2 percent limit, then the buffers used to read the table were immediately dropped to the end of the LRU list..”, does “end of the LRU list” mean REPL_AUX in this context, or the LRU end of REPL_MAIN?

    thank you
    kind regards
    Martin

    Comment by Martin Maletinsky — January 23, 2013 @ 9:50 pm GMT Jan 23,2013 | Reply

    • Martin,

      i) & ii) I had to go back and read some old test scripts to check – and they say that the tch is not set (i.e. starts at zero) in 8i and 9i for tables less than 2% of the cache size, and that the blocks were loaded to the midpoint.

      iii) There is a bitmapped FLAGS column on the buffer header, and bit 19 is named “only_sequential_read”. This is set if the table was subject to a full tablescan. I have always assumed that this is what stopped subsequent buffer accesses from incrementing the tch.

      iv) good question – it got a lot more complicated in recent versions – I suspect in the older versions when we said “dropped to the end of the LRU” we should have been saying “transferred to the repl_aux”. Have you read the investigation into “small” tables I did a couple of years ago? https://jonathanlewis.wordpress.com/2011/03/24/small-tables/
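      (Editorial sketch of the behaviour described in i)–iii): buffers read by a small-table scan start with tch 0 and carry an “only_sequential_read” flag that blocks later increments. Note that the flag preventing increments is an assumption Jonathan states he has always made, not verified behaviour; the names below are a model, not Oracle code.)

```python
class BufferHeader:
    def __init__(self, only_sequential_read=False):
        self.only_sequential_read = only_sequential_read
        # small-table scan buffers start at 0; normal reads start at 1
        self.tch = 0 if only_sequential_read else 1

    def touch(self):
        # the flag blocks touch-count increments on later visits
        if not self.only_sequential_read:
            self.tch += 1

normal = BufferHeader()
scanned = BufferHeader(only_sequential_read=True)
for _ in range(3):                  # three subsequent visits to each buffer
    normal.touch()
    scanned.touch()
print(normal.tch, scanned.tch)      # 4 0
```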

      Comment by Jonathan Lewis — January 28, 2013 @ 7:54 pm GMT Jan 28,2013 | Reply

  23. Jonathan,

    Chapter 5 is one of my favorites from your book (along with chapters 2 and 3); there are a couple of very interesting, useful (often even amazing) details in it.
    Please let me ask one additional question regarding shared vs. exclusive “cache buffers chains” latches. Surprisingly, Oracle seems to grab an exclusive latch when a buffer is pinned (and a shared CBC latch for single-get visits), and I still don’t have a satisfactory explanation for that. Why is adding a shared pin not enough?

    Comment by Pavol Babel — October 8, 2013 @ 12:16 am BST Oct 8,2013 | Reply

    • … I mean that when a buffer is pinned, an exclusive CBC latch is grabbed. I can’t find a reason why it is not enough to get a shared CBC latch and add a shared pin to the buffer header…

      Comment by Pavol Babel — October 8, 2013 @ 12:37 am BST Oct 8,2013 | Reply

    • Pavol,

      If you want to pin a buffer you’re changing the content of the chain – so you have to take it in exclusive rather than shared mode otherwise someone else could take it in shared mode and attempt to pin the same buffer at the same time, and that would allow them to mess up the doubly-linked list of buffer pins.

      Comment by Jonathan Lewis — October 11, 2013 @ 8:05 pm BST Oct 11,2013 | Reply

      • Jonathan,

        could you go into a little deeper detail please? When pinning, we need to grab a buffer handle (hopefully we have some reserved), and change the pointers US_NXT and US_PRV inside the buffer header to point to the new pin.
        So it is true that we are changing the content of the buffer chain (changing the content of a buffer header), but we are not relinking it (deleting/adding the buffer from/to the chain).

        Comment by Pavol Babel — October 11, 2013 @ 9:31 pm BST Oct 11,2013 | Reply

      • Jonathan,

        of course you can ignore my previous post ;) I think I get the point now. Although it is not possible for two sessions to get a shared latch on the same buffer chain at exactly the same time (one latch get will always complete first, since the latch waiter list has to be maintained for possible future exclusive waiters), both processes could easily have met on the buffer pin list of the same buffer header.
        In other words, the buffer pin list of each buffer header is protected only by the CBC latch, so an exclusive CBC latch is definitely needed.
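        (A hypothetical model of the point just made: because attaching a pin mutates the buffer header’s pin list, the protecting latch must be held exclusively for the duration of the update. In this Python sketch a threading.Lock plays the part of the exclusive CBC latch; all names are invented.)

```python
import threading

class BufferHeader:
    def __init__(self):
        self.pins = []                  # model of the buffer pin list

def attach_pin(latch, buf, session_id):
    # relinking the pin list must happen under an exclusive latch:
    # two shared holders could otherwise interleave their updates
    with latch:
        buf.pins.append(session_id)

latch = threading.Lock()
buf = BufferHeader()
threads = [threading.Thread(target=attach_pin, args=(latch, buf, s))
           for s in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(buf.pins))                    # 100: no lost updates under the latch
```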

        Comment by Pavol Babel — October 11, 2013 @ 9:42 pm BST Oct 11,2013 | Reply

  24. Hello Jonathan,

    I am faced with a performance problem where the database spends > 50% of its time on “latch: cache buffers chains” (with a constant P1 value, i.e. blocked on the same latch all the time).

    I noticed the section “Top Latches” in the ASH report, which reads the following

    Latch                            :"latch: cache buffers chains"
    % Latch                          : 84.91
    Blocking Sid(Inst)               : "Held Shared"
    % Activity                       : 55.93
    Max Sampled Wait secs            : 0.26
    # Waits Sampled                  : 7,241
    # Sampled Wts  1s                : 0
    
    

    The number 84.91 is also displayed in “Top User Events” for “latch: cache buffers chains” under “% Event” so I guess it is just the contribution of “latch: cache buffers chains” to the overall database activity.

    Unfortunately there is no detailed documentation for the ASH report and I am unsure how to interpret the columns “Blocking Sid(Inst)” and “%Activity”. Based on what I learnt in chapter 5 of your book “Oracle Core” (and this is why I post this question here) I believe it might mean that 55.93% of the overall database activity is spent by sessions sleeping on the wait list for the latch and needing to get the latch in exclusive mode (as the blocking SID is indicated by “Held Shared”). Could you please confirm whether my interpretation of the ASH report is correct or tell me where I am wrong?

    thank you
    kind regards
    Martin

    P.S: Just to give a more complete picture, please find the content of the “Top Event P1/P2/P3 Values” section from the same ASH report:

    Event                            : "latch: cache buffers chains"
    % Event                          : 84.91
    P1 Value, P2 Value, P3 Value     : "504404609763375632","228","0"
    % Activity                       : 84.90 
    Parameter 1                      : address
    Parameter 2                      : number 
    Parameter 3                      : tries
    
    
    

    Comment by Martin Maletinsky — September 2, 2019 @ 5:17 pm BST Sep 2,2019 | Reply

    • Martin,

      I can’t find an ASH report with any Top Latches information in my files, and I don’t recall looking at one when latching has been a problem.
      I tend to use the ASH report only for the “Activity over Time”, and usually run various queries against v$active_session_history if I want to dig deeper (I often ask clients to dump v$active_session_history into a table when there’s a live problem so that I can examine it later).

      Looking at the Top Latches – I would guess that the 84.91% is the percentage of the rows that report “latch: cache buffers chains”, and the 55.93 was the percentage of rows where there was a blocker that was also waiting, while the 84.91 vs. 84.90 in the Top Event report is letting you know that almost all the “latch: cache buffers chains” waits were for that one latch on the first sleep.

      Two things to follow up on then
      a) you have the child latch address in p1, so querying x$bh for file and block information where hladdr = {that address} will tell you which buffers are associated with that cache buffers chain. This may give you a clue about where and why the problem exists.
      b) a latch get where the competition is “Held shared” looks like an attempt to get the latch in exclusive mode – which would probably be followed by a buffer busy wait on the relevant buffers, so checking the segment statistics for buffer busy waits may give you another pointer.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — September 2, 2019 @ 6:11 pm BST Sep 2,2019 | Reply

      • Hello Jonathan,

        Thank you very much for your reply. I was also surprised to see this “Top Latches” section when I browsed through the ASH report today, so the section might have appeared in a newer Oracle release? It is mentioned here: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/tdppt/resolving-transient-performance-problems.html#GUID-C197336F-80B6-4F05-8EAD-D3948D79ACF6 but it is unfortunately the Oracle style of documentation, i.e. basically just mentioning that the section exists.

        Thank you also for the hints where to search further. Unfortunately I am analyzing a problem which occurred twice in the past (in July 2019) and I don’t have the possibility to query any v$ or x$ views for this purpose. I do however have exports of DBA_HIST_ACTIVE_SESS_HISTORY of the time span when the problem occurred, so I might try to find further information there.

        thanks again
        kind regards
        Martin

        Comment by Martin Maletinsky — September 2, 2019 @ 9:22 pm BST Sep 2,2019 | Reply

        • Martin,

          I should have been clearer with my wording. The section exists in the reports I have, but in every one of them it says something like “there is no data available for this report”.

          You may find some clues in dba_hist_active_session_history – especially since such a large fraction of the events were for a single latch – but I’m always a little cautious with the dba_hist tables because the number of samples is so much smaller and therefore statistically less trustworthy.

          Regards
          Jonathan Lewis

          Comment by Jonathan Lewis — September 2, 2019 @ 10:16 pm BST Sep 2,2019

      • Hello Jonathan,

        Thank you again for your reply.

        I have a question regarding b) in your first answer. If a session was to place a shared pin on the buffer, wouldn’t this require taking the corresponding cache buffers chains latch in exclusive mode first, without resulting in a buffer busy wait?

        I’ll look in the DBA_HIST_ACTIVE_SESSION_HISTORY. As the incident did span over a couple of hours, I hope that this should be statistically OK? Also I have AWR reports for every 15 minutes covering the entire problem period and I’ll check statistics there to see if I find anything of help.

        thank you very much for your responses and your hints.

        kind regards
        Martin

        Comment by Anonymous — September 3, 2019 @ 6:25 am BST Sep 3,2019 | Reply

  25. Hello Jonathan,

    I did intuitively assume until recently, that the cache buffer hash chains are per buffer pool, i.e. that one chain would always contain blocks from one pool only (eg. DEFAULT, KEEP or RECYCLE) and my (unfounded) assumption was further, that one cache buffers chains latch would always protect hash chains containing blocks from one pool only.

    From your answer to Berusadla I understand however that this is not the case, and a small test I did also points in this direction (I watched X$BH.HLADDR for blocks from a segment when changing the segment’s storage clause from DEFAULT to KEEP and it did not change).

    Wouldn’t it be beneficial for performance, however, to have exclusive hash chains dedicated to blocks in the KEEP pool? With a properly sized KEEP pool I would assume that these blocks would only rarely be unlinked from their hash chains after being read from disk, and thus there would be few exclusive requests for the corresponding cache buffers chains latches once the cache is filled. This would reduce contention on those latches and therefore (on average) make logical reads cheaper for blocks from the KEEP pool (which typically are blocks that you would expect to be more frequently accessed).

    Is there any obvious reason why Oracle Corp. has not implemented the buffer cache and the hash chains in this way or do you think that the possible benefit would just not outweigh the additional complexity?

    Comment by Martin Maletinsky — May 5, 2020 @ 3:32 pm BST May 5,2020 | Reply

    • Martin,

      Wouldn’t it be beneficial for performance however, to have exclusive hash chains dedicated to blocks in the KEEP pool?

      Lots of low-level details may have changed in the 9 years since I wrote the book, so any comment I make could well be incomplete or out of date. My first thought is that I’ve always pointed out that the KEEP pool is where you might put objects that you use quite frequently but not frequently enough to guarantee survival in the buffer cache. If you’ve got an object you expect to access more frequently then it’s going to survive in the default pool.

      The other key point is that you still have to worry about cloning blocks to make read-consistent copies, and that might happen far more frequently than any reading you do; similarly, to pin a block you have to guarantee exclusive access to the specific buffer header temporarily, and this might require an exclusive latch get (though possibly not if the buffer headers now have mutexes that could be used while the latch is held shared – to be investigated, it wasn’t in the book).

      Bottom line – I’m not sure that you could guarantee that the load on “KEEP-only” latches would be any different from uniformly distributed, and if you isolated KEEP objects to a subset of latches you would then have to decide at startup how many latches to assign to the KEEP pool – which you might do based on the size of the pool, but you’d then have to worry about dynamic resizing requiring dynamic reallocation of latches! You may recall that Oracle went through at least a couple of significant side-effects from introducing the KEEP and RECYCLE pools (in much the same way that introducing sub-pools in the shared pool introduced new cases of a problem that the mechanism was supposed to fix). If nothing else, your final sentence probably covers the key point – why do something complicated when you can’t be certain that it will be beneficial and when it might even make things worse (even if you managed to get all the code right the first time around).

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — May 20, 2020 @ 10:22 am BST May 20,2020 | Reply

  26. Hello Jonathan,

    Re-reading chapter 5 I started thinking about possible reasons why the touch count of a buffer is only incremented after a 3 second delay (i.e. bursts of activity during a short period will not be reflected in the touch count).
    My first thought was that the intention here was to avoid contention on latches or pins. However there seems to be no evident reason why the touch count increment should be protected by any latch or pin (in the worst case you miss an increment, but this will not change things dramatically). I found confirmation of the fact that no synchronisation device is used to protect the touch count adjustment in the following article, which states “Oracle updates a buffer’s touch count without a latch”:

    Click to access 1142482749257.pdf

    The article also gives an explanation on the 3 second wait time “This “bursty” kind of buffer activity would reek havoc in a touch count based algorithm. So reduce this problem, …”. However I don’t really see what could actually be the problem with a buffer reaching a very high touch count due to bursty activity.

    So my initial question of why the three-second interval for increments was chosen remains unanswered, and I would like to ask for your opinion. One possible explanation I can think of is that in a multi-CPU environment Oracle Corp. wants to avoid many CPUs writing to the same memory location repeatedly and simultaneously for very popular buffers, to avoid some congestion at the hardware level (memory bus, …) – but I have to admit that I reach the limit of my knowledge here and I am unable to say if this might really be the reason.

    thank you very much
    kind regards
    Martin

    Comment by Martin Maletinsky — May 7, 2020 @ 11:16 am BST May 7,2020 | Reply

  27. Martin,

    I think the comment that “Oracle updates a buffer’s touch count without a latch” may be one of those statements that is a misinterpretation of something a developer may have said in an explanation to a client-facing manager. It’s not exactly true and not quite false – if my understanding is correct, viz:

    The only time you would want to increment the touch count is when you do a GET.
    If you’re doing a GET you’re already holding the cache buffers chain latch.
    So you don’t need an extra latch get to increment the touch count, but you increment the touch count only if you are holding an appropriate latch.

    I can imagine a developer explaining that “you don’t need to acquire a special latch to increment the touch count” and this turning into “you increment the touch count without acquiring a latch”.

    As far as 3 seconds is concerned – I think the key requirement is the need for a way of saying “this block is not popular”. A touch count of 1 is a good way of saying “not popular”, but if you increment the touch count on every get then a single execution of a single, simple, one-off query could leave several blocks with touch counts of 3 or 4. Then you can’t tell the difference between a block that was touched 4 times in a millisecond and a block that was touched 4 times over the course of an hour. The three-second gap is probably fairly arbitrary, chosen as “if there’s a 3 second gap then there’s a reasonable chance that it was two separate executions”.

    I’ll point out that there are several parameters to tweak the way the touch count is affected, including:

    _db_aging_hot_criteria – touch count that makes a buffer “hot” (when it reaches the end of the LRU chain)

    _db_aging_touch_time – minimum gap between touch count increments.

    It’s quite possible that Oracle Corp. ran a few of their application benchmarks through a system with different values of these parameters to get an idea of a good set of default values.
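    As an illustrative sketch (invented names; only the 3-second rule itself comes from the discussion above), the gating might be modelled like this: repeated gets within _db_aging_touch_time seconds count as a single touch.

```python
AGING_TOUCH_TIME = 3.0   # seconds: the default _db_aging_touch_time

class BufferHeader:
    def __init__(self):
        self.tch = 0
        self.tim = float("-inf")   # time of the last tch increment

    def touch(self, now):
        # increment only if the last increment was at least 3 seconds ago
        if now - self.tim >= AGING_TOUCH_TIME:
            self.tch += 1
            self.tim = now

buf = BufferHeader()
for t in [0.0, 0.001, 0.002, 0.003]:   # a burst of 4 gets in a few milliseconds
    buf.touch(t)
print(buf.tch)                          # 1: the whole burst counts as one touch
buf.touch(10.0)                         # a separate visit much later
print(buf.tch)                          # 2
```

    The point the model makes is Jonathan’s: a burst of activity on one buffer is indistinguishable from a single visit, so the touch count measures repeated interest over time rather than raw get counts.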

    Regards
    Jonathan Lewis

    Comment by Jonathan Lewis — May 20, 2020 @ 11:02 am BST May 20,2020 | Reply

