Oracle Scratchpad

January 21, 2011

Quiz Night

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 7:04 pm BST Jan 21,2011

Here’s an execution plan pulled from v$sql_plan in 10.2.0.3 with the “gather_plan_statistics” option enabled:

----------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Starts | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   1 |  NESTED LOOPS                |        |      1 |      0 |00:00:00.01 |     608 |
|   2 |   TABLE ACCESS BY INDEX ROWID| PARENT |      1 |    200 |00:00:00.01 |       6 |
|   3 |    INDEX RANGE SCAN          | PAR_I1 |      1 |    200 |00:00:00.01 |       2 |
|   4 |   TABLE ACCESS BY INDEX ROWID| CHILD  |    200 |      0 |00:00:00.01 |     602 |
|   5 |    INDEX UNIQUE SCAN         | CHI_I1 |    200 |    200 |00:00:00.01 |     402 |
----------------------------------------------------------------------------------------

The code defining the two tables is the same:

create table &m_table_name
as
select
	rownum					id,
	trunc(dbms_random.value(1,&m_rowcount))	n1,
	lpad(rownum,10,'0')			small_vc,
	rpad('x',100)				padding
from
	big_row_source
where
	rownum <= &m_rowcount
;

The indexes par_i1 and chi_i1 are on the id column and, as you can see for chi_i1, have been created as unique indexes. Unlike the table, though, the code defining them isn’t necessarily identical.

Please state, with justification, what you think the blevel is for index chi_i1 ?

(Note – to avoid any risk of confusion, remember that height = blevel + 1; and just in case anyone thinks it makes any difference, the tables and indexes were all created in the same tablespace which is my usual setup of 1MB uniform extents, locally managed, 8K block size and freelist management.)

The quiz isn’t intended to be cunning, devious and tricksy, by the way. It’s all about observing and understanding a fairly ordinary situation. On the other hand I’m sure there are interesting variations with strange side effects that could be introduced if you really want to get clever.

Answer 23rd Jan:
Rather than writing my answer into a new post I’ve added it below – but to give new visitors a chance to work out the answer before seeing mine I’ve left a big gap before the answer.

Reading the comments I think all the interesting points have been covered. We have comments covering special treatment of root blocks in nested loop joins, the “buffer is pinned count” statistic, possible changes in 11g, the hypothesis that I had set a large pctfree for the child index,and the link between “A-rows” in one line of the plan and “Starts” in another line.

For reference, my query looked like this:

select
	par.small_vc,
	chi.small_vc
from
	parent	par,
	child	chi
where
	par.id between 301 and 500
and	chi.id = par.n1
and	chi.n1 = 0
;

The mechanics of the plan are:
Line 3 – index range scan of par_i1 (which finds ultimately 200 matching rowids).
Line 2 – for each rowid from line 3 check the matching table row (which ultimately returns 200 rows)

Then two hundred calls to:
Line 5 – index unique scan of chi_i1, which always finds it single row – hence returning 200 rowids in total
Line 4- check the matching row in the table – which always fails – hence returning zero rows in total

The critical counts that allow you to answer my question are the 402 and 602 buffer gets in lines 5 and 4. Ignoring, temporarily, the odd 2 these numbers are clearly 200 times something – which is rather nice given that we know that we are doing something 200 times. The simplest solution, of course is that the (602 – 402 =) 200 gets due to line 4 represents the visits to the 200 table blocks, leaving us with 400 (plus an odd bit) to account for.

Now consider the possibilities:

    Blevel = 0 – we would visit the only index block (which would be the root, but look like a leaf) then visit the table.
    Blevel = 1 – we would visit the root, the leaf which is the next layer down, then the table
    Blevel = 2 – we would visit the root, the branch level, the leaf, then the table.

So, at first sight we might decide the only way to get 400 buffer visits from the index is to have Blevel = 1, visit the root and the leaf. But that’s not the way it works (apart from a couple of versions which had a bug).

When running a nested loop join, Oracle “pins” the root block of the inner (second table) index, which means it attaches a structure to the “users’ list” in the buffer header that links the session’s state object directly to the buffer header. So, in my nested loop join, Oracle gets the root block once and keeps it pinnned, then gets branch and leaf blocks 200 times. The blevel on the child index is 2. (The blevel of the parent index is only 1 – I had set the pctfree on the child index to 90 to make it much larger than it needed to be.)

This “root block pin” isn’t the only pinning optimisation in the query, though. In detail, the steps are as follows – and this is something you would be table to see in the trace file if you enabled event 10200 (one of the ‘consistent reads’ traces).

    1) Get the root block of parent index and pin it
    2) Get the first relevant leaf block of the parent index and pin it
    3) Get the first relevant table block of the parent table and pin it.
    4) Get the root block of the child index and pin it
    5) Get the relevant branch block of the child index (no pin)
    6) Get the relevant leaf block of the child index (no pin)
    7) Get the table block of the child table (no pin) — row does not match final predicate and is rejected
    8) Revisit the parent index leaf block through the pin (add one to “buffer is pinned count”)
    9) Revisit the parent table block through the pin (add one to “buffer is pinned count”)
    10) Revisit the child index root block through the pin (add one to “buffer is pinned count”)
    11) Get a new child branch
    12) Get a new child leaf
    13) Get a new table block
    14) Repeat for a total of 200 cycles from step 8

There a couple of deviations from this cycle, of course. The rows I needed from parent were spread across 4 consecutive blocks in the table so, roughly every 50 rows from parent, line (9) above would become “release current pinned parent block, get new parent block and pin it”. The other little oddity that I can’t explain is that Oracle does “get” the child root block on the second visit to the index as well as the first visit – and then pins it from that moment onwards. So the counts are:

    Gets on the parent index 2 (root and leaf)
    Gets on the parent table 4 (one for each table block visited)
    Gets on the child index 402 (two on the root block, 200 on branch blocks, 200 on leaf blocks)
    Gets on the child table (200 for each row/block accessed)

Inevitably, things change – Oracle keeps getting smarter about things like ‘consistent gets – examination’, ‘buffer is pinned count’ and, in 11g, “fastpath” access to buffered blocks. This is a clue to the difference in gets that Charles Hooper recorded in 11.2.0.2 – and explains why I chose to use 10.2.0.3 with a unique index for my example. If you want to investigate other variations all it takes is snapshots of v$mystat, calls to “alter system flush buffer_cache”, and event 10200; the treedump can also be very helpful for identifying block addresses.

[Some notes which say a little more about gets and pins]

17 Comments »

  1. Jonathan,

    The BLEVEL of CHI_I1 index will be 2.

    Reading from explain plan, your query will fetch 200 rows from index PAR_I1 and reading PARENT TALBE won’t discard any of them. That’s why we have 200 reads on CHI_I1. And here is the interesting part of story beginning. If BLEVEL is 2, we have to read 3 blocks when perfroming UNIQUE SCAN (one root, one branch and one leaf). But in case of NESTED LOOPS, Oracle is clever enough not to read the ROOT BLOCK every time. It will cache it somehow (I have to admitt I’m not able to describe whole process), spending one extra consistent read. So we have 4 consistent reads on index CHI_I1 for the first row, but only 2 consistent reads on next 199. 4 + 2 * 199 = 4 + 398 = 402 (402 matches buffer gets on row 5 of your exececution plan). Then we need 1 another read from table CHILD, for every row. 402 + 200 = 602.

    The second interesting observetion is that only 2 reads were needed on index PAR_I1. As far I can tell, it is counsequeunce of your query, which is obviously not returning any rows and some clever prefeteching alghoritm. It would be really nice, if you would be able to provide some extra explanation.

    Comment by Pavol Babel — January 21, 2011 @ 9:32 pm BST Jan 21,2011 | Reply

    • Pavol,

      I was thinking on the same lines but I could not understand your following 2 statements
      If BLEVEL is 2, we have to read 3 blocks when perfroming UNIQUE SCAN (one root, one branch and one leaf) and So we have 4 consistent reads on index CHI_I1 for the first row
      Shouldn’t that be 3 consistent reads on index CHI_I1 for the first row instead? Or is the extra consistent get for reading index segment header?

      Comment by Narendra — January 22, 2011 @ 10:01 am BST Jan 22,2011 | Reply

      • Narednra,

        Oracle does not have to read index segment header in case of index unique or range scan (It think oracle is able to derive ROOT BLOCK address directly from address of segment header without inspecting segment header block).

        Now I will try to explain my statment once more time. I wrote: “It will cache it (the ROOT BLOCK) somehow (I have to admitt I’m not able to describe whole process), spending one extra consistent read .” . I think the extra consistent read is spent due ROOT BLOCK CACHING, but I’m not sure, just guessing Oracle will create new consistent read version of ROOT BLOCK in BUFFER CACHE, pinned by server process during nested loops join.

        Comment by Pavol Babel — January 22, 2011 @ 3:56 pm BST Jan 22,2011 | Reply

  2. Hi Joanathan,

    I think blevel will be 2, but the interesting thing is in the execution plan at step 5 which shows index unique scans and A-rows are 200 (i.e. instead of 1) since the index is define as unique or may be i am missing
    something

    Also my assumption is that query is joinning parent and child table and there will be range condition on parent.id which why there is index range scans i think it may be 200 range scan and for each row it will lookup in child table

    If my above assumption is right than is 200 is the cummulative total in step 5 for chi_i1?

    Thanks

    Comment by Henish — January 21, 2011 @ 9:53 pm BST Jan 21,2011 | Reply

  3. Jonathan,

    Without testing, I would suggest that the BLEVEL is 1 (I will leave the explanation for later, but basically every start should visit the root block).

    Just out of curiosity, is there an optimization for this in 11.2.0.2? My test results for the CHILD table (T2) differed from the explanation that I would have provided to explain the number of buffer visits (note that OPTIMIZER_FEATURES_ENABLE was set to 10.2.0.4 to avoid two NESTED LOOPS joins):

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST -ROWS -PREDICATE'));
     
    ----------------------------------------------------------------------------------------
    | Id  | Operation                    | Name   | Starts | A-Rows |   A-Time   | Buffers |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |        |      1 |      0 |00:00:00.01 |     215 |
    |   1 |  NESTED LOOPS                |        |      1 |      0 |00:00:00.01 |     215 |
    |   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    200 |00:00:00.01 |       6 |
    |   3 |    INDEX RANGE SCAN          | PAR_I1 |      1 |    200 |00:00:00.01 |       2 |
    |   4 |   TABLE ACCESS BY INDEX ROWID| T2     |    200 |      0 |00:00:00.01 |     209 |
    |   5 |    INDEX UNIQUE SCAN         | CHI_I1 |    200 |    200 |00:00:00.01 |       9 |
    ----------------------------------------------------------------------------------------
    

    Comment by Charles Hooper — January 21, 2011 @ 11:23 pm BST Jan 21,2011 | Reply

    • By testing on 10.2.0.5 I found that the number of consistent gets for the index access were much closer to those at the top of this blog article. However, that testing also demonstrated that the BLEVEL of the CHI_I1 index is not 1 as I suggested earlier. I still find it interesting that the number of consistent gets for the CHI_I1 index on 11.2.0.2 is significantly less than that achieved on 10.2.0.5.

      Comment by Charles Hooper — January 22, 2011 @ 3:46 am BST Jan 22,2011 | Reply

      • I think, this is effect of Buffer Pinning (related statistic “buffer is pinned count”). With reverse index, for example, Oracle must visit different leaf block in each iteration, therefore can’t get advantage of this optimisation.

        Comment by Vyacheslav Rasskazov — January 22, 2011 @ 5:11 am BST Jan 22,2011 | Reply

  4. Let’s see… We have either blevel 1 and two extra buffer gets to explain, or blevel 2 and 198 missing buffer gets. Pavol’s theory accounts nicely for 199 missing CRs, leaving only the question of why would Oracle need an extra CR to pin (or copy to private memory) the root block buffer? Unless this only happens on the second start of the unique scan.
    On the other hand, 2 extra CRs (hence a blevel 1) would be easy to explain with a concurrent change or delayed cleanouts if the “no tricks” premise didn’t stand in the way.

    Also, PAR_I1 obviously has a blevel of 1. Why would an index on the same column type need a second branch level? Either CHILD has many more rows, or a large PCTFREE was specified. So a blevel of 2 is still possible.

    Can’t wait for the solution…

    Comment by Flado — January 22, 2011 @ 2:08 pm BST Jan 22,2011 | Reply

  5. I will submit my very little oracle knowledge to critics

    We have the following information on line 5 of the explain plan

    Starts = 200, A-Rows = 200 and Buffers = 402

    The Operation 5 has been started 200 times and has produced 200 rows; which means that A-Rows/Starts = 200/200 = 1 row for each execution

    Buffers definition : The number of logical reads the operation performed during its executions = 402 almost 402/200 = 2 buffers

    Hence, the operation 5 performed 2 consistent gets for each CHI_I1 Index access; so its blevel = 0

    Best Regards

    Comment by Mohamed — January 23, 2011 @ 9:37 am BST Jan 23,2011 | Reply

  6. The answer is 2.
    I’ve added the explanation to the original posting.

    Comment by Jonathan Lewis — January 23, 2011 @ 2:15 pm BST Jan 23,2011 | Reply

  7. Jonathan,

    I see that you have already posted the answer to the quiz – very thought provoking.

    For comparison with 11.2.0.2, this is the execution plan that I see on 10.2.0.5 when the BLEVEL of both indexes is 1, as was the case on 11.2.0.2:

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST -ROWS -PREDICATE'));
     
    ----------------------------------------------------------------------------------------
    | Id  | Operation                    | Name   | Starts | A-Rows |   A-Time   | Buffers |
    ----------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |        |      1 |      0 |00:00:00.01 |     408 |
    |   1 |  NESTED LOOPS                |        |      1 |      0 |00:00:00.01 |     408 |
    |   2 |   TABLE ACCESS BY INDEX ROWID| T1     |      1 |    200 |00:00:00.01 |       6 |
    |   3 |    INDEX RANGE SCAN          | PAR_I1 |      1 |    200 |00:00:00.01 |       2 |
    |   4 |   TABLE ACCESS BY INDEX ROWID| T2     |    200 |      0 |00:00:00.01 |     402 |
    |   5 |    INDEX UNIQUE SCAN         | CHI_I1 |    200 |    200 |00:00:00.01 |     202 |
    ----------------------------------------------------------------------------------------
    

    As can be seen by the above, the number of consistent gets for the CHI_I1 was 202 rather than 402 as shown at the top of this blog article. Process of elimination would suggest that the BLEVEL must be 2 (of course process of elimination was not the desired path of finding the answer).

    With 300,000 rows in each of the tables I used the following SQL statement that resulted in the generation of the above execution plans:

    SELECT /*+ GATHER_PLAN_STATISTICS LEADING(T1) USE_NL(T2) INDEX(T1) INDEX(T2) */
      T1.ID,
      T1.N1,
      T2.ID,
      T2.N1
    FROM
      T1,
      T2
    WHERE
      T1.ID=T2.ID
      AND T1.ID BETWEEN 1 AND 200
      AND T2.N1 = 0;
    

    Vyacheslav Rasskazov’s comment jogged my memory a bit regarding what is likely happening. While waiting for your answer to this post I performed a couple of tests. Running this SQL statement before and after executing my test query:

    SELECT
      NAME,
      VALUE
    FROM
      V$MYSTAT M,
      V$STATNAME S
    WHERE
      (NAME LIKE 'consistent%'
        OR NAME LIKE 'buffer%')
      AND M.STATISTIC#=S.STATISTIC#;
     
    NAME                                        VALUE
    -------------------------------------- ----------
    consistent gets                               456  /* +408 */
    consistent gets from cache                    456  /* +408 */
    consistent gets - examination                 405  /* +402 */
    consistent gets direct                          0
    consistent changes                              0
    buffer is pinned count                        630  /* +593 */
    buffer is not pinned count                    421  /* +406 */
    

    The consistent gets count increase matches my execution plan, and the buffer is pinned count increased by 593.

    I then created a 10200 trace:

    ALTER SESSION SET TRACEFILE_IDENTIFIER='WATCH_CONSISTENT'; 
    ALTER SESSION SET EVENTS '10200 TRACE NAME CONTEXT FOREVER, LEVEL 1';
     
    SELECT /*+ GATHER_PLAN_STATISTICS LEADING(T1) USE_NL(T2) INDEX(T1) INDEX(T2) */
      T1.ID,
      T1.N1,
      T2.ID,
      T2.N1
    FROM
      T1,
      T2
    WHERE
      T1.ID=T2.ID
      AND T1.ID BETWEEN 1 AND 200
      AND T2.N1 = 0;
    

    And a simple Excel macro to process the trace file:

    Sub ReadTrace()
        Dim intFileNum As Integer             '10200 trace file
        Dim intFileNum2 As Integer            'Output file
        Dim strInput As String                'Line read from the 10200 trace file
        Dim strOutput As String               'Line to be written to the output file
        Dim strBlock(2000) As String          'Block read from the trace file
        Dim strBlockCounter(2000) As Integer  'Number of times read
        Dim intBlocks As Integer              'Total number of blocks
        Dim i As Integer                      'Loop counter
        Dim intFound As Integer               'Indicates whether or not the block was found
        
        intFileNum = FreeFile
        Open "c:\or10s_ora_4256_watch_consistent.trc" For Input As #intFileNum
        
        intFileNum2 = FreeFile
        Open "c:\watch_consistent.txt" For Output As #intFileNum2
        
        Do While Not EOF(intFileNum)
            Line Input #intFileNum, strInput
            If InStr(strInput, "Consistent read started for block") > 0 Then
                strOutput = Trim(Right(strInput, Len(strInput) - InStr(strInput, ":")))
                'Find the number of times the block was accessed
                intFound = 0
                For i = 1 To intBlocks
                    If strOutput = strBlock(i) Then
                        intFound = i
                        strBlockCounter(i) = strBlockCounter(i) + 1
                        Exit For
                    End If
                Next i
                'If the block was not found, record it
                If intFound = 0 Then
                    intBlocks = intBlocks + 1
                    intFound = intBlocks
                    strBlockCounter(intBlocks) = 1
                    strBlock(intBlocks) = strOutput
                End If
                Print #intFileNum2, strOutput; vbTab; strBlockCounter(intFound)
            End If
        Loop
        Print #intFileNum2, ""
        For i = 1 To intBlocks
            Print #intFileNum2, strBlock(i); vbTab; strBlockCounter(i)
        Next i
        Close #intFileNum
        Close #intFileNum2
    End Sub
    

    The output of the Excel macro lists every consistent block read, outputs that block number to another file, along with a running count of the number of times that block had been accessed. A summary is produced at the bottom of the output, listing the blocks accessed in order and the total number of consistent gets for that block. The summary from my test follows:

    0206e214	 1 
    0206e215	 1 
    01c0000c	 1 
    01c72d94	 2 
    01c72d95	 200 
    01c003ec	 57 
    01c0000d	 1 
    01c003ed	 57 
    01c0000e	 1 
    01c003ee	 57 
    01c0000f	 1 
    01c003ef	 29
    

    The read of blocks 0206e214 and 0206e215 were at the very start of the trace file. How are those block numbers from the 10200 trace file translated into the following from your explanation:

    1) Get the root block of parent index and pin it
    2) Get the first relevant leaf block of the parent index and pin it

    Comment by Charles Hooper — January 23, 2011 @ 3:55 pm BST Jan 23,2011 | Reply

    • Charles,

      Since you have blevel 1 on the child index you can ignore the comments about the branch block. Given that I know what my data looks like I know that you have 57 rows per table block in the parent and child tables. So your output shows:

      0206e214	   1 - get and pin parent index root
      0206e215	   1 - get and pin parent index child
      01c0000c	   1 - get and pin parent table block 1
      01c72d94	   2 - get child index root block twice
      01c72d95	 200 - get a child leaf block for 200 iterations of the loop
      01c003ec	  57 - get the first child table block 57 times
      01c0000d	   1 - get and pin the second parent table block
      01c003ed	  57 - get the second child block 57 times
      01c0000e	   1 - get and pin the third parent table block
      01c003ee	  57 - get the third child block 57 times
      01c0000f	   1 - get and pin the fourth parent table block
      01c003ef	  29 - get the fourth child block 29 times
      

      Note that: 57 + 57 + 57 + 29 = 200

      Comment by Jonathan Lewis — January 23, 2011 @ 5:51 pm BST Jan 23,2011 | Reply

    • Simplified version for UNIX:

      #!/bin/ksh
      
      INFILE=$1
      OUTFILE=$2
      
      grep "Consistent read started" $INFILE | awk -F ':' '{print $2}' | sort  | uniq -c | sort -rn > $OUTFILE
      

      Should do the same and work even on windows (powered by cygwin :) :) )

      Comment by Pavol Babel — October 6, 2011 @ 12:05 pm BST Oct 6,2011 | Reply

  8. Jonathan,

    Great job gain. I would like to two additional questions:

    1. I have never heard of 10200 event. How did you know such an event exists? But meanwhile, I have discovered $ORACLE_HOME/rdbms/mesg/oraus.msg. i didn’t know this file was providing also list of events and their names! :-)
    2. Very interesting information that Oracle is pinning blocks in the outer table, too (in 10g). If we give away condition “and chi.n1 = 0″ making your query retuning 200 rows, we will obviously have slightly more consistent gets in rows 2 and 3.

    The change of behaviour in 11g is very interesting, too.

    Comment by Pavol Babel — January 23, 2011 @ 4:21 pm BST Jan 23,2011 | Reply

    • Or in any oracle client:


      begin dbms_output.put_line(sqlerrm(-10200));end;
      /

      Here’s the output from “SQL Commands” at apex.oracle.com:

      ORA-10200: consistent read buffer status

      Statement processed.

      0.00 seconds

      Comment by Flado — January 23, 2011 @ 5:05 pm BST Jan 23,2011 | Reply

    • Pavol,

      My original source (some 15 to 20 years ago) was the oraus.msg file. From time to time you find notes on Metalink or on the internet about them.

      Another interesting little detail is how more and more “consistent gets” have gradually changed to “consistent gets – examinations” over the versions. Once upon a time it seemed to be only undo blocks, single table hash cluster blocks, and root blocks of indexes – in recent versions you can even see ordinary heap table blocks subject to examinations.

      Comment by Jonathan Lewis — January 23, 2011 @ 5:54 pm BST Jan 23,2011 | Reply

  9. [...] happened again, another blog article that forced me to stop, think, and … hey, why did Oracle Database 11.2.0.2 do something [...]

    Pingback by Watching Consistent Gets – 10200 Trace File Parser « Charles Hooper's Oracle Notes — January 24, 2011 @ 6:01 am BST Jan 24,2011 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,877 other followers