Oracle Scratchpad

September 20, 2007

Hit Ratios (3)

Filed under: Statspack, Troubleshooting — Jonathan Lewis @ 11:56 am UTC Sep 20,2007

[Forward to next note on Hit Ratios]

 A little while ago I wrote some notes on why monitoring the buffer cache hit ratio (BCHR) was a pointless exercise. I’ve recently come across a lovely example demonstrating the point made by Connor McDonald in his ‘pick a hit ratio’ script.

After starting up Statspack on a system which had “a performance problem in the overnight runs” I picked up the following from a one-hour snapshot during the critical time-period.


Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:  100.00       Redo NoWait %:    100.00
            Buffer  Hit   %:   99.40    In-memory Sort %:     99.97                


Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time                                                        5,223    26.59
db file sequential read                         1,761,467       4,791    24.39
db file scattered read                            169,915       3,427    17.45               


                                                           Free    Write  Buffer
     Number of Cache      Buffer    Physical   Physical  Buffer Complete    Busy
P      Buffers Hit %        Gets       Reads     Writes   Waits    Waits   Waits
--- ---------- ----- ----------- ----------- ---------- ------- --------  ------
D      100,000  99.4 757,626,201   4,570,027  1,062,747       0       67   6,514
          -------------------------------------------------------------                

 
                                                    CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    270,710,940          141    1,919,935.7   35.7   851.67    951.05 3004918185
{a nasty little procedure call}

Notice how “nice” the buffer cache hit ratio is.

Notice, however, that 35% of the logical I/O could be attributed to just one pl/sql procedure that (I later learned) had once been quite efficient. It turned out that the procedure was redundant and should not have been running.

So pretend that code wasn’t running, and subtract 270 million from the logical I/O count: the adjusted buffer cache hit ratio would be 99.1% rather than 99.4%. 

This example highlights the flaw in the argument “we monitor the ratio over time and only look for problems if it changes significantly”.  How much of a change do you want to see before you start looking for a problem? 

Is 0.3% enough to make you worry? If not, why not?  (It happens that a drop in the ratio actually saved 851 CPU seconds - but at some point in the past the opposite effect would have taken place - would you have looked for a problem because the ratio had gone up by 0.3%) 

We come back to the same conclusion that we saw in Hit Ratios (2). Sometimes you will get false positives (and waste your time), sometimes you will not take action because a change in a ratio is insignifcant despite there being a significant change in the underlying events. Monitoring the underlying figures gives you some information, monitor the (just) ratio and you’re likely to hide the information.

Footnote: if you do a little arithmetic with the file wait times shown, you’ll get an average of 2.7 milliseconds for the single block reads and 20.1 milliseconds for the multi-block reads. So there’s probably a fair bit of help coming from the file system cache on the single-block reads, and some read-ahead help from the SAN on the multi-block reads. The figures are hinting at a problem with inefficient SQL rather than an inherent problem with the hardware or configuration.

[Back to previous note on Hit Ratios] [Forward to next note on Hit Ratios]

 

2 Comments »

  1. Excellent!

    Comment by Mirjana — September 21, 2007 @ 10:00 am UTC Sep 21,2007

  2. [...] Hit Ratios (4) Filed under: Troubleshooting — Jonathan Lewis @ 9:23 pm UTC Sep 26,2007 [Back to previous note on Hit Ratios] [...]

    Pingback by Hit Ratios (4) « Oracle Scratchpad — May 10, 2008 @ 10:43 am UTC May 10,2008

RSS feed for comments on this post. TrackBack URI

Leave a comment

Blog at WordPress.com.