Hit Ratios (3)
[Forward to next note on Hit Ratios]
A little while ago I wrote some notes on why monitoring the buffer cache hit ratio (BCHR) was a pointless exercise. I’ve recently come across a lovely example demonstrating the point made by Connor McDonald in his ‘pick a hit ratio’ script.
After starting up Statspack on a system which had “a performance problem in the overnight runs” I picked up the following from a one-hour snapshot during the critical time-period.
Instance Efficiency Percentages (Target 100%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Buffer Nowait %: 100.00 Redo NoWait %: 100.00 Buffer Hit %: 99.40 In-memory Sort %: 99.97Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time -------------------------------------------- ------------ ----------- -------- CPU time 5,223 26.59 db file sequential read 1,761,467 4,791 24.39 db file scattered read 169,915 3,427 17.45Free Write Buffer Number of Cache Buffer Physical Physical Buffer Complete Busy P Buffers Hit % Gets Reads Writes Waits Waits Waits --- ---------- ----- ----------- ----------- ---------- ------- -------- ------ D 100,000 99.4 757,626,201 4,570,027 1,062,747 0 67 6,514 -------------------------------------------------------------CPU Elapsd Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 270,710,940 141 1,919,935.7 35.7 851.67 951.05 3004918185 {a nasty little procedure call}
Notice how “nice” the buffer cache hit ratio is.
Notice, however, that 35% of the logical I/O could be attributed to just one pl/sql procedure that (I later learned) had once been quite efficient. It turned out that the procedure was redundant and should not have been running.
So pretend that code wasn’t running, and subtract 270 million from the logical I/O count: the adjusted buffer cache hit ratio would be 99.1% rather than 99.4%.
This example highlights the flaw in the argument “we monitor the ratio over time and only look for problems if it changes significantly”. How much of a change do you want to see before you start looking for a problem?
Is 0.3% enough to make you worry? If not, why not? (It happens that a drop in the ratio actually saved 851 CPU seconds - but at some point in the past the opposite effect would have taken place - would you have looked for a problem because the ratio had gone up by 0.3%)
We come back to the same conclusion that we saw in Hit Ratios (2). Sometimes you will get false positives (and waste your time), sometimes you will not take action because a change in a ratio is insignifcant despite there being a significant change in the underlying events. Monitoring the underlying figures gives you some information, monitor the (just) ratio and you’re likely to hide the information.
Footnote: if you do a little arithmetic with the file wait times shown, you’ll get an average of 2.7 milliseconds for the single block reads and 20.1 milliseconds for the multi-block reads. So there’s probably a fair bit of help coming from the file system cache on the single-block reads, and some read-ahead help from the SAN on the multi-block reads. The figures are hinting at a problem with inefficient SQL rather than an inherent problem with the hardware or configuration.
[Back to previous note on Hit Ratios] [Forward to next note on Hit Ratios]
Excellent!
Comment by Mirjana — September 21, 2007 @ 10:00 am UTC Sep 21,2007
[...] Hit Ratios (4) Filed under: Troubleshooting — Jonathan Lewis @ 9:23 pm UTC Sep 26,2007 [Back to previous note on Hit Ratios] [...]
Pingback by Hit Ratios (4) « Oracle Scratchpad — May 10, 2008 @ 10:43 am UTC May 10,2008