Oracle Scratchpad

September 5, 2007

Hit ratios (2)

Filed under: Performance,Ratios,Troubleshooting — Jonathan Lewis @ 8:33 pm BST Sep 5,2007

I thought that most people had managed to get over the buffer cache hit ratio (BCHR) as an aid to monitoring an Oracle system, so I was a little surprised to see a recent question on comp.databases.oracle.server about how to calculate it “properly” turning into a rather long thread – about 120 posts to date.

The consensus was that there was no meaningful way in which the BCHR could be used to monitor an Oracle system.

There was, however, one staunch defender of the BCHR. I have extracted all the arguments he presented over the course of the thread – slightly re-worded for compactness – and thought I would review his case. This, in summary, is what I believe he had to say:

  1. In practice, you will observe it over a period of time.
  2. Why is BCHR meaningless? The answer should be short and simple.
  3. Try asking yourself the same questions about any other indicators you consider meaningful.
  4. Given everything else being equal, high BCHR is always better than low BCHR.
  5. BCHR alone does not tell you about overall performance. It simply tells you the percentage of reads from disk. It is an indicator, a very meaningful one.

1) In practice you will observe it over a period of time.

I’ve put this observation first because it’s important to know how a metric is being used before you can decide whether or not it is likely to be useful.

Unfortunately all we were told was that the value should be observed over time, with no further comment about what you might be looking for and what would prompt you to take any further action. But see point (2).

2) Why is BCHR meaningless? The answer should be short and simple.

The BCHR is meaningless because (a) if you make your code more efficient the BCHR may go up, go down, or stay the same; (b) if you make your code less efficient the BCHR may go up, go down, or stay the same; (c) if you don’t touch a single line of your code, a small change in data – with no change in execution plan – may make the BCHR go up or down (or stay the same).

So whatever you see when you look at the BCHR, whether it is smaller, the same, or bigger than last time, you have to look elsewhere to interpret the value and see if there is any reason for taking action. This means the BCHR is redundant – which makes it a meaningless indicator for monitoring purposes.

For a more wordy answer to this question, you could look at some material I published a few years ago (the article for the IOUG is the more serious one, the UKOUG one makes a serious point in a less than serious way).

3) Try asking yourself the same questions about any other indicators you consider meaningful.

This is a good point – you may discover that some of your other metrics are meaningless if you review them all. 

The suggestion, however, was made in response to someone quoting Connor McDonald’s “pick a hit ratio” script and was intended to discount Connor’s point by focusing on the fact that you can distort any metric if you try hard enough.

It’s true: every metric that is a ratio suffers from exactly the same problem – but Connor’s point wasn’t about using dirty tricks to manipulate the metrics, it was about how some unlucky piece of code could make the metric completely useless.

If you want a real-life example of code that could “randomly” produce massive numbers of buffer visits and distort your BCHR in a “real” system, you need look no further than this old article of mine.

4) Given everything else being equal, high BCHR is always better than low BCHR.

Viewed literally this statement contradicts itself. If you re-run a fixed batch of tasks and ensure that everything else really is “being equal” then the BCHR can’t change, because the BCHR is derived (in most versions of the formula) from three other things – and if they don’t change the BCHR can’t change.
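For reference, one common version of the formula derives the BCHR from three statistics: “db block gets”, “consistent gets” and “physical reads” (the conventional v$sysstat names – the thread itself didn’t quote a formula, so treat this as the usual textbook version rather than the defender’s own). A minimal sketch of the arithmetic:

```python
def bchr(db_block_gets, consistent_gets, physical_reads):
    # Classic formula: the fraction of block visits that did NOT
    # require a disk read, expressed as a percentage.
    logical_reads = db_block_gets + consistent_gets
    return 100.0 * (1 - physical_reads / logical_reads)

# If none of the three inputs changes, the ratio cannot change:
print(bchr(db_block_gets=10_000, consistent_gets=90_000, physical_reads=5_000))
# 95% - 5,000 of 100,000 block visits went to disk
```

The contradiction is visible in the code: the ratio is a pure function of the three counters, so “everything else being equal” leaves it no room to move at all.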

So you have to ask how much variation you are going to allow in your environment or task in order to do “the same thing” but still see some change in the BCHR that you can judge to be better or worse. Which takes us back to point (2) – the workload and performance could be better, worse, or unchanged regardless of what happens to the BCHR.

Example 1 – in your snapshot period some job hits a problem, rolls back, and restarts. Rolling back may very well produce double the logical I/O of the original task – because of the way undo records are acquired and marked during rollback. Moreover, the data that you accessed to make the original change may have been read from disc initially but still be in memory when you get to the rollback. When the job restarts, the required data is, again, much more likely to be in memory. So failing, rolling back and restarting can increase your logical I/O with no increase in the disc reads – which means your BCHR goes up while your job consumes more resources and slows down. (Don’t forget that rolling back typically produces a large fraction of the original volume of redo as well).
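The arithmetic of the fail/rollback/restart scenario is easy to sketch. The figures below are invented for illustration, not measured from any real system:

```python
def bchr(logical_reads, physical_reads):
    return 100.0 * (1 - physical_reads / logical_reads)

# Original (clean) run: 100,000 logical I/Os, 20,000 of them from disk.
original = bchr(100_000, 20_000)   # 80% hit ratio

# Failed attempt + rollback (roughly doubling the logical I/O) + restart,
# with almost everything already cached by the first attempt: about three
# times the logical I/O but barely any extra disk reads.
rerun = bchr(300_000, 22_000)      # about 92.7%

print(original, rerun)
# The ratio has "improved" while the job did three times the work.
```

The direction of the change is the whole point: more resource consumption, slower job, higher BCHR.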

Example 2 – a big update in one stream that normally finishes just before a big query in another stream takes a few seconds longer than usual and the two overlap. Instead of doing the usual little bit of logical I/O with the undo segment header to sort out delayed block cleanout on some of the updated blocks, the query in the second stream now does a huge amount of work acquiring undo records from the relevant undo segment so that it can make the updated (but not committed in time) blocks read-consistent. The BCHR probably goes up “nicely”, but the workload increases and performance goes down. (The actual change in the BCHR depends on the exact nature of the update and the query – it might go up, go down, or stay the same: the only thing you can be sure of is that the workload will increase whichever way it moves).

5) BCHR alone does not tell you about overall performance. It simply tells you the percentage of reads from disk. It is an indicator, a very meaningful one.

So the BCHR alone does not tell you about performance. Correct, it doesn’t – but if it’s “the BCHR plus X, Y and Z” telling you about performance, then it’s really “X, Y, and Z” telling you about performance. The BCHR is redundant, and may actually persuade you that you don’t need to look at X, Y and Z.

Of course, if you do monitor the BCHR over time, you may well decide that a significant change in the BCHR is worth investigating and it may make you look at X, Y and Z and take appropriate action. Remember though, whilst this is better than not monitoring anything at all, what you’ve done was not a consequence of the BCHR being a meaningful indicator. Just think about:

  • the number of times you looked at X, Y, and Z when you didn’t need to – because the BCHR told you to.
  • the number of times you didn’t look at X, Y, and Z when you should have (not that you’ll know how many times that was, of course, because the BCHR wasn’t telling you to).

I suppose you could play the semantic game – the BCHR is meaningful because it has a definition, so it means what its definition defines it to mean. But I think in ordinary conversation when we ask about “how meaningful” something is, we are asking for meaning beyond the simple definition.

But think about what the BCHR is apparently trying to tell us (and technically that should be more like “100 – percentage of blocks that had to be loaded into Oracle’s cache before the data could be used”). Why is that a meaningful indicator by itself? Remember the example I reported a little while ago where the performance impact was dependent not only on the number of “physical reads” but also – and, in fact, more significantly – on the disc response time.
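The dependence on response time rather than read count is worth putting in numbers (invented, purely illustrative values): two systems can report an identical BCHR while one spends an order of magnitude longer waiting for disc.

```python
physical_reads = 10_000

# Same read count, very different average read times
# (e.g. reads satisfied from an array cache vs. a busy spindle).
for avg_read_ms in (1, 12):
    io_wait_s = physical_reads * avg_read_ms / 1000
    print(f"{physical_reads:,} reads at {avg_read_ms} ms each: "
          f"{io_wait_s:.0f} s of I/O wait")

# The BCHR is identical in both cases; only the wait times differ.
```

With one millisecond per read that is 10 seconds of I/O wait; with twelve milliseconds it is 120 seconds – and the hit ratio cannot distinguish the two.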

Conclusion

Ratios are highly suspect for monitoring purposes. If you think a ratio is helpful, think carefully about whether you should be monitoring the two underlying values instead.
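As a toy illustration of why the two underlying values beat the ratio (numbers invented): in the two snapshot intervals below the reported BCHR is identical, yet the second interval does twice the disc I/O.

```python
# Two snapshot intervals in which both underlying counters double:
intervals = [
    {"logical_reads": 100_000, "physical_reads": 5_000},
    {"logical_reads": 200_000, "physical_reads": 10_000},
]

ratios = []
for snap in intervals:
    ratio = 100.0 * (1 - snap["physical_reads"] / snap["logical_reads"])
    ratios.append(ratio)
    print(f"BCHR {ratio:.1f}%, physical reads {snap['physical_reads']:,}")

# Both intervals report 95.0%; only the raw counters reveal that the
# disc workload doubled between them.
```

Monitoring the two counters would have flagged the change; monitoring their ratio hides it completely.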

The buffer cache hit ratio (BCHR) is a particularly bad example of the genre as there are so many events that make the attempt to correlate BCHR and performance meaningless.

[Further reading on ratios]

11 Comments »

  1. As a practical example I will provide the following: I’ve been working on Oracle Applications, trying to optimize large “specific” batches from day to day, essentially by focusing on lowering excessive logical I/Os. Within 6 months, thanks to all these efforts, overall response time was divided by 2, and my BCHR dropped from 90% to 70% because of the increase in disk reads. Did my system work slower? No, twice as fast as previously, despite the BCHR decreasing.

    Comment by Olivier Bernhard — September 10, 2007 @ 9:49 pm BST Sep 10,2007 | Reply

  2. Highly instructive Jonathan, thanks.
    One day we all might see the curvature of the horizon ;o)

    Comment by SeánMacGC — September 11, 2007 @ 1:54 pm BST Sep 11,2007 | Reply

  3. Excellent!

    Comment by Mirjana — September 12, 2007 @ 8:25 am BST Sep 12,2007 | Reply

  4. [...] For more information about buffer pool : https://jonathanlewis.wordpress.com/2007/09/05/hit-ratios-2/ [...]

    Pingback by Good Buffer Distribution « Oracle Logbook — September 14, 2007 @ 3:30 pm BST Sep 14,2007 | Reply

  5. [...] For more information about buffer pool : https://jonathanlewis.wordpress.com/2007/09/05/hit-ratios-2/ [...]

    Pingback by Objects to put in keep buffer « Oracle Logbook — September 18, 2007 @ 7:41 pm BST Sep 18,2007 | Reply

  6. [...] Statspack, Troubleshooting — Jonathan Lewis @ 11:56 am UTC Sep 20,2007  A little while ago I wrote some notes on why monitoring the buffer cache hit ratio (BCHR) was a pointless exercise. I’ve recently [...]

    Pingback by Hit Ratios (3) « Oracle Scratchpad — September 20, 2007 @ 11:56 am BST Sep 20,2007 | Reply

  7. The newsgroup’s defender of the BCHR seems to believe that the list of arguments I presented above was randomly selected with ulterior motives, rather than being a careful summary of the points he made in his posts.

    However, when I asked him to supply a copy of any parts of his argument that he felt I had unfairly omitted he declined to do so.

    So, dear reader, please feel free to examine every post made on this thread by Bob Jones, and let me know if you can find any part of his argument that I have failed to address.

    Comment by Jonathan Lewis — September 22, 2007 @ 1:40 pm BST Sep 22,2007 | Reply

  8. Despite the BCHR being of limited value, I still find it useful to build a picture of the workload of the databases I monitor.
    Recording the database BCHR over time has alerted me when something has changed, e.g. a new job that runs frequently without the required indexes. Sure, there are other statistics that could also have been used to detect changes, and the database BCHR doesn’t always alert me to a problem. (Response time / wait event pattern changes are my main method).
    But… to extend the meaning of BCHR a little:
    I use a BCHR at the session level to separate OLTP users from batch jobs.
    I also use BCHR at the SQL and segment levels to provide a hint as to whether full scans or single block gets are used, or how well objects are cached, etc. For example, a small table that accounts for a large percentage of the database’s physical reads and has a low BCHR could be a candidate for the keep buffer pool. (Assuming I can’t change the code or add an index, which is often the case with off-the-shelf applications).
    I do use a lot of other stats too, but the ratio between logical and physical I/Os at various levels help me to understand what is happening in a database.

    Comment by Ari — October 10, 2007 @ 3:11 am BST Oct 10,2007 | Reply

  9. [...] Debunking the Buffer Cache Hit Ratio Jonathan Lewis: 1 Buffer Cache Hit Ratio: 0 [...]

    Pingback by Debunking the Buffer Cache Hit Ratio « die Seilerwerks — October 31, 2007 @ 8:02 pm BST Oct 31,2007 | Reply

  10. [...] have also been surprised when an interview candidate mentioned the buffer cache hit ratio in response to a question regarding a good way of investigating users complaining about performance [...]

    Pingback by Advertisement: Nominet are recruiting « jarneil — January 15, 2008 @ 9:05 am BST Jan 15,2008 | Reply

  11. [...] https://jonathanlewis.wordpress.com/2007/09/05/hit-ratios-2/  (a very detailed blog entry) “Ratios are highly suspect for monitoring purposes. If you think a ratio is helpful, think carefully about whether you should be monitoring the two underlying values instead. [...]

    Pingback by Faulty Quotes 4 – Buffer Cache Hit Ratio (BCHR) « Charles Hooper's Oracle Notes — December 22, 2009 @ 2:47 pm BST Dec 22,2009 | Reply

