I thought that most people had managed to get over the buffer cache hit ratio (BCHR) as an aid to monitoring an Oracle system, so I was a little surprised to see a recent question on comp.databases.oracle.server about how to calculate it “properly” turning into a rather long thread – about 120 posts to date.
The consensus was that there was no meaningful way in which the BCHR could be used to monitor an Oracle system.
There was, however, one staunch defender of the BCHR. I have extracted all the arguments he presented over the course of the thread – slightly re-worded for compactness – and thought I would review his case. This, in summary, is what I believe he had to say:
- In practice, you will observe it over a period of time.
- Why is BHCR meaningless? The answer should be short and simple.
- Try asking yourself the same questions about any other indicators you consider meaningful.
- Given everything else being equal, high BCHR is always better than low BHCR.
- BCHR alone does not tell you about overall performance. It simply tells you the percentage of reads from disk. It is an indicator, a very meaningful one.
1) In practice you will observe it over a period of time.
I’ve put this observation first because it’s important to know how a metric is being used before you can decide whether or not it is likely to be useful.
Unfortunately all we were told was that the value should be observed over time, with no further comment about what you might be looking for and what would prompt you to take any further action. But see point (2).
2) Why is BHCR meaningless? The answer should be short and simple.
The BCHR is meaningless because (a) if you make your code more efficient the BCHR may go up, go down, or stay the same; (b) if you make your code less efficient the BCHR may go up, go down, or stay the same; (c) if you don’t touch a single line of your code, a small change in data – with no change in execution plan – may make the BCHR go up or down (or stay the same).
So whatever you see when you look at the BCHR, whether it is smaller, the same, or bigger than last time, you have to look elsewhere to interpret the value and see if there is any reason for taking action. This means the BCHR is redundant – which makes it a meaningless indicator for monitoring purposes.
For a more wordy answer to this question, you could look at some material I published a few years ago (the article for the IOUG is the more serious one, the UKOUG one makes a serious point in a less than serious way).
3) Try asking yourself the same questions about any other indicators you consider meaningful.
This is a good point – you may discover that some of your other metrics are meaningless if you review them all.
The suggestion, however, was made in response to someone quoting Connor McDonald’s “pick a hit ratio” script and was intended to discount Connor’s point by focusing on the fact that you can distort any metric if you try hard enough.
It’s true: every metric that is a ratio suffers from exactly the same problem – but Connor’s point wasn’t about using dirty tricks to manipulate the metrics, it was about how some unlucky piece of code could make the metric completely useless.
If you want a real-life example of code that could “randomly” produce massive numbers of buffer visits and distort your BCHR in a “real” system, you need look no further than this old article of mine.
4) Given everything else being equal, high BCHR is always better than low BHCR.
Viewed literally this statement contradicts itself. If you re-run a fixed batch of tasks and ensure that everything else really is “being equal” then the BCHR can’t change because the BCHR id derived (in most versions of the formula) from three other things – and if they don’t change the BCHR can’t change.
So you have to ask how much variation you are going to allow in your environment or task in order to do “the same thing” but still see some change in the BCHR that you can judge to be better or worse. Which takes us back to point (2) – the workload and performance could be better, worse, or unchanged regardless of what happens to the BCHR.
Example 1 - in your snapshot period some job hits a problem, rolls back, and restarts. Rolling back may very well produce double the logical I/O of the original task – because of the way undo records are acquired and marked during rollback. Moreoever, the data that you accessed to make the original change may have been read from disc initially but still be in memory when you get to the rollback. When the job restarts, the required data is, again, much more likely to be in memory. So failing, rolling back and restarting can increase your logical I/O with no increase in the disc reads – which means your BCHR goes up while your job consumes more resources and slows down. (Don’t forget that rolling back typically produces a large fraction of the original volume of redo as well).
Example 2 – a big update in one stream that normally finishes just before a big query in another stream takes a few seconds longer than usual and the two overlap. Instead of doing the usual little bit of logical I/O with the undo segment header to sort out delayed block cleanout on some of the updated blocks, the query in the second stream now does a huge amount of work acquring undo records from the relevant undo segment so that it can make the updated (but not committed in time) blocks read-consistent. The BCHR probably goes up “nicely”, but the work load increases and performance goes down. (The actual change in the BCHR depends on the exact nature of the update and the query – it might go up, go down, or stay the same: the only thing you can be sure of is that the workload will increase whichever way it moves).
5) BCHR alone does not tell you about overall performance. It simply tells you the percentage of reads from disk. It is an indicator, a very meaningful one.
So the BCHR alone does not tell you about performance. Correct, it doesn’t – but if it’s “the BCHR plus X, Y and Z” telling you about performance, then it’s really “X, Y, and Z” telling you about performance. The BCHR is redundant, and may actually persuade you that you don’t need to look at X, Y and Z.
Of course, if you do monitor the BCHR over time, you may well decide that a significant change in the BCHR is worth investigating and it may make you look at X, Y and Z and take appropriate action. Remember though, whilst this is better than not monitoring anything at all, what you’ve done was not a consequence of the BCHR being a meaningful indicator. Just think about:
- the number of times you looked at X, Y, and Z when you didn’t need to – because the BCHR told you to.
- the number of times you didn’t look at X, Y, and Z when you should have (not that you’ll know how many times that was, of course, because the BCHR wasn’t telling you to).
I suppose you could play the semantic game – the BCHR is meaningful because it has a definition, so it means what its definition defines it to mean. But I think in ordinary conversation when we ask about “how meaningful” something is, we are asking for meaning beyond the simple definition.
But think about what the BCHR is apparently trying to tell us (and technically that should be be more like “100 – percentage of blocks that had to be loaded into Oracle’s cache before the data could be used.” Why is that a meaningful indicator by itself ? Remember the example I reported a little while ago where the performance impact was dependent not only on the number of “physical reads” but also – and, in fact, more signifcantly – on the disc response time.
Ratios are highly suspect for monitoring purposes. If you think a ratio is helpful, think carefully about whether you should be monitoring the two underlying values instead.
The buffer cache hit ratio (BCHR) is a particularly bad example of the genre as there are so many events that make the attempt to correlate BCHR and performance meaningless.