Here’s an extract of an AWR (automatic workload repository) snapshot published some time ago on the Internet, along with the text describing why it’s worth seeing. The extract comes from an article by Don Burleson:
Here is an an example of an Oracle 10g database with an undersized log buffer, in this example 512k:Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) DB Time Wait Class log file parallel write 9,670 291 55.67 System I/O log file sync 9,293 278 53.12 Commit CPU time 225 43.12 db file parallel write 4,922 201 38.53 System I/O control file parallel write 1,282 65 12.42 System I/O
I have a problem with this data set, though: there’s nothing in the data to point to the explanation – is it intended to give us a clue of what to look for when the log buffer is too small, and if that’s not the intention why bother printing it. (And there’s nothing in any of the surrounding text to explain how you could discover that the problem was the log buffer).
The other problem with this data set is that there is no scope information – how many CPU’s are there on the box, and how long was the snapshot.
Of course, the fact that the “log buffer space” wait doesn’t appear in the top 5 might make you suspect that it’s not really the most important problem with this database; and the fact that the author didn’t print the “log buffer space” wait line from the next section of the AWR report is rather irritating.
So what could you glean from this extract ? Can you find anything in it that tells you that the log buffer might be too small ?
We are told that the buffer is 512KB – if we assume that the DBA hasn’t configured this deliberately and that this is the default log buffer size, then we have one to four CPUs. That doesn’t give us much extra information though.
We can see that apart from the CPU time most of the reported time is for background processes; only the log file sync time is a foreground wait, and conveniently it happens to be similar to the log file parallel write time – which tends to suggest that this is a low-concurrency system (When we do a log file write there seems to be, on average, just one process waiting for that write to complete and it waits about the same amount of time as the write takes to complete). This still doesn’t give us much to go on.
We note that the average times for writes are:
- log file parallel write – 30.1 m/s
- db file parallel write – 40.8 m/s
- control file parallel write – 50.7 m/s
These figures are not particularly good – so if there are complaints about performance we have an indication that slow disk devices are our most obvious root cause.
Of course, log buffer space waits usually need systems with some degree of concurrent activity – if I’ve issued a commit and made the log writer write I’ll be waiting on a log file sync, so someone else has to fill the log buffer to get a log buffer space wait.
There is one exception to this – if I keep generating fairly large redo records then the log writer could be triggered into writing by one of the “non-commit” events … such as seeing the log buffer one-third full. That could explain why I have more log file parallel write waits than log file sync waits – and why my sync waits total 13 seconds less than my write waits. It would have been so nice to see the stats about the log buffer space waits following a claim that we were looking at a database with an undersized log buffer.
No matter – maybe there are 13 seconds of log buffer space waits. What’s the problem with this machine. We spent 278 seconds waiting on log file sync waits which were dependent on slow log file parallel writes. If we increased the log buffer size we would probably spend an extra 13 seconds in log file sync waiting for slow log file parallel writes. This machine’s problem is probably slow disks.
Of course I wouldn’t normally diagnose a performance problem from just the “Top 5″ events – but if that’s all that’s on offer and it doesn’t seem to corroborate the offered diagnosis then you have to ask for more information. I’d want to check the load profile (how much work was actually happening – such as user calls, executes, transactions), then see what other time was lost just below the “Top 5″ – where, for example, are the disk read times ?
Provisionally: this looks like a small single-user run on a single-CPU box with a slow disk drive; maybe it’s just a testbed Windows laptop.
Practically: if you thought the statistics were supposed to tell you the symptons that would help you spot an undersized log buffer, they don’t.
Here’s another sample from the same author where, again, he seems to suggest that he is showing an extract from an AWR report that is symptomatic of an undersized log buffer.:
Here is a AWR report showing a database with an undersized log_buffer, in this case where the DBA did not set the log_buffer parameter in their init.ora file:
Avg Total Wait wait Waits Event Waits Timeouts Time (s) (ms) /txn log file sequential read 4,275 0 229 54 0.0 log buffer space 12 0 3 235 0.0 Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time CPU time 163,182 88.23 db file sequential read 1,541,854 8,551 4.62 log file sync 1,824,469 8,402 4.54 log file parallel write 1,810,628 2,413 1.30 SQL*Net more data to client 15,421,202 687 .37
Update 1st Sept 2010:I was prompted to re-read this article quite recently and discovered that I hadn’t discussed my reasons for disagreeing with Mr. Burleson’s interpretation of the second set of statistics, so I’ve added the following notes to address that omission.
The first thing to note, of course, is that there actually are some waits for log buffer space – 12 of them, totalling 3 seconds wait time. So there is at least a little justification for thinking that an increase in the log buffer size might make a difference. But before bouncing the database to increase the log buffer you need to examine the context a little more carefully.
- First of all – it’s possible to get a few log buffer space waits as a log file switch takes place, especially in a fairly busy system, almost regardless of how large you make the log buffer – so when the figures for log buffer space waits are very small compared to the figures for log file sync waits it’s probably sensible to ignore them.
- secondly, when you increase the size of the log buffer you make it possible for a process to write more data into the log buffer before a log file write takes place, which means the writes are bigger and take longer, and this can increase the time spent in log file sync waits (this trade-off between log buffer space waits and log file sync waits was a classic choice in earlier versions of Oracle).
- thirdly – in this particular case – the time spent on CPU was more than 163,000 seconds, which makes the 3 seconds of log buffer space waits look rather insignificant; moreover, extreme CPU pressure can have side-effects such as slowing down the rate at which the log buffer can be cleared – hence causing log buffer space waits.
So let’s ignore the log buffer space for the moment – it’s probably not worth any effort, it ‘s quite likely to be a side effect of a more important issue, and attempting to address it may cause more problems than it fixes.
The most significant figure in the Top 5 is, by a long way, CPU – so the obvious thing to do is look for SQL that’s consuming a lot of CPU: a good start would be checking the sections of the AWR report labelled: “SQL ordered by CPU”, “SQL ordered by Executions” and possibly “Segments by Logical Reads”.
A quick look at the other events in the Top 5 is instructive, though.
Notice that the number of log file syncs is large and very close to the number of log file writes – this suggests that there may be a lot of small transactions going on (if you have a small number of larger transactions you tend to see more log file writes than log file syncs.
When you look at the timing you notice that the log file writes are much quicker (1.3 ms) than the log file syncs (4.6 ms) – the difference is typically the time it takes the “redo synch write” message to go to the log writer and the acknowledgement to come back from the log writer. The difference is a classic indicator of CPU starvation – and that also brings us back to the log buffer space waits: if CPU starvation is making it hard for the log writer to get the message to flush the buffer to disc it becomes increasingly likely that the buffer will fill up unexpectedly and cause log buffer space waits.
You’ll also notice that the number of waits for “SQL*Net more data to client” seems rather high. Something is firing queries against this database and fetching lots of data back from it – it would be interesting to see the number of SQL*Net roundtrips to/from client and the number of bytes passed to see if this we can make any assumption about whether this is lots of smallish result sets, or a small number of very large result sets. Round-trip activity is an effective way of overloading your database and should probably be investigated. It’s possible, of course, that the incoming SQL will show itself up under one of the “SQL ordered by …” sections of the report.
The other figure in the Top 5 is the “db file sequential read” – which is operating at a respectable average time of 5.5 m/s – if this were a statspack report I’d look at the Event Histogram report, though, to check that this wasn’t a large number of very fast (i.e. locally cached, CPU-intensive) reads with a relatively small number of very slow reads. We need to account for – and address – that CPU usage somewhere.
The final number, that I haven’t mentioned yet, is the “log file sequential read” – that’s a wait event which can do with a little note of its own so I shall try to find some time to write about it over the next few weeks. In this case the number could simply be the archive processes recording their activity as they read the online redo log files and copy them.
Bottom line: these stats don’t give you any cause to fiddle with the log buffer size. The possible saving is tiny, and there are far more important targets to aim at.
Footnote: Any advice about reading AWR reports is almost always relevant when reading Statspack reports (and vice versa).