The output I want to look at in this example doesn’t come from statspack – but it does give you an important reminder about how statspack can deceive because it hides information (by averaging). The following text comes from a post on Ask Tom:
We have a three-tier application with Oracle 8i as the database. We are experiencing timeout issues (from the application log file) almost every day. Most of the time the system is very slow.
I get the following information from v$system_event:
Thu Dec 14 System-wide Wait Analysis for current wait events
Event Name                                  Total Waits  Total Timeouts
---------------------------------------   ------------  --------------
null event                                            1               1
latch free                                       13,484          13,478
pmon timer                                      273,516         272,088
rdbms ipc reply                                   4,390               0
rdbms ipc message                             1,421,940       1,145,272
enqueue                                           4,302               6
reliable message                                      1               0
control file sequential read                    176,526               0
control file parallel write                     324,022               0
refresh controlfile command                          17               0
local write wait                                    558               0
buffer busy waits                                32,807               4
log file sequential read                         56,750               0
log file single write                             8,384               0
log file parallel write                         148,912               0
LGWR wait for redo copy                              59               7
log buffer space                                 70,206               1
log file switch (checkpoint incomplete)           1,071             545
switch logfile command                            1,410               0
log file switch completion                        3,097             443
log file sync                                    59,280              64
db file sequential read                       6,830,195               0
db file scattered read                        9,829,549               0
db file parallel write                            7,847               0
direct path read                              7,408,664               0
direct path write                             2,619,359               0
direct path read (lob)                        4,030,001               0
direct path write (lob)                           1,275               0
instance state change                                 1               0
smon timer                                        2,739           2,732
library cache pin                                     3               0
file identify                                    21,002               0
file open                                       167,006               0
virtual circuit status                           27,313          27,313
dispatcher timer                                 13,661          13,661
SQL*Net message to client                    82,049,522               0
SQL*Net more data to client                  28,064,378               0
SQL*Net message from client                  82,049,497               0
SQL*Net more data from client                     6,594               0
SQL*Net break/reset to client                    52,752               0
Of course, this looks nothing like statspack, and doesn’t have any timing information at all – but it does contain some of the most commonly used information. So what can you see in these figures, and what questions could you ask?
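For reference (this query is my sketch, not part of the original post), a listing like the one above can be produced from v$system_event with something along these lines. The view also has time_waited and average_wait columns, but they stay at zero unless timed_statistics is set – which is presumably why the output shows no timing information:

select event, total_waits, total_timeouts
from   v$system_event
order by event;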
First of all, we can see from the smon timer that the database has been up a little under 10 days (smon times out every five minutes, pmon every three seconds) so, whatever has been happening, any important peaks in activity will have been smoothed out and may be invisible.
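As a quick check on that arithmetic (mine, not the original poster’s), the uptime can be estimated directly from the timer event:

-- smon times out roughly once every five minutes,
-- so total_waits * 5 minutes approximates the uptime
select total_waits,
       round(total_waits * 5 / (60 * 24), 1) as approx_days_up
from   v$system_event
where  event = 'smon timer';

For this system that gives 2,739 * 5 / 1,440, or about 9.5 days.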
Most of the time the system is slow:
Just looking for big numbers, it’s fairly “obvious” that the problem is I/O – look at the millions of sequential, scattered, and direct path reads. Except, of course, this is a 10-day summary – so a total of 24 million (6.8 + 9.8 + 7.4) physical read requests is 2.4 million per day, which is 100,000 per hour, which is only about 30 per second – and that doesn’t sound so big after all. Even if all the work is done in an 8-hour day, that’s still only about 85 per second. Can we infer a threat? Not without timing information and finer granularity.
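A sketch of that rate calculation as a query – the 9.5-day uptime is the estimate from the smon timer above, and the division assumes activity is spread evenly, which is exactly the assumption we should distrust:

select event, total_waits,
       round(total_waits / (9.5 * 86400), 1) as waits_per_sec
from   v$system_event
where  event in ('db file sequential read',
                 'db file scattered read',
                 'direct path read')
order by total_waits desc;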
Of course, we can also see 82 million SQL*Net “roundtrips” – and quite a lot of those round-trips also seem to entail “SQL*Net more data to client”. That seems to be a very busy client and a busy network: until you work out the per-second activity, which turns it into about 100 round-trips per second (or, for an 8-hour working day, about 300 per second). Can we infer a threat? This time we might be rather more suspicious, except we don’t know how many concurrent sessions there are to make that many trips – maybe it’s a well-honed 3,000 user/session system, and the real problem is CPU – but for the moment I’m going to flag up application design as a potential issue. Again, though, we need better granularity – we need to see peaks and troughs in the workload.
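The concurrency question is easy to answer if you can get onto the system while it is busy – though this is only a point-in-time count, not an average:

-- how many end-user sessions are connected right now?
select count(*) as user_sessions
from   v$session
where  type = 'USER';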
Then there’s the matter of the LOBs – we see lots of “direct path read (lob)” waits: the LOB reads number about half as many as the ordinary direct path reads. This suggests the LOBs are declared as nocache. It seems likely that we could get some performance wins by caching these LOBs to reduce I/O – except we don’t know that I/O is the problem, we don’t know what side-effects we might introduce by caching those LOBs, and we can’t guess the potential benefit of caching them anyway – but we can flag it up for further investigation.
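A starting point for that investigation might look like this – the first query identifies the uncached LOBs, the second shows the change (the table and column names are hypothetical, and in real life you would want to think about buffer cache side-effects first):

-- which LOB columns are declared nocache?
select owner, table_name, column_name, cache, logging
from   dba_lobs
where  cache = 'NO';

-- switch a LOB column to caching (illustrative names only)
alter table app_owner.documents modify lob (doc_body) (cache);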
We are experiencing timeout issues
Some of the small numbers – even with relatively small amounts of time lost (not that it’s recorded) – might be the cause of this problem. And those timeouts might be telling us why “the system is slow” the rest of the time.
There are about 4,300 enqueue waits – that’s only about 430 per day, which is roughly 50 per hour if we assume they occur during the working day – and since only 6 of them timed out, most of them must have lasted less than three seconds (the typical enqueue wait timeout interval). It might take just a couple of seconds of simple data locking to cause an application time-out – and we are seeing an average of a little under one enqueue wait per minute. But, again, we don’t know how many concurrent sessions there are, or how to share out those 4,300 waits.
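Enqueue waits are transient, so to find out who is waiting for what you would have to poll while the problem is happening – a sketch for 8i (later versions also offer v$enqueue_stat):

-- p1 encodes the lock type and requested mode
select sid, event, p1, p1raw, p2, p3, seconds_in_wait
from   v$session_wait
where  event = 'enqueue';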
There are 1,071 “log file switch (checkpoint incomplete)” waits – when one of those happens you can’t update the database, so maybe that helps to explain the application timeouts (not that these people seem to be doing much in the way of updating the database – but we can’t really tell that from wait events alone). Strangely enough, there are 1,410 waits for “switch logfile command” – in 10 days, that means something is issuing “alter system switch logfile” roughly every 10 minutes. Maybe these events are connected – maybe not.
The fact that the log file switch has been coded suggests that someone was thinking ‘busy night time, quiet day time’ – setting up dbms_job to switch log files during the day so that they could have large log files for the night-time processing. Is this a hangover from another system? Standard in-house practice? Or is it an indication that very little update happens during the day, so that any clues about update contention in the figures should be ignored as far as our hunt for end-user problems is concerned?
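A quick way to check for such a job (my sketch – the switch might equally come from a cron script calling SQL*Plus, which wouldn’t show up here):

select job, what, interval, last_date, next_date
from   dba_jobs
where  upper(what) like '%SWITCH%LOGFILE%';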
Even so, there is something very strange about the log-related figures – we see hardly any data block writes, yet we have indications of excess log file activity. We must be generating quite a lot of redo, or we wouldn’t expect to see “checkpoint incomplete” and “log buffer space” waits – unless the log buffer and log files are incredibly tiny, so it’s worth checking the basic configuration before worrying too much about statspack.
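That configuration check is cheap – something like:

-- how big are the online redo logs?
select group#, bytes/1024/1024 as size_mb
from   v$log;

-- and the log buffer?
select name, value
from   v$parameter
where  name = 'log_buffer';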
The “LGWR wait for redo copy” figure is also an interesting anomaly. It doesn’t happen often, and you would normally associate it with a system doing lots of small, highly concurrent updates – which this system is clearly not doing. It’s only a tiny number, but it’s another hint that CPU saturation may be a problem.
Possibly the system keeps updating a few items of data in the same few blocks and only commits occasionally (there are only 59,000 commits – log file syncs – in 10 days) – that could help to explain the logging symptoms and the buffer busy waits; again, we have to go back to the application for inspiration. At a minimum we really need to look at v$sysstat to see the volume of redo generated, not just the number of resulting write waits.
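The relevant v$sysstat figures would be along these lines:

select name, value
from   v$sysstat
where  name in ('redo size', 'redo entries', 'user commits');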
Let’s get back to the LOBs – there are just a few “direct path write (lob)” waits. If the LOBs are declared as “nocache logging”, a few large LOB inserts or updates might be just enough to cause occasional peaks of log buffer activity, jamming the system briefly whilst still looking relatively harmless on average. Again we see the need for finer granularity, simply to get a clue about whether the problems relate to peaks in activity that disappear when the statistics are averaged across a long period.
Another thought about LOBs, of course, is what is being done to them after they are read? Is it some PL/SQL post-processing in Oracle, or even the processing in the middle tier, that is taking up all the time? Is it really just a complete waste of time looking at wait events in this case? (XML springs to mind as a culprit – but that’s not based on a scientific analysis of the available data.)
The SQL*Net activity hints at the possibility that the middle tier is doing something very inefficient, but apart from that one very vague suggestion there is nothing in this report from v$system_event that tells you where to make good use of your time.
There are a couple of quick configuration sanity checks (log file and log buffer sizing) and the question about why there is a 10-minute log file switch (which could be replaced in 9i by setting the archive_lag_target parameter). But these don’t look as if they are going to be particularly relevant to the end-user performance issues.
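For completeness, the 9i alternative to the scheduled switch would be something like this (600 seconds matching the observed 10-minute cycle):

alter system set archive_lag_target = 600;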
There is the (notionally) easy option of switching the LOBs to caching – but that might require some other infrastructure changes which need careful investigation, and there is no guarantee that it would improve the situation.
There are several other inferences that you could make about where the problems might be. (It’s taken me three hours to type up this note; it took me about five minutes of reading the stats to come up with several more possibilities – which I haven’t written down because (a) it would take too long and (b) in real life I would eliminate most of them by cross-checking against v$sysstat, v$enqueue_stat, v$filestat, v$sql_workarea, v$segstat, v$latch, v$sql and a few others – but mostly v$sysstat/v$sesstat.)
Clearly you need to see timing information to get a better idea of where things are slow; but you also need a much finer granularity of reporting – which you get by taking a statspack snapshot at level zero every 15 minutes or so – so that you don’t average out the peaks and hide the problems. And in this case, a report across one or two 15-minute snapshots around the time of an application timeout might also show some matching Oracle symptoms or anomalies.
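A minimal way to set that up (a sketch – snapshot frequency and retention would need tailoring to the site):

-- take a single level-zero snapshot
execute statspack.snap(i_snap_level => 0)

-- or schedule one every 15 minutes through dbms_job
variable j number
begin
  dbms_job.submit(
    job       => :j,
    what      => 'statspack.snap(i_snap_level => 0);',
    next_date => sysdate,
    interval  => 'sysdate + 15/(24*60)'
  );
  commit;
end;
/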
However, looking at wait times alone is not enough. You need, as a minimum, to know about the volume of work being done behind those waits – how much redo was generated to cause the redo writes, what sort of activity was behind the direct path reads, whether the low volume of physical I/O was the consequence of an extreme volume of logical I/O, and so on. You need to be able to cross-reference wait time against workload in various ways.
The difficulty, ultimately, is that if you don’t have a reasonably scoped set of data, and a reasonably complete set of figures, and a reasonable understanding of how different statistics relate to each other, you just end up with a great long list of “maybes” when you really need a short list of “likelies” to make best use of your time.
Whenever I take on a trouble-shooting engagement, I try to end up with a report that lists about five options in order of desirability – because that’s what the client usually needs.
I also aim to have a couple of those options rated as “probably the best thing to do”, a couple of “might help”, and a couple of “easy, but probably not much benefit”. I also make sure that I give a clear idea of the investment required for each option, and the benefit.
Footnote: Any advice about reading Statspack reports is almost always relevant when reading AWR reports (and vice versa).