Here’s an interesting extract from an Oracle 9i statspack report taken from Oracle Database 10g Performance Tuning Tips and Techniques*** . The book doesn’t give the snapshot interval, number of CPUs, or any idea about the average number of sessions active during the period, so all the comments I make about the sample are highly speculative.
But, as a pure “thought experiment”, I want to offer a feasible explanation of the figures shown, just to give you a flavour of how counter-intuitive a statspack report can be.
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~
                                                                % Total
Event                                         Waits    Time (s) Ela Time
--------------------------------------- ------------ ---------- --------
db file sequential read                   399,394,399  2,562,115    52.26
CPU time                                                 960,825    19.60
buffer busy waits                         122,302,412    540,757    11.03
PL/SQL lock timer                               4,077    243,056     4.96
log file switch (checkpoint incomplete)       188,701    187,648     3.83
The question is – which of these top five figures will have the most impact on the end-users’ perception of system performance?
Let me repeat that this is purely conjectural and proposed for the purposes of enlightenment only, but it’s possible that the performance issue that’s most visible to the end users is the one that’s reporting the shortest time on the list above.
I have no idea how long a time period this snapshot lasted, nor how many sessions were actually involved. The best I can say is that if this was running with one CPU the snapshot must have been at least 267 hours (CPU time / 3,600) and if there were 16 CPUs it was at least 17 hours. Allowing for the fact that people rarely run at 100% CPU, and 17 hours is an odd snapshot time, then possibly it’s 16 CPUs for 24 hours, or 64 CPUs for 6 hours – let’s assume the latter.
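The arithmetic behind those lower bounds is easy to check. Here’s a minimal sketch – the CPU counts are just the hypothetical values from the paragraph above, not anything the report tells us:

```python
# Minimum snapshot duration implied by 960,825 seconds of CPU time,
# for a few hypothetical CPU counts (the report doesn't say how many).
cpu_seconds = 960_825

for n_cpus in (1, 16, 64):
    min_hours = cpu_seconds / (n_cpus * 3_600)
    print(f"{n_cpus:>2} CPU(s): snapshot lasted at least {min_hours:,.1f} hours")
```

Note that at 64 CPUs the lower bound is only about 4.2 hours, so the assumed 6-hour snapshot implies roughly 70% average CPU utilisation – which is consistent with the observation that people rarely run at 100% CPU.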
Once your mind is operating at the right scale, the rather large counts on single block reads and buffer busy waits start to look slightly less dramatic.
Of course, when you do extreme amounts of reading you tend to get a lot of buffer busy waits, as one session waits for another session to finish reading a block that they both happen to want; and the CPU cost of latching the cache buffers chains and LRU chains can get a bit expensive – so the activity does look a bit undesirable and should be reviewed. (10g conveniently renames this particular read-related wait “read by other session”.)
But take a close look at the times: the average read time is 6.41 milliseconds (total time divided by total reads) and the average buffer busy wait is 4.42 milliseconds – but the average wait for the log file switch checkpoint completion is 0.994 seconds.
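Those averages come straight from dividing each total time by its wait count in the Top 5 list above – a quick sanity check:

```python
# Average wait per event: total waited time (seconds) divided by the
# number of waits.  All figures come from the Top 5 Timed Events list.
events = {
    "db file sequential read":                 (399_394_399, 2_562_115),
    "buffer busy waits":                       (122_302_412,   540_757),
    "log file switch (checkpoint incomplete)": (    188_701,   187_648),
}

for name, (waits, seconds) in events.items():
    avg_ms = seconds / waits * 1_000
    print(f"{name:<42} {avg_ms:8.2f} ms average")
```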
There are two important points to note here. First, it’s the end-user sessions that wait on log file switch (checkpoint incomplete); secondly, the timeout on that event is one second – which means this might really be a small number of very long waits cut into one-second slices, since the average is so close to one second.
Now we don’t know how many end-users hit the checkpoint problem, or how long each end-user waited (ticking the seconds away one at a time), but log file switches shouldn’t be happening all the time and, quite possibly, there’s a lot of query-based activity going on that doesn’t generate redo and therefore doesn’t hit this problem. So let’s play numbers and see what happens.
If, in that hypothetical 6-hour time period, we actually have 188,000 periods of session activity that all wait for one second, and share out the other waits between them, how much extra wait time gets associated with each log switch wait? Just under 2.9 seconds of buffer busy wait, and 13.6 seconds of single block read time – for a total of 16.5 seconds. (That’s just dividing the other times by the 188,000 assumed sessions.)
This means that if you have just 15.5 sessions running for every session that hits a log switch wait on DML, the log switch wait becomes the most visible contributor to the stalled sessions’ wait time. And since our assumption was that this is a big system, a 15.5 : 1 split between query and DML activity could easily occur at that sort of scale.
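To make the share-out explicit, here’s the same arithmetic in code. Remember that the 188,000 affected sessions and the even sharing are assumptions for the thought experiment, not reported figures:

```python
# Spread the 'other' wait time across an assumed 188,000 sessions that
# each suffer a one-second log switch wait.  The session count is pure
# assumption -- it simply matches the number of log switch waits reported.
sessions = 188_000

bbw_each   = 540_757 / sessions       # buffer busy wait share per session
read_each  = 2_562_115 / sessions     # single block read share per session
other_each = bbw_each + read_each     # total 'other' wait per session

print(f"buffer busy waits : {bbw_each:5.1f} s per session")
print(f"single block reads: {read_each:5.1f} s per session")
print(f"total other wait  : {other_each:5.1f} s per session")

# If ~16.5 sessions share that load, each sees about one second of it --
# the same as a single log switch wait -- hence the 15.5 : 1 break-even.
print(f"break-even ratio  : {other_each - 1:4.1f} : 1")
```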
Now I’ve done a lot of hand-waving and playing with numbers – but the argument is viable: it could really happen. This is a big system (or a small system viewed over a very long period of time) that manages to get through a lot of work. There are clear signs of resource wastage and contention from the buffer busy waits – but it is possible that the only performance problem the end-users notice is at the log file switch when some of them go into the checkpoint not complete.
The smallest statspack wait could be the most visible end-user issue – averages hide the detail, and only the sessions can tell.
Footnote: log file switch (checkpoint incomplete) happens to be a wait that can (usually) be eliminated very easily – having more, or bigger, log files is often all it takes to avoid the checkpoint completion issue. Sometimes, of course, the problem can be reduced by ensuring that the database writer(s) have less to write, or are able to write faster (which typically means changing fast_start_mttr_target or increasing the number of database writers, respectively). The most appropriate action will only become clear when you examine the log file sizes, log file switch timings, settings for the two log_checkpoint parameters, and the rest of the statistics.
*** Specifically Chapter 14, “Using Statspack and the AWR Report to Tune Waits and Latches”, available as a PDF download from McGraw-Hill.
Footnote: Any advice about reading statspack reports is almost always relevant when reading AWR reports (and vice versa).