[Further Reading on Statspack]
The output I want to look at in this example doesn’t come from statspack – but it does give you an important reminder about how statspack can deceive because it hides information (by averaging). The following text comes from a post on Ask Tom:
We have a three-tier application with Oracle 8i as the database. We are experiencing timeout issues (from the application log file) almost every day. Most of the time the system is very slow.
I get the following information from v$system_event:
Thu Dec 14
System-wide Wait Analysis for current wait events

Event                                     Total        Total
Name                                      Waits     Timeouts
--------------------------------    -----------  -----------
null event                                    1            1
latch free                               13,484       13,478
pmon timer                              273,516      272,088
rdbms ipc reply                           4,390            0
rdbms ipc message                     1,421,940    1,145,272
enqueue                                   4,302            6
reliable message                              1            0
control file sequential read            176,526            0
control file parallel write             324,022            0
refresh controlfile command                  17            0
local write wait                            558            0
buffer busy waits                        32,807            4
log file sequential read                 56,750            0
log file single write                     8,384            0
log file parallel write                 148,912            0
LGWR wait for redo copy                      59            7
log buffer space                         70,206            1
log file switch (checkpoint incomplete)   1,071          545
switch logfile command                    1,410            0
log file switch completion                3,097          443
log file sync                            59,280           64
db file sequential read               6,830,195            0
db file scattered read                9,829,549            0
db file parallel write                    7,847            0
direct path read                      7,408,664            0
direct path write                     2,619,359            0
direct path read (lob)                4,030,001            0
direct path write (lob)                   1,275            0
instance state change                         1            0
smon timer                                2,739        2,732
library cache pin                             3            0
file identify                            21,002            0
file open                               167,006            0
virtual circuit status                   27,313       27,313
dispatcher timer                         13,661       13,661
SQL*Net message to client            82,049,522            0
SQL*Net more data to client          28,064,378            0
SQL*Net message from client          82,049,497            0
SQL*Net more data from client             6,594            0
SQL*Net break/reset to client            52,752            0
Of course, this looks nothing like statspack, and doesn’t have any timing information at all – but it does contain some of the most commonly used information. So what can you see in these figures, and what questions could you ask?
First of all, we can see from the smon timer that the database has been up a little under 10 days (smon times out every 5 minutes, pmon every three seconds) so, whatever has been happening, any important peaks in activity will have been smoothed off and may be invisible.
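As a cross-check (the smon/pmon arithmetic is only an approximation), the instance start time can be read directly – a minimal query:

select startup_time,
       round(sysdate - startup_time, 1) as days_up
from   v$instance;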
Most of the time the system is slow:
Just looking for big numbers, it’s fairly “obvious” that the problem is I/O – look at the millions of sequential, scattered, and direct path reads. Except, of course, this is a 10-day summary – so a total of 24 million (6.8 + 9.8 + 7.4) physical read requests is 2.4 million per day, which is 100,000 per hour, which is only about 30 per second – and that doesn’t sound so big after all. Even if all the work is done in an 8-hour day, that’s still only about 90 per second. Can we infer a threat? Not without timing and a finer granularity.
Of course, we can also see 82 million SQL*Net “round-trips” – and quite a lot of those round-trips also seem to entail “SQL*Net more data to client”. That looks like a very busy client and a busy network – until you work out the per-second activity, which turns it into about 100 round-trips per second (or, for an 8-hour working day, 300 per second). Can we infer a threat? This time we might be rather more suspicious – except we don’t know how many concurrent sessions there are to make that many trips; maybe it’s a well-honed 3,000 user/session system, and the real problem is CPU – but for the moment I’m going to flag up application design as a potential issue. Again, though, we need better granularity – we need to see peaks and troughs in workload.
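If all you have are the cumulative figures, a query of this shape turns the lifetime totals into average rates per second – a sketch only, and the event-name filter is just an example:

select e.event,
       e.total_waits,
       round(e.total_waits / ((sysdate - i.startup_time) * 86400), 1) as waits_per_sec
from   v$system_event e,
       v$instance     i
where  e.event like 'SQL*Net%'
or     e.event like 'db file%'
or     e.event like 'direct path%'
order by
       e.total_waits desc;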
Then there’s the matter of the LOBs – we see lots of “direct path read (lob)” waits: about half the direct path reads are for LOBs. This suggests the LOBs are declared as nocache. It seems likely that we could get some performance wins by caching these LOBs to reduce I/O – except we don’t know that I/O is the problem, we don’t know what side effects we might introduce by caching those LOBs, and we can’t guess the potential benefit of caching them anyway – but we can flag it up for further investigation.
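For reference, switching a LOB to caching is a one-line change (the table and column names here are purely hypothetical) – but the knock-on effect on the buffer cache is exactly the sort of side effect that needs to be assessed first:

alter table documents modify lob (doc_body) (cache);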
We are experiencing timeout issues:
Some of the small numbers – even with relatively small amounts of time lost (not that it’s recorded) – might be the cause of this problem. And those timeouts might be telling us why “the system is slow” the rest of the time.
There are 4,000 enqueues – that’s only 400 per day, which is only 50 per hour if we assume that they occur in the working day – and most of them persist for less than 3 seconds (the typical enqueue timeout interval). It might take just a couple of seconds of simple data locking to cause an application time-out – and we are seeing an average of a little under one enqueue wait per minute. But, again, we don’t know how many concurrent sessions there are, and how to share out these 4,000 waits.
There are 1,071 “log file switch (checkpoint incomplete)” waits – when one of those happens you can’t update the database, so maybe that helps to explain the application timeouts (not that these people seem to be doing much in the way of updating the database – but we can’t really tell that from wait states). Strangely enough, there are 1,410 waits for “switch logfile command” – in 10 days, that means that something is issuing “alter system switch logfile” roughly every 10 minutes. Maybe these events are connected – maybe not.
The fact that the log file switch has been coded suggests that someone was thinking ‘busy night-time, quiet daytime’ – setting up dbms_job to switch log files during the day so that they could have large log files for the night-time processing. Is this a hangover from another system? Standard in-house practice? Or is it an indication that very little update happens during the day, and any clues about update contention in the figures should be ignored as far as our hunt for end-user problems is concerned?
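Whether the switch really is coming from a scheduled job is easy to check (assuming it is a dbms_job at all, which is only a guess):

select job, what, interval, next_date
from   dba_jobs;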
Even so, there is something very strange about the log-related figures – we see hardly any data block writes, yet we have indications of excess log file activity. And we must be generating quite a bit of redo, because we wouldn’t expect to see “checkpoint incomplete” and “log buffer space” waits unless the log buffer and log files were incredibly tiny – so better check the basic configuration before worrying too much about statspack.
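The basic configuration check is trivial – something like:

select value from v$parameter where name = 'log_buffer';

select group#, bytes/1024/1024 as size_mb, members, status
from   v$log;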
The “LGWR wait for redo copy” figure is also an interesting anomaly. It doesn’t happen often, and you would normally associate it with a system doing lots of small, highly concurrent updates – which this system is clearly not doing. It’s only a tiny number, but it’s another hint that CPU saturation may be a problem.
Possibly the system keeps updating a few items of data in the same few blocks and only commits occasionally (there are only 59,000 commits (log file syncs) in 10 days) – that could help to explain the logging symptoms and the buffer busy waits; again, we have to go back to the application for inspiration. At a minimum we really need to see v$sysstat to see the volume of redo, not just the number of resulting write waits.
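The relevant figures come straight out of v$sysstat – for example (statistic names as they appear in 8i/9i):

select name, value
from   v$sysstat
where  name in ('redo size', 'redo entries', 'db block changes',
                'user commits', 'user rollbacks', 'physical writes');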
Let’s get back to the LOBs – there are just a few “direct path write (lob)” waits. If the LOBs are declared as “nocache logging”, a few large LOB inserts or updates might be just enough to cause a few peaks of log buffer activity, jamming the system briefly whilst still looking relatively harmless on average. Again we see the need for finer granularity, simply to get a clue about whether the problems relate to peaks in activity which simply disappear when the statistics are averaged across a long time.
Another thought about LOBs, of course, is: what is being done to them after they are read? Is it some PL/SQL post-processing in Oracle, or even the processing in the middle tier, that is taking up all the time? Is it really just a complete waste of time looking at wait events in this case? (XML springs to mind as a culprit – but that’s not based on a scientific analysis of the available data.)
Conclusions
The SQL*Net activity hints at the possibility that the middle tier is doing something very inefficient, but apart from that one very vague suggestion there is nothing in this report from v$system_event that tells you where to make good use of your time.
There are a couple of quick configuration sanity checks (log file and log buffer sizing) and the question about why there is a 10-minute log file switch (which could be replaced in 9i by setting archive_lag_target). But these don’t look as if they are going to be particularly relevant to the end-user performance issues.
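In 9i the scheduled switch could simply be dropped in favour of the parameter – the value is in seconds, and 1,800 here is only an illustrative choice:

alter system set archive_lag_target = 1800;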
There is the (notionally) easy option of switching the LOBs to caching – but that might mean some other infrastructure changes which need careful investigation, and it’s not certain that this would improve the situation.
There are several other inferences you could make about where the problems might be. (It has taken me three hours to type up this note; it took about 5 minutes of reading the stats to come up with several more possibilities – which I haven’t written down because (a) it would take too long and (b) in real life I would eliminate most of them by cross-checking against v$sysstat, v$enqueue_stat, v$filestat, v$sql_workarea, v$segstat, v$latch, v$sql and a few others – but mostly v$sysstat/v$sesstat.)
Clearly you need to see timing information to get a better idea of where things are slow; but you also need a much finer granularity of reporting – which you get by taking a statspack snapshot at level zero every 15 minutes or so – so that you don’t average out the peaks and hide the problems. And in this case, a report across one or two 15-minute snapshots around the time of an application timeout might also produce some matching Oracle symptoms or anomalies.
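The supplied spauto.sql script schedules hourly snapshots; a 15-minute, level-zero schedule can be set up along these lines (a sketch using dbms_job, run as the statspack owner):

variable jobno number
begin
  dbms_job.submit(
    job       => :jobno,
    what      => 'statspack.snap(i_snap_level => 0);',
    next_date => sysdate,
    interval  => 'sysdate + 15/1440'    -- every 15 minutes
  );
  commit;
end;
/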
However, looking at wait times alone is not enough. You need, as a minimum, to know about the volume of work being done behind those waits – how much redo was generated to cause the redo writes, what sort of activity was behind the direct path reads, was the low volume of physical I/O the consequence of an extreme volume of logical I/O etc., etc. etc. So you need to be able to cross-reference wait-time against workload in various ways.
The difficulty, ultimately, is that if you don’t have a reasonably scoped set of data, and a reasonably complete set of figures, and a reasonable understanding of how different statistics relate to each other, you just end up with a great long list of “maybes” when you really need a short list of “likelies” to make best use of your time.
Whenever I take on a trouble-shooting engagement, I try to end up with a report that lists about five options in order of desirability – because that’s what the client usually needs.
I also aim to have a couple of those options rated as “probably the best thing to do”, a couple of “might help”, and a couple of “easy, but probably not much benefit”. I also make sure that I give a clear idea of the investment required for each option, and the benefit.
[Further Reading on Statspack]
Footnote: Any advice about reading Statspack reports is almost always relevant when reading AWR reports (and vice versa).
Our client has a similar (3-tier) architecture, and the same symptoms (the system is slow), but even Statspack reports do not reveal any real database bottlenecks – except randomly run batch jobs which hog the system every now and then, and that does not explain being slow all the time.
Our application runs on J2EE and uses Hibernate. We (DBAs) know it does excessive reads from the database (you wouldn’t believe how much is being read in order to fill a few widgets in the GUI) and updates every table column in a row even though only one changes value, but we have a hard time persuading the Java programmers to optimize, because sometimes the system just runs OK. So it must be the database :)
Anyhow, I find your series of Statspack articles to be the most instructive reading about troubleshooting Oracle performance problems – so please keep up the good work.
Comment by matjazc — January 7, 2007 @ 11:15 pm GMT
>>
Clearly you need to see timing information to get a better idea of where things are slow; but you also need a much finer granularity of reporting – which you get by taking a statspack snapshot at level zero every 15 minutes or so – so that you don’t average out the peaks and hide the problems. And in this case, a report across one or two 15-minute snapshots around the time of an application timeout might also produce some matching Oracle symptoms or anomalies.
>>
Spot on – I do agree with the above statement. The three hours of work you have done analysing the report and explaining the possible causes: I feel it is simply guesswork without any real timing information.
Since the output was generated from v$ performance views, which hold statistics accumulated since the database was started, it could be that somewhere in those 10 days most of the problems occurred in a single day, or a single hour, for multiple reasons; sadly they accumulate in the v$sys* views as totals, and that makes it really hard to figure out the exact problems.
As you suggest, making a habit of taking a 15-minute statspack report (like the automated AWR in 10g) is the way to go.

>>
Next in this series: a strategy for becoming more familiar with statspack.
>>
I really look forward to your statspack series.
Jaffar
Comment by Jaffar — January 8, 2007 @ 2:58 pm GMT
Jaffar,

Guesswork is correct – that’s why I only spent 5 minutes looking at the report.
The 3 hours was the time it took to write down why the observations from the 5 minutes shouldn’t be considered anything more than a slight clue that might actually point you in the wrong direction.
Comment by Jonathan Lewis — January 8, 2007 @ 5:18 pm GMT
Hi Jonathan,
I suspect that my system is affected by too much cache activity and cache buffers chains latch related issues. I have made a brief analysis in Tom Kyte’s ‘Ask Tom’ forum: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:3675961676045#725258000346571232 (Reviewer: Nilanjan Ray from UK)
Related to my post I would like to:
1. Quantify the CPU consumption percentage due to latch spin. Is it possible through any 10g metrics?
2. What does v$latch.wait_time tell me?
I have seen in my system, for the cache buffers chains latch, cases where there are zero spin gets but the misses and sleeps are noticeable – and interestingly wait_time is zero. Does this mean that v$latch.wait_time (or stats$latch.wait_time or dba_hist_latch.wait_time) is a measure of how much the spin cost for the latch? On the other hand, whenever I have spin gets I get a non-zero wait_time. Is v$latch.wait_time sufficient to reflect CPU issues due to latch spin? Your opinions and views on this please.
Comment by Nilanjan Ray — March 25, 2008 @ 10:05 pm GMT
Nilanjan,
1. I don’t think there’s any way of quantifying the CPU due to latch spins. But if I see a lot of latch spins – combined with multiple sleeps on the latch – I may decide that there is excessive activity on the latch which needs to be addressed by reviewing the use of the objects protected by that latch. (i.e. I use the latch spin count as an indicator of the underlying issue, rather than an end in itself).
2. There are version-dependent issues with v$latch.wait_time. It should reflect the time reported for latch sleeps in v$system_event but there are a couple of bugs reported in Metalink about the times not being updated correctly (particularly for shared hash latches, I believe – and cache buffers chains is a sharable latch).
Since you have to spin before you sleep, I am prepared to work on the assumption that even a relatively low number of sleeps on a latch may be associated with a performance issue because of CPU time spent in the preliminary spinning.
I have to say, though, that when latch sleeps or latch spins stand out there are usually other things that demand my attention first – and usually explain the latch activity.
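For reference, the figures under discussion sit side by side in v$latch – the units of wait_time being one of those version-dependent details:

select name, gets, misses, spin_gets, sleeps, wait_time
from   v$latch
where  name = 'cache buffers chains';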
Comment by Jonathan Lewis — April 1, 2008 @ 5:02 pm BST
Thanks Jonathan. I did notice the formatting problem too after I hit the submit button. Here’s my second attempt, with the blank spaces replaced. Please feel free to delete my previous post.
—————————————————————————————————————————–
We had a situation where our CPU maxed out and there were a number of processes in the run queue (seen through vmstat). We often have situations where we see a number of processes in the run queue and the server idle time drops to 0%:

I have an AWR report from one of these representative periods. Although this is a 6-minute snap, longer periods of 15 minutes and 1 hour are representative to the same extent. The highlights are:
Since the contribution of the wait time is negligible to the total response time I have not included anything from the Wait section:
With the above data I would think:
1. Since the contribution of the wait time is negligible to the total response time, the bottleneck was CPU. This is evident from the run queue being built up.
2. Is parsing causing a CPU bottleneck?
3. Can’t rule out latch activities.
Let’s look at the parse-related statistics:
4. Looking at the % of DB Time spent on parsing (parse time elapsed and hard parse elapsed time), it seems that parsing could be a major contributor to the CPU bottleneck.
5. However, looking at the % CPU consumed due to parsing, it is only 1.28% (3,914/306,594*100). This is also reflected in the Instance Efficiency % stats of the summary section (% Non-Parse CPU: 99.05).
What remains next is to look at the latch activities because extensive latch spins/sleeps can contribute to the CPU usage:
6. I am not interested in the latch waits, because overall the wait time is not a significant component of the response time. A latch posts a wait only when it goes off to sleep after exhausting its complete spin cycle.
The ones with the most requests are
Let’s look at some Latch related statistics now.
7. Looking at the latch sleep breakdown (misses -> spin gets -> sleeps) leaves little doubt that the cache buffers chains latch is the reason for the high CPU utilization.
8. At this point I think it is safe to assume that parsing is not the cause of the CPU bottleneck.
To get a more concrete idea about the cache buffers chains latch, let’s look at the latch miss sources data:
9. At this point, looking at the Sleeps and Waiter Sleeps statistics, I would be almost certain that the CPU bottleneck is due to the cache buffers chains latch.
10. The next thing to look for is the tables and SQL with the most logical I/O, since these would put most pressure on the cache buffers chains latch.
The next snippet shows the tables with most LIO.
11. The SQL with the most I/O turned out to be a SQL statement from SYS. This is parsed every time it is executed.
12. On Metalink I found bug 5455880, which says ‘EXCESSIVE RECURSIVE QUERY ON TS$ WHEN USING A “TABLESPACE GROUP” AS TEMP TS’. We had a temporary tablespace group; once we removed the temporary tablespace group, the TS$ query disappeared, as did the SQL.
13. The improvement in the cache buffers chains latch, and in the system in general, can be seen from the AWR snippet below, taken the next day. This is from an hourly snap, whereas the earlier one was a 6-minute snap:
14. However, the parse time elapsed has dropped by only 2%. This again confirms my initial observation – on which I based my investigation – that overall the cache buffers chains latch is the root cause and parsing comes way below it in suspicion.
Having found the reason for the problem, what continues to baffle me is:
1. Which should have been the usual suspect to move ahead with the investigation – parsing, or excessive sleeps/spins on the cache buffers chains latch?
2. Is it a correct conclusion that removing the temporary tablespace group – and hence the SYS-generated recursive SQL – removed something that was causing more of a latch-related performance problem than it was contributing to parsing?
3. If the % Non-Parse CPU statistic is 99.05, should I concentrate on parse-related CPU issues to start off with? Or rather, should I suspect parsing as an issue from a CPU angle at all?
5. At any point in the report or my analysis, should I have suspected that parsing in general was an issue capable of causing a performance problem?
6. The ‘DB CPU’ does not correlate with ‘CPU used by this session’; nor do the ‘parse time elapsed’ values in the Instance Activity statistics and the Time Model Statistics.
7. From the above data (CPU used by this session = 306,594 and parse time cpu = 3,914), would it be fair to say that the % of CPU spent parsing is only about 1.28%, and would have caused negligible impact on CPU?
Could you please give your opinion on the above?
Thanks
Nilanjan
Comment by Nilanjan Ray — April 17, 2008 @ 1:43 pm BST
Nilanjan,
Sorry it’s taken so long to get back to this – I don’t usually let people use the blog as a substitute for AskTom, so things like this are likely to get a low priority.
I’ve put in a couple of “code” tags to improve the formatting just a little more – but wordpress isn’t very friendly with fixed fonts.
First observation:
If you’re running close to 100% CPU, then any Oracle problem will get exaggerated. So you need to get rid of some of your “real work” CPU. As you’ve pointed out, your wait time is tiny by comparison – so your first task is to sort out where that CPU time is going.
Your elapsed time is 6:18, with a CPU time recorded of 4,127 seconds. Since you’ve said you’re close to 100% CPU, I’ll assume this means you have roughly 10.9 CPUs – so I’ll guess 12 as a nearby “sensible” number.
The figure that looks a little suspicious in the “Load Profile” is the Executes – at 6,151 per second. That seems a bit high when you have about 12 CPUs. Similarly, “user calls” looks a little bit of a threat – given that it implies possible round-trip activity and pressure on network stacks. The redo and logical I/O look a little busy, but for a machine with 12 CPUs the figures should be quite sustainable.
There is an anomaly between the “Timed Event” CPU and the System statistics – the CPU usage is roughly 8.27 seconds per second from the latter, but 10.9 seconds per second from the former. On 10g, though, this may come from the difference in how the two stats are measured – the difference could be due to long running SQL statements crossing the snapshot boundaries.
Parsing does not seem to be your problem – the parse time CPU is 10.6 seconds from a total of 827.3 seconds – and you’ve got your own figures that took you to the same conclusion.
Looking at the latches – don’t discount them, and don’t assume that it’s the cache buffers chains latch. The “session allocation” latch is running into a lot of misses and spins – which fits with the high numbers of executes and “user calls”. You may be losing quite a lot of CPU there – I’d also look at the O/S CPU statistics at this point to see how much of your CPU time is SYS time when the machine is busy: maybe your CPU is disappearing in O/S session management.
That bug/symptom about temporary tablespace groups is nice to know. At this point, with a full AWR, I’d have been looking at the SQL statements using most CPU and with most executions – possibly following up with the ones with the highest version counts.
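A query along these lines (a sketch against v$sqlarea, not something taken from the AWR report itself) pulls that sort of list straight out of the library cache:

select *
from  (
        select cpu_time / 1000000   as cpu_secs,
               executions,
               version_count,
               substr(sql_text, 1, 60) as sql_text
        from   v$sqlarea
        order by
               cpu_time desc
      )
where  rownum <= 10;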
The performance on cache buffers chains latch when you took out all these executions could simply have been due to reduction in CPU usage allowing faster turnaround of sessions – i.e. it may not have been directly due to the reduction in cache activity, but to the availability of CPU to allow processes to get to the top of the run queue more quickly.
So, answers to your questions:
——————————
1) You should have discounted parsing almost immediately and looked at SQL ordered by CPU and SQL ordered by executions.
2) You removed wastage of CPU – lowering the level of resource usage reduces the contention for the resource.
3) I don’t read the percentages – I look at the absolute values. But so little CPU was used by parsing you needed to look elsewhere.
5) (No 4) Your parse count is quite high – but you haven’t shown the ‘session cursor cache hits’ which would allow you to discount the “parse count” – and your execute count is much higher, suggesting a better place to look for resource usage.
6) Possibly due to the point in time when costs are accumulated by the different statistics – some information is accumulated every few seconds, some is accumulated only on completion of a call.
7) Yes.
Comment by Jonathan Lewis — April 27, 2008 @ 6:17 pm BST
Will the report show all the performance figures, or will it show only the top bad performers?
Comment by bhanu — February 18, 2009 @ 11:37 am GMT
Bhanu,
It’s only the top few bad performances, but this is often enough to make a big difference.
Occasionally you find that none of the top SQL are particularly extreme; in which case it’s worth looking at the entire list to see if they follow some sort of pattern (common table, or common index, say) which allows you to make one structural change to address many statements.
Comment by Jonathan Lewis — March 21, 2009 @ 11:13 pm GMT
[…] Original article: https://jonathanlewis.wordpress.com/2007/01/07/analysing-statspack-3/ […]
Pingback by xpchild » Analysing Statspack (3) — June 27, 2011 @ 7:57 am BST