I was on a site recently that had a standard strategy for installing Statspack on every single one of its databases – including the 10g databases. They also happened to have the licences that allowed them to run the AWR Report – which was rather nice because there are little bits of the Statspack report that don’t appear in the AWR report, and vice versa.
One little anomaly did hit me in the eye, though. Here are the ‘Top five’ sections from matching Statspack and AWR reports from one of their 10.2.0.2 databases. Spot the difference:
Statspack:
Snapshot       Snap Id      Snap Time
~~~~~~~~    ~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
Begin Snap:      22048 31-Oct-07 04:59:26
  End Snap:      22052 31-Oct-07 05:59:27
   Elapsed:                 60.02 (mins)
Top 5 Timed Events                                             Avg %Total
~~~~~~~~~~~~~~~~~~                                            wait   Call
Event                                Waits     Time (s)       (ms)   Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~ ~~~~~~ ~~~~~~
CPU time                                          1,738              85.4
db file sequential read            219,767          192          1    9.4
SQL*Net message from dblink          1,549           65         42    3.2
db file scattered read              40,632           12          0     .6
log file parallel write              4,537           10          2     .5
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AWR:
               Snap Id      Snap Time
             ~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~
Begin Snap:       5533 31-Oct-07 05:00:22
  End Snap:       5534 31-Oct-07 06:00:35
   Elapsed:                 60.20 (mins)
   DB Time:                 59.37 (mins)
Top 5 Timed Events                                       Avg %Total
~~~~~~~~~~~~~~~~~~                                      wait   Call
Event                            Waits  Time (s)  (ms)  Time  Wait Class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~ ~~~~~~~~~ ~~~~~ ~~~~~~ ~~~~~~~~~~
CPU time                                    3,211        90.1
db file sequential read        219,977        192     1   5.4  User I/O
db file scattered read          40,632         12     0   0.3  User I/O
log file parallel write          4,599         10     2   0.3  System I/O
db file parallel write           6,617          8     1   0.2  System I/O
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You’ll notice that the reports are out of synch by about one minute. That’s not the reason for the disappearance of the “SQL*Net message from dblink” wait, by the way – that wait has been demoted to the “idle” events in 10g (and I’m not convinced that’s right).
The most significant difference, though, is in the CPU time. There were only eight CPUs in the box, and CPU recording is very fine-grained in 10g – so the one-minute slippage isn’t enough to account for a difference of 1,500 CPU seconds in usage.
So where has that difference come from? The answer lies in the system statistics:
[Edited: 8th Nov 2007: No, it’s not in the system statistics – the figures reported in the next few lines were just an unlucky coincidence that pointed me in completely the wrong direction.]
AWR:
Statistic                        Total   per Second    per Trans
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~~~ ~~~~~~~~~~~~
CPU used by this session       173,859          48.1         33.0
recursive cpu usage            157,926          43.7         29.9
                               ~~~~~~~
                               331,785
Statspack:
Statistic                        Total   per Second    per Trans
~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~~~~ ~~~~~~~~~~~~
CPU used by this session       173,807          48.3         32.9
recursive cpu usage            157,926          43.9         29.9
Notice that the AWR report of CPU seems to be derived by adding “CPU used by this session” to “recursive cpu usage” – nearly.
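For anyone who wants to check the numbers on their own system, both statistics come straight from v$sysstat; here is a minimal sketch of the query (the values are reported in centiseconds):

```sql
-- The two CPU statistics that AWR appears (nearly) to be summing.
-- v$sysstat reports these values in centiseconds.
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('CPU used by this session', 'recursive cpu usage');
```

To reproduce the figures for a snapshot interval you would, of course, take the difference between the values captured at the two snapshots rather than the raw running totals.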
Of course AWR and Statspack can’t both be right – and it’s a bit alarming that when you switch from one tool to the other you can get a much larger figure for CPU usage suddenly appearing in the headline stats.
The problem is this – which is “right”, Statspack or AWR?
It could be that AWR (in its MMON guise) is just being a little more clever about accounting for CPU, and about the way it decides what is, or isn’t, recursive CPU time.
The results above may even be down to a bug that’s been fixed by 10.2.0.3, because the first couple of experiments I put together didn’t behave the same way as the client’s 10.2.0.2 system.
Whatever the result, watch out for changes to CPU reporting when you switch from Statspack to AWR. The figures may change dramatically, and you may find it hard to work out whether it’s the new figures or the old that are correct.
Do you have the DB CPU value for these cases?
I’d cross-verify this info using trace.
The description of recursive CPU usage doesn’t seem to have changed for a while.
On a side note, the “non-user” in the definition seems a little strange, as a call could be non-user (as in “automatic”) and user-made (as in “non-Oracle, 3rd party”) at the same time.
Comment by Vlad Sadilovskiy — November 1, 2007 @ 11:20 pm GMT
Vlad,
Thanks for the note. The claim in the blog is almost certainly wrong, and I’ve just written in to the Oracle-L list to recant on this one; but there are still a couple of follow-up questions I’m going to have about the implications – so I’ll be leaving the blog in place and commenting on it in a few days.
DB CPU: The statspack figure is 3,210.5 (i.e. the total CPU (approx) as seen also by AWR).
The discrepancy between Statspack and AWR is probably explained (plus or minus a bit) in the OS Statistics from AWR:
SYS_TIME                 8,305
USER_TIME              175,722
OS_CPU_WAIT_TIME       172,000
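Those figures are captured from the v$osstat view; a sketch of the query, for anyone wanting to check their own system (values are in centiseconds):

```sql
-- OS-level CPU figures as recorded by Oracle in v$osstat (centiseconds).
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('SYS_TIME', 'USER_TIME', 'OS_CPU_WAIT_TIME');
```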
I assume you mean this from the reference manuals for the recursive CPU: “Total CPU time used by non-user calls (recursive calls). Subtract this value from “CPU used by this session” to determine how much CPU time was used by the user calls.”
Bear in mind that the manuals often get out of date, and aren’t always exact and unambiguous at the best of times. I suspect that definition hasn’t changed since 5.1 – which explains why lots of people were still trying to “reduce the recursive CPU” ten years after pl/sql blocks came on the scene.
“Non-user” in this context means the CPU due to function calls not made directly by the user. The statistic is, after all, only interested in CPU attributed to the Oracle process – it doesn’t know (or care) what is at the client end of the two-task connection.
Comment by Jonathan Lewis — November 2, 2007 @ 10:08 am GMT
[…] figures of the later when reporting Top 5 Timed Events. This perhaps holds an explanation for the AWR oddity seen by Jonathan Lewis. It would be interesting to find the relevant clues in the reports or that […]
Pingback by Recursive Calls « Observing Oracle — November 5, 2007 @ 5:26 am GMT
Thanks, Jonathan. I also did my share of testing, but was not able to reproduce the issue you posted.
It is definitely strange to see this big a gap. The possibilities for testing scenarios could be endless. For instance, it is possible that v$sys_time_model and v$sysstat hold information from different or intersecting sources, as I mentioned in the “recursive calls” post. Or perhaps some rare tools or access methods or patterns do not register CPU utilization in v$sysstat.
I wonder if this can be clarified with OS metrics captured by OS tools, rather than from within Oracle.
It will be very interesting to discover what AWR included in the CPU time/DB CPU (from the sys time model).
Comment by Vlad Sadilovskiy — November 5, 2007 @ 9:09 pm GMT
I believe that the issue here is that the data reported as CPU time in 10g comes from the DB CPU value in v$sys_time_model, as opposed to the ‘CPU used by this session’ value in v$sysstat. There is a major difference between these two values: the v$sysstat value is only updated at the end of a call, whereas the v$sys_time_model value is updated every 5 seconds.
In a situation where there are calls that span snapshots, this can lead to gross under/over-accounting of the v$sysstat value; for example, a call that takes around 3 hours to complete will appear as zero CPU usage for the first two hours, followed by 3 hours of CPU usage in the last hour.
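The two sources being described can be compared directly; a sketch, converting both to seconds (v$sys_time_model reports microseconds, v$sysstat centiseconds):

```sql
-- Compare the time-model CPU figure (reportedly the source of the
-- 10g AWR "CPU time" line) with the traditional v$sysstat figure
-- (the source Statspack uses). Both converted to seconds.
SELECT 'DB CPU (v$sys_time_model)' AS source, value / 1e6 AS seconds
FROM   v$sys_time_model
WHERE  stat_name = 'DB CPU'
UNION ALL
SELECT 'CPU used by this session (v$sysstat)', value / 100
FROM   v$sysstat
WHERE  name = 'CPU used by this session';
```

With a long-running call in progress, the time-model figure should keep climbing while the v$sysstat figure stays flat until the call completes.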
Comment by Graham Wood — November 7, 2007 @ 10:46 pm GMT
Graham,
Thanks for taking the time to comment – especially since you must be very busy getting ready for OpenWorld at the moment.
I can’t do anything at the moment to confirm that your observation is the solution to the puzzle, but I will be back on the client site in a couple of weeks and will check to see if I can find the balancing CPU in a later snapshot.
Ideally, of course, if there had been a single call (or small set of calls) responsible for the difference, I would have seen something reported with zero executions in the “SQL ordered by CPU” section of the AWR report, with no matching report in any of the equivalent sections of the spreport.
Unfortunately, my AWR report doesn’t show any expensive code with zero executions; and Statspack has been configured with some very large thresholds, so there is virtually nothing in any of the SQL listings – which means I can’t sensibly compare the two to check for samples of SQL (or PL/SQL) with small differences in execution counts but large differences in resource consumption.
To be continued….
Comment by Jonathan Lewis — November 8, 2007 @ 10:35 am GMT
[…] to things Oracle, Jonathan Lewis’s Oracle Scratchpad has an item on an AWR oddity. No, not its licensing terms, it’s a discrepancy between time stats as reported by statspack […]
Pingback by Pythian Group Blog » Blog Archive » Log Buffer #70: a Carnival of the Vanities for DBAs — November 9, 2007 @ 6:10 pm GMT