Oracle Scratchpad

January 7, 2007

Analysing Statspack (3)

Filed under: Statspack — Jonathan Lewis @ 9:22 pm BST Jan 7,2007

[Further Reading on Statspack]

The output I want to look at in this example doesn’t come from statspack – but it does give you an important reminder about how statspack can deceive because it hides information (by averaging). The following text comes from a post on Ask Tom:

We have a three tier application with Oracle 8i as the database. We are experiencing timeout issues (from the application log file) almost every day. Most of the time the system is very slow.

I get the following information from v$system_event:


Thu Dec 14
                               System-wide Wait Analysis
                                for current wait events
Event                                Total       Total
Name                                 Waits    Timeouts
------------------------------ ----------- -----------
null event                               1           1
latch free                          13,484      13,478
pmon timer                         273,516     272,088
rdbms ipc reply                      4,390           0
rdbms ipc message                1,421,940   1,145,272
enqueue                              4,302           6
reliable message                         1           0
control file sequential read       176,526           0
control file parallel write        324,022           0
refresh controlfile command             17           0
local write wait                       558           0
buffer busy waits                   32,807           4
log file sequential read            56,750           0
log file single write                8,384           0
log file parallel write            148,912           0
LGWR wait for redo copy                 59           7
log buffer space                    70,206           1
log file switch (checkpoint in       1,071         545
complete)
switch logfile command               1,410           0
log file switch completion           3,097         443
log file sync                       59,280          64
db file sequential read          6,830,195           0
db file scattered read           9,829,549           0
db file parallel write               7,847           0
direct path read                 7,408,664           0
direct path write                2,619,359           0
direct path read (lob)           4,030,001           0
direct path write (lob)              1,275           0
instance state change                    1           0
smon timer                           2,739       2,732
library cache pin                        3           0
file identify                       21,002           0
file open                          167,006           0
virtual circuit status              27,313      27,313
dispatcher timer                    13,661      13,661
SQL*Net message to client       82,049,522           0
SQL*Net more data to client     28,064,378           0
SQL*Net message from client     82,049,497           0
SQL*Net more data from client        6,594           0
SQL*Net break/reset to client       52,752           0

Of course, this looks nothing like statspack, and doesn’t have any timing information at all – but it does contain some of the most commonly used information. So what can you see in these figures, and what questions could you ask?

First of all, we can see from the smon timer that the database has been up a little under 10 days (smon times out every 5 minutes, pmon every three seconds) so, whatever has been happening, any important peaks in activity will have been smoothed off and may be invisible.
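
As a quick cross-check on that arithmetic (a sketch, assuming you can query the instance directly): v$instance reports the startup time, so you don’t have to infer the uptime from the timer events.

select startup_time,
       round(sysdate - startup_time, 1) days_up   -- 2,732 smon timeouts x 5 minutes is roughly 9.5 days
from   v$instance;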

Most of the time the system is slow:

Just looking for big numbers, it’s fairly “obvious” that the problem is I/O – look at the millions of sequential, scattered, and direct path reads. Except, of course, this is a 10 day summary – so a total of 24 Million (6.8 + 9.8 + 7.4) physical read requests is 2.4M per day, which is 100,000 per hour which is only 30 per second – which doesn’t sound so big after all. Even if all the work is done in an 8-hour day, that’s only 90 per second. Can we infer a threat? Not without timing and a finer granularity.
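
To save doing that arithmetic by hand you could let the database do it for you. A rough sketch, which simply divides the cumulative counts by the seconds since startup (it still hides the peaks, of course):

select e.event,
       e.total_waits,
       round(
         e.total_waits / ((sysdate - i.startup_time) * 86400), 1
       )                                    waits_per_second
from   v$system_event e,
       v$instance     i     -- single row, so it just supplies the startup time
where  e.event in (
         'db file sequential read',
         'db file scattered read',
         'direct path read'
       )
order by
       e.total_waits desc;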

Of course, we can also see 82 Million SQL*Net “roundtrips” – and quite a lot of those round-trips also seem to entail “SQL*Net more data to client”. That seems to be a very busy client and a busy network: until you work out per-second activity, which turns it into 100 round-trips per second (or, for an 8-hour working day, 300 per second). Can we infer a threat? This time we might be rather more suspicious, except we don’t know how many concurrent sessions there are to make that many trips – maybe it’s a well-honed 3,000 user/session system, and the real problem is CPU – but for the moment I’m going to flag up application design as a potential issue. Again though, we need better granularity – we need to see peaks and troughs in workload.

Then there’s the matter of the LOBs – we see lots of “direct path read (lob)”. About half the direct path reads are LOBs. This suggests the LOBs are declared as nocache. It seems likely that we could get some performance wins by caching these LOBs to reduce I/O – except we don’t know that I/O is the problem, and we don’t know what side-effects we may introduce by caching those LOBs, and we can’t guess the potential benefit of caching them anyway – but we can flag it up for further investigation.
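
If you want to check how the LOBs have been declared before speculating any further, dba_lobs shows the cache and logging settings. (A sketch only; the table and column names in the second statement are made up for illustration, and the side-effects of caching would need testing first.)

select owner, table_name, column_name, cache, logging
from   dba_lobs
order by owner, table_name, column_name;

-- a possible, but not automatically safe, change:
alter table my_lob_table modify lob (my_lob_column) (cache);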

We are experiencing timeout issues

Some of the small numbers – even with relatively small amounts of time lost (not that it’s recorded) – might be the cause of this problem. And those timeouts might be telling us why “the system is slow” the rest of the time.

There are 4,000 enqueues – that’s only 400 per day, which is only 50 per hour if we assume that they occur in the working day – and most of them persist for less than 3 seconds (the typical enqueue timeout interval). It might take just a couple of seconds of simple data locking to cause an application time-out – and we are seeing an average of a little under one enqueue wait per minute. But, again, we don’t know how many concurrent sessions there are, and how to share out these 4,000 waits.
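
If the enqueue waits start to look interesting, the obvious follow-up question is which enqueue types are involved. From 9i onwards v$enqueue_stat gives the breakdown by type (on 8i, as here, you would have to sample v$lock or dig into the underlying x$ structures). A sketch:

select eq_type, total_req#, total_wait#, cum_wait_time
from   v$enqueue_stat
where  total_wait# > 0
order by
       cum_wait_time desc;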

There are 1,071 “log file switch (checkpoint incomplete)” waits – when one of those happens you can’t update the database so maybe that helps to explain the application timeouts (not that these people seem to be doing much in the way of updating the database – but we can’t really tell that from wait states). Strangely enough, there are 1,410 waits for “switch logfile command” – in 10 days, that means that something is doing “alter system switch logfile” roughly every 10 minutes. Maybe these events are connected – maybe not.

The fact that the log file switch has been coded suggests that someone was thinking ‘busy night time, quiet day time’ – setting dbms_job to switch log files in the day time so that they could have large log files for the night-time processing. Is this a hangover from another system? Standard in-house practice? Or is it an indication that very little update happens during the day, and any clues about update contention in the figures should be ignored as far as our hunt for end-user problems is concerned?
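
Purely for illustration, a job like that might have been put in place with something along the following lines (a sketch, not the code from this system; the ten-minute interval is just the figure implied by the numbers above, and the job owner would need the alter system privilege):

declare
  m_job number;
begin
  dbms_job.submit(
    job       => m_job,
    what      => 'begin execute immediate ''alter system switch logfile''; end;',
    next_date => sysdate,
    interval  => 'sysdate + 10/1440'   -- roughly every 10 minutes
  );
  commit;
end;
/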

Even so, there is something very strange about the log-related figures – we see hardly any data block writes, yet we have indications of excess log file activity (and we must be generating quite a lot of redo, or we wouldn’t be likely to see checkpoint incomplete and log buffer space waits unless the log buffer and log files were incredibly tiny – better check the basic configuration before worrying too much about statspack).
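
Those configuration checks need nothing more sophisticated than a couple of quick queries; a sketch:

select group#, bytes/1024/1024 size_mb, members, status
from   v$log;

select name, value
from   v$parameter
where  name = 'log_buffer';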

The “LGWR wait for redo copy” is also an interesting anomaly. It doesn’t happen often, and you would normally associate it with a system doing lots of small highly concurrent updates – which this system is clearly not doing. It’s only a tiny number, but it’s another hint that CPU saturation may be a problem.

Possibly the system keeps updating a few items of data in the same few blocks and only commits occasionally (there are only 59,000 commits (log file syncs) in 10 days) – that could help to explain the logging symptoms and the buffer busy waits; again, we have to go back to the application for inspiration. At a minimum we really need to see v$sysstat to see the volume of redo, not just the number of resulting write waits.
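
The first pass at that cross-check is just a handful of rows from v$sysstat; for example:

select name, value
from   v$sysstat
where  name in (
         'redo size', 'redo entries',
         'user commits', 'user rollbacks',
         'db block changes', 'physical writes'
       );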

Let’s get back to the LOBs – there are just a few “direct path write (lob)” waits. If the LOBs are declared as “nocache logging” a few large LOB inserts or updates might be just enough to cause a few peaks of log buffer activity, jamming the system briefly, whilst still looking relatively harmless on average. Again we see the need for finer granularity simply to get a clue about whether the problems relate to peaks in activity which simply disappear when the statistics are averaged across a long time.

Another thought about LOBs, of course, is what is being done to them after they are read? Is it some PL/SQL post-processing in Oracle, or even the processing in the middle tier, that is taking up all the time? Is it really just a complete waste of time looking at wait events in this case? (XML springs to mind as a culprit – but that’s not based on a scientific analysis of the available data.)

Conclusions

The SQL*Net activity hints at the possibility that the middle tier is doing something very inefficient, but apart from that one very vague suggestion there is nothing in this report from v$system_event that tells you where to make good use of your time.

There are a couple of quick configuration sanity checks (log file and log buffer sizing) and the question about why there is a 10-minute log file switch (which could be replaced in 9i by setting archive_lag_target). But these don’t look as if they are going to be particularly relevant to the end-user performance issues.
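
For reference, and purely as a sketch assuming a 9i instance running from an spfile: the parameter is dynamic and is set in seconds, so a 10-minute lag target would be

alter system set archive_lag_target = 600 scope = both;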

There is the (notionally) easy option to switch the LOBs to caching – but that might mean some other infrastructure changes which need careful investigation, and it is not certain that this would improve the situation.

There are several other inferences that you could make about where the problems might be. (It’s taken me three hours to type up this note, but it took me about 5 minutes of reading the stats to come up with several more possibilities – which I haven’t written down because (a) it would take too long and (b) in real life I would eliminate most of them by cross-checking against v$sysstat, v$enqueue_stat, v$filestat, v$sql_workarea, v$segstat, v$latch, v$sql and a few others – but mostly v$sysstat/v$sesstat).

Clearly you need to see timing information to get a better idea of where things are slow; but you also need a much finer granularity of reporting – which you get from taking a statspack snapshot at level zero every 15 minutes or so – so that you don’t average out the peaks and hide the problems. And in this case, a report across one or two 15-minute snapshots around the time of an application timeout might also produce some matching Oracle symptoms or anomalies.
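
Scheduling those level zero snapshots is a single dbms_job submission from the perfstat schema. A sketch, along the lines of the supplied spauto.sql script (which does the same thing hourly):

declare
  m_job number;
begin
  dbms_job.submit(
    job       => m_job,
    what      => 'statspack.snap(i_snap_level => 0);',
    next_date => sysdate,
    interval  => 'sysdate + 15/1440'   -- roughly every 15 minutes
  );
  commit;
end;
/

-- and later, to report across a pair of snapshots:
-- @?/rdbms/admin/spreport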

However, looking at wait times alone is not enough. You need, as a minimum, to know about the volume of work being done behind those waits – how much redo was generated to cause the redo writes, what sort of activity was behind the direct path reads, was the low volume of physical I/O the consequence of an extreme volume of logical I/O, and so on. So you need to be able to cross-reference wait time against workload in various ways.

The difficulty, ultimately, is that if you don’t have a reasonably scoped set of data, and a reasonably complete set of figures, and a reasonable understanding of how different statistics relate to each other, you just end up with a great long list of “maybes” when you really need a short list of “likelies” to make best use of your time.

Whenever I take on a trouble-shooting engagement, I try to end up with a report that lists about five options in order of desirability – because that’s what the client usually needs.

I also aim to have a couple of those options rated as “probably the best thing to do”, a couple of “might help”, and a couple of “easy, but probably not much benefit”. I also make sure that I give a clear idea of the investment required for each option, and the benefit.

[Further Reading on Statspack]

Footnote: Any advice about reading Statspack reports is almost always relevant when reading AWR reports (and vice versa).

10 Comments »

  1. Our client has a similar (3-tier) architecture, and the same symptoms (the system is slow), but even Statspack reports do not reveal any real database bottlenecks – except randomly run batch jobs which hog the system every now and then, and that does not explain the reasons for being slow all the time.
    Our application runs on J2EE and uses Hibernate. We (DBAs) know it does excessive reads from the database (you wouldn’t believe how much is being read in order to fill a few widgets in the GUI) and updates every table column in a row even though only one changes value, but we have a hard time persuading the Java programmers to optimize, because sometimes the system just runs OK. So it must be the database :)

    Anyhow I find your series of Statspack articles to be most instructive reading about troubleshooting Oracle performance problems – so please keep up the good work.

    Comment by matjazc — January 7, 2007 @ 11:15 pm BST Jan 7,2007 | Reply

  2. >>
    Clearly you need to see timing information to get a better idea of where things are slow; but you also need a much finer granularity of reporting – which you get from taking a statspack snapshot at level zero every 15 minutes or so – so that you don’t average out the peaks and hide the problems. And in this case, a report across one or two 15-minute snapshots around the time of an application timeout might also produce some matching Oracle symptoms or anomalies.
    >>

    Spot on and I do agree with the above statement. The three hours of work you have done analysing the report and explaining the possible causes – I feel it is simply guesswork without any real timing information.

    Since the output was generated from the v$ performance views, and these views accumulate statistics from the moment the database starts up, it could be that most of the problems occurred in one day, or even one hour, somewhere in those 10 days, for multiple reasons – but sadly it all ends up in the v$sys* views as accumulated values, which makes it really hard to figure out the exact problems.
    As you suggest, making a habit of taking a statspack report every 15 minutes (or using the automated AWR in 10g) is the way to go.
    >>
    Next in this series: a strategy for becoming more familiar with statspack.
    >>

    I really look forward to your statspack series.

    Jaffar

    Comment by Jaffar — January 8, 2007 @ 2:58 pm BST Jan 8,2007 | Reply

  3. Jaffar, Guesswork is correct – that’s why I only spent 5 minutes on looking at the report.

    The 3 hours was the time it took to write down why the observations from the 5 minutes shouldn’t be considered anything more than a slight clue that might actually point you in the wrong direction.

    Comment by Jonathan Lewis — January 8, 2007 @ 5:18 pm BST Jan 8,2007 | Reply

  4. Hi Jonathan,

    I suspect that my system is affected by cache buffers chains latch related issues. I have made a brief analysis in Tom Kyte’s ‘Ask Tom’ forum: http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:3675961676045#725258000346571232 (Reviewer: Nilanjan Ray from UK)

    Related to my post I would like to:

    1. Quantify the CPU consumption percentage due to latch spin. Is it possible through any 10g metrics?

    2. What does v$latch.wait_time tell me?
    I have seen in my system, for the cache buffers chains latch, cases where there are zero sleeps but the misses and spin gets are noticeable – yet, interestingly, the wait_time is zero. Does this mean that v$latch.wait_time (or stats$latch.wait_time or dba_hist_latch.wait_time) is a measure of how much the spinning cost for the latch? On the other hand, whenever I have sleeps I get a non-zero wait_time. Is v$latch.wait_time sufficient to reflect CPU issues due to latch spin? Your opinions and views on this please.

    Comment by Nilanjan Ray — March 25, 2008 @ 10:05 pm BST Mar 25,2008 | Reply

  5. Nilanjan,

    1. I don’t think there’s any way of quantifying the CPU due to latch spins. But if I see a lot of latch spins – combined with multiple sleeps on the latch – I may decide that there is excessive activity on the latch which needs to be addressed by reviewing the use of the objects protected by that latch. (i.e. I use the latch spin count as an indicator of the underlying issue, rather than an end in itself).

    2. There are version-dependent issues with v$latch.wait_time. It should reflect the time reported for latch sleeps in v$system_event but there are a couple of bugs reported in Metalink about the times not being updated correctly (particularly for shared hash latches, I believe – and cache buffers chains is a sharable latch).

    Since you have to spin before you sleep, I am prepared to work on the assumption that even a relatively low number of sleeps on a latch may be associated with a performance issue because of CPU time spent in the preliminary spinning.

    I have to say, though, that when latch sleeps or latch spins stand out there are usually other things that demand my attention first – and usually explain the latch activity.
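
    For anyone who wants to check the figures Nilanjan describes on their own system: they all sit in a single row of v$latch (with historical copies in stats$latch or dba_hist_latch) – a minimal sketch:

    select name, gets, misses, spin_gets, sleeps, wait_time
    from   v$latch
    where  name = 'cache buffers chains';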

    Comment by Jonathan Lewis — April 1, 2008 @ 5:02 pm BST Apr 1,2008 | Reply

  6. Thanks Jonathan. I did notice the formatting thing too after I hit the submit button. Here’s my second attempt, with blank spaces replaced by non-breaking spaces. Please feel free to delete my previous post.
    —————————————————————————————————————————–
    We had a situation where our CPU maxed out and there were a number of processes in the run queue (seen through vmstat).

    We often have situations where we see a number of processes in the run queue and server idle time drops to 0%:

    I have an AWR report covering one of these representative periods. Though this is a 6-minute snap, longer periods of 15 minutes and 1 hour are representative to the same extent. The highlights of this are:

                  Snap Id      Snap Time      Sessions Curs/Sess
                --------- ------------------- -------- ---------
    Begin Snap:       979 27-Mar-08 12:36:06       135     158.1
      End Snap:       980 27-Mar-08 12:42:16       134     161.5
       Elapsed:                6.18 (mins)
       DB Time:               85.48 (mins)

    Cache Sizes
    ~~~~~~~~~~~                       Begin        End
                                 ---------- ----------
                   Buffer Cache:     4,096M     4,096M  Std Block Size:         8K
               Shared Pool Size:     1,536M     1,536M      Log Buffer:    10,560K

    Load Profile
    ~~~~~~~~~~~~                            Per Second       Per Transaction
                                       ---------------       ---------------
                      Redo size:            528,708.54             48,199.49
                  Logical reads:            182,475.11             16,635.27
                  Block changes:              2,913.58                265.62
                 Physical reads:                 85.12                  7.76
                Physical writes:                 38.46                  3.51
                     User calls:                823.61                 75.08
                         Parses:              1,866.30                170.14
                    Hard parses:                  2.70                  0.25
                          Sorts:                929.24                 84.71
                         Logons:                  1.15                  0.10
                       Executes:              6,151.98                560.84
                   Transactions:                 10.97

      % Blocks changed per Read:    1.60    Recursive Call %:    96.21
     Rollback per transaction %:    0.69       Rows per Sort:    15.45

    Instance Efficiency Percentages (Target 100%)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        Buffer Nowait %:  100.00       Redo NoWait %:  100.00
        Buffer  Hit   %:   99.96    In-memory Sort %:  100.00
        Library Hit   %:   99.95        Soft Parse %:   99.86
     Execute to Parse %:   69.66         Latch Hit %:   99.53
    Parse CPU to Parse Elapsd %:   76.85     % Non-Parse CPU:   99.05

     Shared Pool Statistics        Begin    End
                                  ------  ------
                 Memory Usage %:   79.59   79.24
        % SQL with executions>1:   89.18   85.39
      % Memory for SQL w/exec>1:   92.13   87.34

    Top 5 Timed Events                                         Avg %Total
    ~~~~~~~~~~~~~~~~~~                                        wait   Call
    Event                                 Waits    Time (s)   (ms)   Time Wait Class
    ------------------------------ ------------ ----------- ------ ------ ----------
    CPU time                                          4,127          80.5
    db file sequential read              28,156         167      6    3.2   User I/O
    enq: TX - row lock contention            26          69   2672    1.4 Applicatio
    log file parallel write               3,569          50     14    1.0 System I/O
    log file sync                           765          28     37    0.5     Commit

    Time Model Statistics                  Snaps: 979-980
    -> Total time in database user-calls (DB Time): 5129.1s
    -> Statistics including the word "background" measure background process
       time, and so do not contribute to the DB time statistic
    -> Ordered by % or DB time desc, Statistic name

    Statistic Name                                       Time (s) % of DB Time
    ------------------------------------------ ------------------ ------------
    sql execute elapsed time                              5,039.6         98.3
    DB CPU                                                4,127.1         80.5
    parse time elapsed                                      886.9         17.3
    hard parse elapsed time                                 833.5         16.3
    PL/SQL execution elapsed time                           489.7          9.5
    .
    .
    DB time                                               5,129.1          N/A
    background elapsed time                                 101.6          N/A
    background cpu time                                      30.6          N/A
      -------------------------------------------------------------

    Since the contribution of the wait time is negligible to the total response time, I have not included anything from the Wait section.

    With the above data I would think:

    1. Since the contribution of the wait time is negligible to the total response time, the bottleneck was CPU. This is evident from the run queue being built up.
    2. Is parsing causing a CPU bottleneck?
    3. Can't rule out latch activities.

    Let's look at the parse-related statistics:

    Statistic                                     Total     per Second     per Trans
    -------------------------------- ------------------ -------------- -------------
    CPU used by this session                    306,594          827.3          75.4
    parse count (failures)                            3            0.0           0.0
    parse count (hard)                            1,002            2.7           0.3
    parse count (total)                         691,621        1,866.3         170.1
    parse time cpu                                3,914           10.6           1.0
    parse time elapsed                            5,093           13.7           1.3

    4. Looking at the % of DB Time spent on parsing (parse time elapsed and hard parse elapsed time), it seems that parsing could be a major contributor to the CPU bottleneck.
    5. However, looking at the % of CPU consumed by parsing, it is only 1.28% (3,914/306,594*100). This is also reflected in the Instance Efficiency % stats of the summary section (% Non-Parse CPU: 99.05).

    What remains next is to look at the latch activities because extensive latch spins/sleeps can contribute to the CPU usage:

    6. I am not interested in the latch waits, because overall the wait time is not a significant component of the response time.
    A latch posts a wait only when it goes to sleep after exhausting its complete spin cycle.

    The ones with the most requests are

                                               Pct    Avg   Wait                 Pct
                                        Get    Get   Slps   Time       NoWait NoWait
    Latch Name                     Requests   Miss  /Miss    (s)     Requests   Miss
    ------------------------ -------------- ------ ------ ------ ------------ ------
    cache buffers chains        115,809,318    0.3    0.0      3      144,650    0.0
    library cache                 1,422,483    0.2    0.2      2        2,305    0.5
    row cache objects             4,566,016    0.9    0.0      0        1,017    0.0
    session allocation            7,566,491    4.0    0.0      0            0    N/A

    Let's look at some Latch related statistics now.

    Latch Sleep Breakdown                  Snaps: 979-980
    -> ordered by misses desc

    Latch Name
    ----------------------------------------
      Get Requests      Misses      Sleeps  Spin Gets   Sleep1   Sleep2   Sleep3
    -------------- ----------- ----------- ---------- -------- -------- --------
    session allocation
         7,566,491     299,860         571    299,300        0        0        0
    cache buffers chains
       115,809,318     297,482       3,927    293,769        0        0        0
    row cache objects
         4,566,016      42,177          85     42,093        0        0        0
    simulator lru latch
         4,661,844      33,796         189     33,609        0        0        0
    library cache
         1,422,483       2,555         456      2,106        0        0        0
    simulator hash latch
         4,670,410       1,573           6      1,567        0        0        0
      -------------------------------------------------------------

    7. Looking at the Latch Sleep Breakdown (misses -> spin gets -> sleeps) leaves little doubt that the cache buffers chains latch is the reason for the high CPU utilization.
    8. At this point I think it is safe to assume that parsing is not the cause of the CPU bottleneck.

    To get a more concrete idea on the cache buffer chains latch, let's look at the Latch misses source data:

    Latch Miss Sources                    Snaps: 979-980
    -> only latches with sleeps are shown
    -> ordered by name, sleeps desc

                                                         NoWait              Waiter
    Latch Name               Where                       Misses     Sleeps   Sleeps
    ------------------------ -------------------------- ------- ---------- --------
    In memory undo latch     ktiFlush: child                  0          9        0
    KWQMN job cache list lat kwqmndej: dequeue                0          1        1
    SQL memory manager worka qesmmIRegisterWorkArea           0          1        0
    cache buffer handles     kcbzfs                           0          2        0
    cache buffer handles     kcbzgs                           0          1        3
    cache buffers chains     kcbgtcr: kslbegin excl           0      3,278    4,260
    cache buffers chains     kcbrls: kslbegin                 0      1,731      869
    cache buffers chains     kcbgtcr: fast path               0        245      128
    .
    .
    library cache            kglic                            0        426        4
    library cache            kgldte: child 0                  0          7       65
    library cache            kglpnp: child                    0          5      111
    .
    .
    shared pool              kghalo                           0         17        3
    shared pool              kghupr1                          0          8       20
    shared pool              kghfre                           0          1        2
    shared pool              kghfrunp: clatch: nowait         0          1        0

    9. At this point, looking at the Sleeps and Waiter Sleeps statistics, I would be almost certain that the CPU bottleneck is due to the cache buffers chains latch.
    10. The next thing to look for is the tables and SQL with the most logical I/O, since these would put most pressure on the cache buffers chains latch.
    The next snippet shows the tables with the most LIO.

    Segments by Logical Reads              Snaps: 979-980
    -> Total Logical Reads:      67,622,358
    -> Captured Segments account for   81.7% of Total

       Tablespace                      Subobject  Obj.       Logical
    Owner         Name    Object Name            Name     Type         Reads  %Total
    ---------- ---------- -------------------- ---------- ----- ------------ -------
    SYS        SYSTEM     TS$                             TABLE   18,903,392   27.95
    SCH1       SCH1_DA    ADC_SPOT                        TABLE    6,633,776    9.81
    SCH1       SCH1_DA    SSP_GTMP_PROG_COPY_A S45        TABLE    3,119,376    4.61
    SCH1       SCH1_IN    SDC_CAMPAIGN_MONTH_P            INDEX    3,019,088    4.46
    SCH1       SCH1_DA    ADC_TRAIL_SPOT_PK               INDEX    2,367,776    3.50
      -------------------------------------------------------------    

    11. The SQL with the most I/O turned out to be a SQL statement from SYS. This is parsed every time it is executed.

    SQL ordered by Parse Calls             Snaps: 979-980
    -> Total Parse Calls:         691,621
    -> Captured SQL account for      94.9% of Total

        % Total
     Parse Calls  Executions     Parses    SQL Id
    ------------ ------------ --------- -------------
         558,465      558,816     80.75 8sph6b5p41afr
    select min(bitmapped) from ts$ where dflmaxext =:1 and bitand(flags,1024) =1024

    12. On Metalink I found bug 5455880, which says ‘EXCESSIVE RECURSIVE QUERY ON TS$ WHEN USING A "TABLESPACE GROUP" AS TEMP TS’. We had a temporary tablespace group. Once we removed the temporary tablespace group, TS$ disappeared from the top segments, and so did the SQL.

    13. The improvement in the cache buffers chains latch, and in the system in general, can be seen from the AWR snippet below, taken the next day.
    This is from an hourly snap whereas the earlier one was a 6-minute snap:

    Latch Name
    ----------------------------------------
      Get Requests      Misses      Sleeps  Spin Gets   Sleep1   Sleep2   Sleep3
    -------------- ----------- ----------- ---------- -------- -------- --------
    cache buffers chains
       738,508,397     313,360       3,927    311,819        0        0        0

    14. However, the parse time elapsed has dropped by only 2%. This again confirms the initial observation on which I started my investigation: overall, the cache buffers chains latch is the root cause, and parsing comes way below suspicion.

    Having found the reason for the problem, what continues to baffle me is:

    1. What should have been the usual suspect to move ahead with the investigation: parsing, or excessive sleeps/spins on the cache buffers chains latch?
    2. Is it correct to conclude that removing the temporary tablespace group, and hence the SYS-generated recursive SQL, removed more of a latch-related performance issue than a parsing one?
    3. If the statistic % Non-Parse CPU is 99.05, should I concentrate on parse-related CPU issues to start off with? Or should I suspect parsing as a CPU issue at all?
    5. At any point in the report and my analysis, should I have suspected that parsing in general was enough of an issue to cause a performance problem?
    6. The ‘DB CPU’ does not correlate with ‘CPU used by this session’, and nor do the ‘parse time elapsed’ values in the Instance Activity statistics and the Time Model statistics.
    7. From the above data (CPU used by this session = 306,594 and parse time cpu = 3,914) would it be fair to say that the % of CPU spent parsing is only 1.67% and would have had a negligible impact on CPU?

    Could you please give your opinion on the above?

    Thanks
    Nilanjan

    Comment by Nilanjan Ray — April 17, 2008 @ 1:43 pm BST Apr 17,2008 | Reply

  7. Nilanjan,

    Sorry it’s taken so long to get back to this – I don’t usually let people use the blog as a substitute for AskTom, so things like this are likely to get a low priority.

    I’ve put in a couple of “code” tags to improve the formatting just a little more – but wordpress isn’t very friendly with fixed fonts.

    First observation:
    If you’re running close to 100% CPU, then any Oracle problem will get exaggerated. So you need to get rid of some of your “real work” CPU. As you’ve pointed out, your wait time is tiny by comparison – so your first task is to sort out where that CPU time is going.

    Your elapsed time is 6.18 minutes (about 371 seconds), with a CPU time recorded of 4,127 seconds. Since you’ve said you’re close to 100% CPU, I’ll assume this means you have roughly 11 CPUs – so I’ll guess 12 as a nearby “sensible” number.

    The figure that looks a little suspicious in the “Load Profile” is the Executes – at 6,151 per second. That seems a bit high when you have about 12 CPUs. Similarly “user calls” looks a little bit of a threat – given that it means possible round-trip activity and pressure on network stacks. The redo and logical I/O look a little busy, but for a machine with 12 CPUs the figures should be quite sustainable.

    There is an anomaly between the “Timed Event” CPU and the System statistics – the CPU usage is roughly 8.27 seconds per second from the latter, but about 11 seconds per second from the former. On 10g, though, this may come from the difference in how the two stats are measured – the difference could be due to long-running SQL statements crossing the snapshot boundaries.

    Parsing does not seem to be your problem – the parse time CPU is 10.6 per second against a total CPU of 827.3 per second – and you’ve got your own figures that took you to the same conclusion.

    Looking at the latches – don’t discount them, and don’t assume that it’s the cache buffers chains latch. The “session allocation” latch is running into a lot of misses and spins – which fits with the high numbers of executes and “user calls”. You may be losing quite a lot of CPU there – I’d also look at the O/S CPU statistics at this point to see how much of your CPU time is SYS time when the machine is busy: maybe your CPU is disappearing in O/S session management.

    That bug/symptom about temporary tablespace groups is nice to know. At this point, with a full AWR, I’d have been looking at the SQL statements using most CPU and with most executions – possibly following up with the ones with the highest version counts.

    The improvement in the cache buffers chains latch when you took out all these executions could simply have been due to the reduction in CPU usage allowing faster turnaround of sessions – i.e. it may not have been directly due to the reduction in cache activity, but to the availability of CPU allowing processes to get to the top of the run queue more quickly.

    So, answers to your questions:
    ——————————
    1) You should have discounted parsing almost immediately and looked at SQL ordered by CPU and SQL ordered by executions.

    2) You removed wastage of CPU – lowering the level of resource usage reduces the contention for the resource.

    3) I don’t read the percentages – I look at the absolute values. But so little CPU was used by parsing you needed to look elsewhere.

    5) (No 4) Your parse count is quite high – but you haven’t shown the ‘session cursor cache hits’ figure, which would allow you to discount “parse count” – and your execute count is much higher, suggesting a better place to look for resource usage (see the sketch after this list).

    6) Possibly due to the point in time when costs are accumulated by the different statistics – some information is accumulated every few seconds, some is accumulated only on completion of a call.

    7) Yes.
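
    (Relating to point 5, a minimal sketch of that cross-check: the relevant counters are all in v$sysstat, or in the Instance Activity section of the report:)

    select name, value
    from   v$sysstat
    where  name in (
             'parse count (total)', 'parse count (hard)',
             'session cursor cache hits', 'execute count'
           );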

    Comment by Jonathan Lewis — April 27, 2008 @ 6:17 pm BST Apr 27,2008 | Reply

  8. Will the report show all the performance figures, or will it show only the top bad performers?

    Comment by bhanu — February 18, 2009 @ 11:37 am BST Feb 18,2009 | Reply

  9. Bhanu,

    It’s only the top few bad performances, but this is often enough to make a big difference.

    Occasionally you find that none of the top SQL are particularly extreme; in which case it’s worth looking at the entire list to see if they follow some sort of pattern (common table, or common index, say) which allows you to make one structural change to address many statements.

    Comment by Jonathan Lewis — March 21, 2009 @ 11:13 pm BST Mar 21,2009 | Reply

  10. [...] Original article: http://jonathanlewis.wordpress.com/2007/01/07/analysing-statspack-3/ [...]

    Pingback by xpchild » Analysing Statspack (3) — June 27, 2011 @ 7:57 am BST Jun 27,2011 | Reply

