Oracle Scratchpad

April 19, 2007

Analysing Statspack (5)

Filed under: Statspack,Troubleshooting — Jonathan Lewis @ 8:43 pm BST Apr 19,2007

[Further Reading on Statspack]

A few days ago someone emailed me a Statspack report (10g format) because one of their developers was complaining that “the system” was slow, and they thought this was a little odd because the Statspack report for the period didn’t seem to show any indications that “the system” was suffering from any type of overload.

If you look at the headline figures, and throw in the fact that the machine has 4 CPUs, the following 30 minute snapshot looks like the system is pretty idle.

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:             11,495.26             11,857.65
              Logical reads:             55,989.88             57,754.94
              Block changes:                 54.24                 55.95
             Physical reads:                 23.23                 23.96
            Physical writes:                  3.49                  3.60
                 User calls:                 27.22                 28.08
                     Parses:                  8.98                  9.27
                Hard parses:                  0.22                  0.22
                      Sorts:                  2.16                  2.22
                     Logons:                  0.01                  0.01
                   Executes:                 23.50                 24.24
               Transactions:                  0.97                 

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                       806          52.8
db file sequential read                         31,538         312     10   20.4
read by other session                            6,909         279     40   18.3
db file scattered read                             950          38     40    2.5
log file parallel write                          1,689          19     11    1.2
          -------------------------------------------------------------

To paraphrase Cary Millsap: you can’t infer details from a summary. So what you can do in cases like this is go to the developer and say:

“You think the system is slow – give me a demonstration”

and trace what they are doing.
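A minimal sketch of that tracing step, assuming the developer can reproduce the slowness from a dedicated session (the schema name, sid and serial# below are placeholders, not values from the report):

```sql
rem     Find the developer's session - APP_USER is a placeholder

select  sid, serial#, username, module
from    v$session
where   username = 'APP_USER';

rem     Enable extended SQL trace (waits and binds) for that session

execute dbms_monitor.session_trace_enable(session_id => 123, serial_num => 4567, waits => true, binds => true)

rem     ... let the developer demonstrate the slow task, then:

execute dbms_monitor.session_trace_disable(session_id => 123, serial_num => 4567)
```

The resulting trace file can then be examined raw, or summarised with tkprof, to see exactly where that one session's time went.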

On the other hand, there are cases where a single detail has such a massive impact that it shows through the summary – to the extent that a careful look at the summary gives you a pretty good chance of guessing the critical detail.

Take a look at the following. Since the largest time component was CPU, the first bit is from the “Top SQL by CPU” section; the second bit is from the “Instance Activity” section:

    CPU                   CPU per            Elapsd                     Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
    666.37            1     666.37   82.2     673.45      91,309,041 3893695821
Module: SQL*Plus
BEGIN delete_something; END;               

    163.43        2,147       0.08   20.2     167.92      22,829,998 3889042855
Module: SQL*Plus
DELETE FROM target_table WHERE identification_code =:B1 AND
target_type='A' AND ROWNUM = 1       

    163.18        2,147       0.08   20.1     168.45      22,829,612 2403581060
Module: SQL*Plus
DELETE FROM target_table WHERE identification_code =:B1 AND
target_type='B' AND ROWNUM = 1       

    163.00        2,147       0.08   20.1     168.06      22,817,444 2353247669
Module: SQL*Plus
DELETE FROM target_table WHERE identification_code =:B1 AND
target_type='C' AND ROWNUM = 1       

    161.93        2,147       0.08   20.0     167.77      22,831,939 1017873455
Module: SQL*Plus
DELETE FROM target_table WHERE identification_code =:B1 AND
target_type='D' AND ROWNUM = 1       

     78.37        3,358       0.02    9.7      97.76      18,826,838 1226295128
Module: SQL*Plus
DELETE FROM another_table WHERE identification_code =:B1 AND ROWNUM = 1       

Statistic                                      Total     per Second    per Trans
--------------------------------- ------------------ -------------- ------------
physical reads                                46,369           23.2         24.0
...
table scan blocks gotten                 110,609,808       55,415.7     57,162.7
table scan rows gotten                 8,630,217,805    4,323,756.4  4,460,060.9
table scans (long tables)                          1            0.0          0.0
table scans (short tables)                    13,368            6.7          6.9

Two things stand out. First, the 4 similar delete statements are probably called (one after the other, possibly driven by a cursor for-loop) by the delete_something procedure. The numbers add up very convincingly: the four CPU times sum to 651.5 of the procedure’s 666.4 seconds, and the four elapsed times to 672.2 of its 673.5.
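If that guess is right, the procedure probably looks something like the following. This is a purely hypothetical reconstruction – the procedure and table names come from the report, but the driving query and everything else is invented:

```sql
create or replace procedure delete_something as
begin
        --
        -- One driving loop; the loop variable becomes the :B1 bind
        -- seen in the report. The query "source_of_codes" is invented.
        --
        for r in (select identification_code from source_of_codes) loop

                delete from target_table
                where   identification_code = r.identification_code
                and     target_type = 'A'
                and     rownum = 1;

                delete from target_table
                where   identification_code = r.identification_code
                and     target_type = 'B'
                and     rownum = 1;

                -- ... and the same again for target_type 'C' and 'D',
                -- plus the delete against another_table.

        end loop;
end;
/
```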

Then look at the tablescans and physical reads: 13,368 (short) tablescans, and 110M blocks scanned.  Since we have just one long tablescan and a total of only 46,000 physical reads it’s fairly safe to assume that most of those “table scan blocks gotten” are for the short tablescans – which means about 8,000 blocks per tablescan. (In fact, the “SQL by Buffer Gets” section confirmed this estimate with 10,600 gets per execution.) We could do similar arithmetic to decide that each short tablescan must be for around 640,000 rows.
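The same figures can be checked on a live system without waiting for the next snapshot – the Instance Activity numbers come straight from v$sysstat, so two queries a few minutes apart give you the deltas:

```sql
select  name, value
from    v$sysstat
where   name in (
                'table scan blocks gotten',
                'table scan rows gotten',
                'table scans (long tables)',
                'table scans (short tables)'
        )
;
```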

So we have a total of 13,368 “short” – but painful and expensive – tablescans, and the delete_something procedure is responsible for at least 8,588 of them (4 * 2,147). Given that each execution of a delete statement deletes at most one row (that’s the effect of the “ROWNUM = 1” predicate), there’s a lot of work going on for very little return.

It’s a fair bet that when the developer says: “the system is slow”, he (or she) really means “this procedure is slow” – after all, it seems to be about the only thing that’s happening on the system. 

It looks like there’s a structural problem with target_table. Possibly something as simple as a missing index on (identification_code, target_type), or perhaps the index exists but the statistics are misleading, or maybe there’s a type mismatch problem with a bind variable that is causing an implicit coercion that makes the index irrelevant.
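The checks are straightforward. A hypothetical sketch – the index name is an assumption, and the column names are taken from the predicates in the report:

```sql
rem     Is there an index starting with the predicate columns?

select  index_name, column_name, column_position
from    user_ind_columns
where   table_name = 'TARGET_TABLE'
order by
        index_name, column_position
;

rem     Look for an implicit coercion in the plan - a filter
rem     like TO_NUMBER("IDENTIFICATION_CODE") = :B1 would mean
rem     the index on that column can't be used. (Bear in mind
rem     that explain plan assumes binds are varchar2, so the
rem     real test is the plan from the production cursor.)

explain plan for
delete from target_table
where   identification_code = :b1
and     target_type = 'A'
and     rownum = 1;

select * from table(dbms_xplan.display);

rem     If the index simply doesn't exist:

create index tt_code_type_idx on target_table(identification_code, target_type);
```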

Whatever it is, this summary seems to have highlighted a detail that is worth fixing – and could be fixed very quickly and easily.

Warning:

Although this cursory inspection of a Statspack report shows an extremely obvious resource hog that almost certainly ought to be addressed, there were a couple of other bits of work that seemed unreasonably expensive (the last statement in the SQL extract, for example). So it is possible that my analysis is correct – but irrelevant to the specific developer making the original complaint.

On the other hand, it didn’t take long to spot this problem, and it’s pretty easy to go back to the developer with an extract from the Statspack report and ask if this is the bit of code that is performing badly.

Footnote: Any advice about reading Statspack reports is almost always relevant when reading AWR reports (and vice versa).

[Further Reading on Statspack]
