Oracle Scratchpad

Statspack Distractions

Here’s a very short extract (with camouflage) from a statspack report. It covers a three-hour period when a critical overnight batch job was running. The machine has 4 CPUs and 32 disk drives, and the database is 9.2.0.4. The problem is that the batch takes about two and a half hours to run, and it needs to run in less than two. So what are you going to fix first?


Load Profile 
~~~~~~~~~~~~                            Per Second       Per Transaction 
                                   ---------------       --------------- 
                  Redo size:            113,702.56             23,019.12 
              Logical reads:             33,204.73              6,722.31 
              Block changes:                836.89                169.43 
             Physical reads:              2,054.91                416.02 
            Physical writes:                115.78                 23.44 
                 User calls:                 93.34                 18.90 
                     Parses:                 11.61                  2.35 
                Hard parses:                  0.02                  0.00 
                      Sorts:                 16.88                  3.42 
                     Logons:                  0.07                  0.01 
                   Executes:                 67.18                 13.60 
               Transactions:                  4.94          

Top 5 Timed Events 
~~~~~~~~~~~~~~~~~~                                                     % Total 
Event                                               Waits    Time (s) Ela Time 
-------------------------------------------- ------------ ----------- -------- 
CPU time                                                       24,092    62.89 
db file scattered read                            767,594       6,218    16.23 
db file sequential read                           536,379       2,149     5.61 
direct path read                                   95,788       1,880     4.91 
db file parallel write                             12,821       1,863     4.86  

                                                                   Avg 
                                                     Total Wait   wait    Waits 
Event                               Waits   Timeouts   Time (s)   (ms)     /txn 
---------------------------- ------------ ---------- ---------- ------ -------- 
latch free                         51,641     51,422        506     10      1.0 
direct path write                  85,851          0        414      5      1.6 
log file sync                      10,138         80        270     27      0.2 
enqueue                               343         55        212    618      0.0 
SQL*Net break/reset to clien           56          0        198   3529      0.0 
log file parallel write            34,562     34,217        193      6      0.6 
buffer busy waits                  23,831          0        134      6      0.4          

                                                     CPU      Elapsd 
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value 
--------------- ------------ -------------- ------ -------- --------- ---------- 
    234,996,037        1,087      216,187.7   65.5  18252.8  20715.07 2132610682 
Module: JDBC Thin Client 
select  PRODUCTS.ID,    SUPPLIERS.ID ... 
from    PRODUCTS, SUPPLIERS where ...       

                                                     CPU      Elapsd 
 Physical Reads  Executions  Reads per Exec %Total Time (s)  Time (s) Hash Value 
--------------- ------------ -------------- ------ -------- --------- ---------- 
      2,951,745            1    2,951,745.0   13.3   750.49   1306.68 3185433958 
Module: JDBC Thin Client 
update HISTORY SET FLAG=0 WHERE CLASS = 'X'          

      2,393,281            1    2,393,281.0   10.8   728.35   2697.64 1765052785 
Module: JDBC Thin Client 
INSERT INTO SUMMARY(...) select ... from ...          

Given that this report covers three hours, or 10,800 seconds, on a machine with 4 CPUs, we have 43,200 CPU seconds to play with; although for the two-hour target we should be aiming to use less than 28,800 CPU seconds. So the headline 24,000 CPU seconds in the ‘Top 5’ is an obvious target – especially since the ‘SQL ordered by Gets’ section tells us that most of this CPU came from running one SQL statement about 1,100 times.
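
For anyone checking the arithmetic (the CPU figure comes straight from the ‘Top 5’ above):

    3 hours * 3,600 seconds/hour * 4 CPUs = 43,200 CPU seconds available in the window
    2 hours * 3,600 seconds/hour * 4 CPUs = 28,800 CPU seconds in the target window
    Reported CPU time: 24,092 seconds, i.e. about 56% of available capacity
                       and about 84% of the two-hour budget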

So I spent a couple of hours working out the best way to fix that critical SQL statement (the “suppliers” table turned out to be a five-table view with a nasty filter subquery built in). On the next batch run the 1,100 executions took a total of 450 CPU seconds instead of 18,000 – and the batch completion time didn’t change.

The 1,100 executions were performed by a set of background processes – they took some of the output from the batch and thrashed it ready for the next day’s batch. So when I had asked “is this code in the batch run?” the answer was, quite correctly, “yes” – but the code wasn’t in the critical path of the batch!

The other two statements I have listed were the most expensive on the physical I/O list, although the very simple update actually came 4th in order of elapsed time.

I fixed the insert/select statement by adding an index and then supplying a stack of hints to lock in an execution path that the optimizer was not going to derive from the gathered statistics (which were misleading). Execution time dropped to less than two minutes – which saved us enough time to hit the target, with about 10 minutes to spare.
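
For illustration only, since the original statement isn’t reproduced in full here, a fix of that general shape might look like the sketch below; the source tables, columns, and index names are all hypothetical, not the real ones:

    -- A supporting index for the driving predicate (hypothetical):
    create index detail_batch_ix on DETAIL(batch_date, lookup_id);

    -- Then hints to pin the join order and access path, so the optimizer
    -- cannot fall back to the plan suggested by the misleading statistics:
    insert into SUMMARY (lookup_id, total_qty)
    select /*+ leading(d) use_nl(l) index(l lookup_pk) */
           l.id, sum(d.qty)
    from   DETAIL d, LOOKUP l
    where  d.batch_date = trunc(sysdate)
    and    l.id         = d.lookup_id
    group by
           l.id;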

My first fix on the update was to add the predicate “where flag != 0”. It’s surprising how many times a simple detail like that is missed. Now, instead of updating about 5 million rows per night, we update a few hundred. We still do a very large tablescan to find those rows – but in the near future I will be adding a function-based index that identifies exactly the rows that are waiting to be changed and wastes no space indexing rows that don’t need to be changed. The addition of the predicate reduced the update time by about 10 minutes. The index should drop it down to a few seconds – leaving us well within the target time.
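
As a sketch of both changes (the index name is hypothetical; the CASE expression is a common way of building an index that holds only the rows still waiting to be processed, since rows where the expression evaluates to NULL simply don’t appear in the index):

    -- The immediate fix: don't rewrite rows that are already at zero.
    update HISTORY set FLAG = 0
    where  CLASS = 'X'
    and    FLAG != 0;

    -- The planned fix: a function-based index holding only the "pending" rows.
    -- (9i may also require the query rewrite privilege to create it.)
    create index hist_pending_ix on HISTORY (
        case when CLASS = 'X' and FLAG != 0 then 'Y' end
    );

    -- The update then uses the same expression so it can use the index;
    -- setting FLAG to 0 makes the expression NULL, which removes the row
    -- from the index automatically.
    update HISTORY set FLAG = 0
    where  case when CLASS = 'X' and FLAG != 0 then 'Y' end = 'Y';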

The direct impact of fixing the insert/select was a reduction in (a) db file scattered reads – because I was now using suitable indexes – and (b) direct path reads and direct path writes – because I wasn’t doing a massive hash join any more. An indirect benefit was a small reduction in buffer busy waits (of the type that 10g calls “read by other session”).

The direct impact of fixing the update was a reduction in log file parallel writes and log file syncs (because of the 5,000,000 rows that were no longer being updated – worth about 400MB of redo). With the introduction of the index, we’ll then have another significant drop in db file scattered reads.
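
For anyone checking the implied per-row figure (the 80 bytes is inferred from the two numbers quoted, not a measured value):

    5,000,000 rows * roughly 80 bytes of redo per single-row update
        = roughly 400,000,000 bytes, i.e. about 400MB of redo per night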

There are several more wins to be had from “expensive SQL” – but the remaining heavy hitters are only taking 200 – 400 seconds each, so the return on investment is going down, and further investigations may be postponed so that other, more pressing, work can be undertaken.

The moral of the story: statspack can help you spot the heavy hitters, but sometimes you really need to know which items are in the critical path.

Of course, you may spot items that are off the critical path but are likely to be affecting the performance along the critical path, and it’s always worth picking up things that have no impact on the critical path if the fix is quick, easy and safe and the saving in resources is significant.

9 Comments

  1. I had a similar situation where it was critical that the “overnight batch processing” completed in 1 hour. This was a new system being put in (replacing 4 SQL servers with a single Oracle server) but, because of the operational requirements (a bank), if there was an error in the batch it had to be restartable and STILL complete within 1 hour. Therefore, it had to complete in 30 minutes. Gulp.

    But another operational requirement was that the system had to support 100% volume growth without hardware changes. So my batch window shrinks to 15 minutes!

    This is where you start going very pale… 2 million accounts to recalculate in 15 minutes.

    The good news was that the answer was exactly as you found: remove as much as possible from the “critical path” and reduce the amount of processing required (only update the records that need updating; only calculate interest charges on accounts where the balance has changed – otherwise use last night’s calculation).

    The result was that, with careful SQL analysis and profiling (for which Jonathan, Tom, Conor and Cary can take credit, as I must have read every article written by them all!), we were able to get the critical-path processing to around 8 minutes even on the busiest days, without compromising data integrity or operational requirements, and without having to use any undocumented tricks.

    Comment by Guy — December 15, 2006 @ 5:09 pm BST

  2. Just out of curiosity, was this a 4-processor system? Or a 4-core system?

    Comment by kevinclosson — December 15, 2006 @ 8:29 pm BST

  3. Kevin, I don’t know, I didn’t check – I assumed they meant 4 processor.

    Comment by Jonathan Lewis — December 15, 2006 @ 9:39 pm BST

  4. Jonathan,

    What do you mean by “critical path” here? I did not get that.

    Regards,
    Prem

    Comment by Prem — December 21, 2006 @ 1:09 pm BST

  5. Prem, as an example of the concept: assume I have 10 tasks to complete in the night, but four of them have to run in the right order. It doesn’t really matter if I can make the other six jobs run in a few minutes – the time to complete the batch is the time it takes to run the four jobs one after the other. That sequence of jobs is the critical path.
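
    To put some made-up numbers on that:

        Jobs A -> B -> C -> D must run in sequence: 30 + 45 + 35 + 20 = 130 minutes
        Six other jobs can run alongside them:      longest takes 25 minutes
        Batch elapsed time = 130 minutes; tuning the six parallel jobs
        changes nothing until one of A, B, C or D gets faster.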

    Comment by Jonathan Lewis — December 21, 2006 @ 7:35 pm BST

  6. You mention using hints.

    Is it safe, in these days of 10g, to say that a hint is, in reality, a command that Oracle will obey; or is there still a chance that it will be ignored?

    Comment by John — May 10, 2007 @ 10:56 am BST

  7. John: a “hint” has always been a command that Oracle should obey.

    However: hints are badly documented and very few people know how any of them really work; there are some real bugs that cause them to be lost; and Oracle has changed the syntax (without documentation) for some hints in 10g.

    Bottom line – if you know what you are doing, hints will always be obeyed unless you happen to hit a bug.

    Comment by Jonathan Lewis — May 10, 2007 @ 3:35 pm BST

  8. [...] words, to an article by Don which, among other things, reproduces large extracts of another article (http://jonathanlewis.wordpress.com/statspack-distractions/) without including a link to the source, something that leaves me astonished (it would seem that [...]

    Pingback by Corretta informazione « Oracle and other — August 27, 2010 @ 8:46 am BST

  9. [...] final thought – it’s easy to fix the wrong thing when you’re looking at statspack reports. Although we can see some SQL that is [...]

    Pingback by Analysing Statspack 13 « Oracle Scratchpad — January 7, 2013 @ 6:46 pm BST

