Here’s a very short extract (with camouflage) from a statspack report. It covers a three hour period when a critical overnight batch job was running. The machine has 4 CPUs and 32 disk drives, the database is 9.2.0.4. The problem is that the batch takes about two and a half hours to run, and it needs to run in less than two. So what are you going to fix first?
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 113,702.56 23,019.12
Logical reads: 33,204.73 6,722.31
Block changes: 836.89 169.43
Physical reads: 2,054.91 416.02
Physical writes: 115.78 23.44
User calls: 93.34 18.90
Parses: 11.61 2.35
Hard parses: 0.02 0.00
Sorts: 16.88 3.42
Logons: 0.07 0.01
Executes: 67.18 13.60
Transactions: 4.94
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time 24,092 62.89
db file scattered read 767,594 6,218 16.23
db file sequential read 536,379 2,149 5.61
direct path read 95,788 1,880 4.91
db file parallel write 12,821 1,863 4.86
Avg
Total Wait wait Waits
Event Waits Timeouts Time (s) (ms) /txn
---------------------------- ------------ ---------- ---------- ------ --------
latch free 51,641 51,422 506 10 1.0
direct path write 85,851 0 414 5 1.6
log file sync 10,138 80 270 27 0.2
enqueue 343 55 212 618 0.0
SQL*Net break/reset to clien 56 0 198 3529 0.0
log file parallel write 34,562 34,217 193 6 0.6
buffer busy waits 23,831 0 134 6 0.4
CPU Elapsd
Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
234,996,037 1,087 216,187.7 65.5 18252.8 20715.07 2132610682
Module: JDBC Thin Client
select PRODUCTS.ID, SUPPLIERS.ID ...
from PRODUCTS, SUPPLIERS where ...
CPU Elapsd
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
2,951,745 1 2,951,745.0 13.3 750.49 1306.68 3185433958
Module: JDBC Thin Client
update HISTORY SET FLAG=0 WHERE CLASS = 'X'
2,393,281 1 2,393,281.0 10.8 728.35 2697.64 1765052785
Module: JDBC Thin Client
INSERT INTO SUMMARY(...) select ... from ...
Given that this report covers three hours, or 10,800 seconds, on a machine with 4 CPUs we have 43,200 seconds to play with; although for the two hour target we should be aiming to use less than 28,800 CPU seconds. So the headline 24,000 CPU seconds in the ‘Top 5′ is an obvious target – especially since the ‘SQL order by Gets’ section tells us that most of this CPU came from running one SQL statement about 1,100 times.
So I spent a couple of hours working out the best way to fixed that critical SQL statement (the “suppliers” table turned out to be a five-table view with a nasty filter subquery built in). On the next batch run the 1,100 executions took a total of 450 CPU seconds instead of 18,000 – and the batch completion time didn’t change.
The 1,100 executions were performed by a set of background processes – they took some of the output from the batch and thrashed it ready for the next day’s batch. So when I had asked “is this code in the batch run?” the answer was, quite correctly, “yes” – but the code wasn’t in the critical path of the batch!
The other two statements I have listed were the most expensive on the physical I/O list, although the very simple update actually came 4th in order of elapsed time.
I fixed the insert/select statement by adding an index and then supplying a stack of hints to lock in an execution path that the optimizer was not going to derive from the gathered statistics (which were misleading). Execution time dropped to less than two minutes – which saved us enough time to hit the target, with about 10 minutes to spare.
My first fix on the update was to add the predicate “where flag != 0″. It’s surprising how many times a simple detail like that is missed. Now, instead of updating about 5 million rows per night, we update a few hundred. We still do a very large tablescan to find those rows – but in the near future I will be adding a function-based index that identifies exactly the rows that are waiting to be changed and wastes no space indexing rows that don’t need to be changed. The addition of the predicate reduced the update time by about 10 minutes. The index should drop it down to a few seconds – leaving us well within the target time.
The direct impact of fixing the insert/select was a reduction in (a) db file scattered reads – because I was using suitable indexes, and (b) direct path reads and direct path writes - because I wasn’t doing a massive hash join an more. An indirect benefit was a small reduction in buffer busy waits (of the type that 10g calls “read by other session”)
The direct impact of fixing the update was a reduction in log file parallel writes and log file syncs (because of the 5,000,000 rows that were no longer being updated – worth about about 400MB of redo). With the introduction of the index, we’ll then have another significant drop in db file scattered reads.
There are several more wins to be had from “expensive SQL” – but the remaining heavy hitters are only taking 200 – 400 seconds each, so return on investment is going down, and further investigations may be postponed so that other more pressing work can be undertaken.
The moral of the story: statspack can help you spot the heavy hitters, but sometimes you really need to know which items are in the critical path.
Of course, you may spot items that are off the critical path but are likely to be affecting the performance along the critical path, and it’s always worth picking up things that have no impact on the critical path if the fix is quick, easy and safe and the saving in resources is significant.

I had a similar situation where it was critical that the “overnight batch processing” completed in 1 hour. This was a new system being put in (replacing 4 SQL servers with a single Oracle server) but because of the operational requirements (Bank), if there was an error in the batch, it had to be restartable and STILL complete within in 1 hour. Therefore, it had to complete in 30 minutes. Gulp.
But another operational requirement was that the system had to support 100% volume growth without hardware changes. So my batch window shrinks to 15 minutes!
This is where you start going very pale… 2 million accounts to recalculate in 15 minutes.
The good news was, exactly as you found, to remove as much as possible from the “critical path” and reduce the amount of processing required (only update the records that need updating… only calculate interest charges on accounts where the balance has changed – otherwise use last nights calculation)
The result was, with careful SQL analysis and profiling (for which Jonathan, Tom, Conor and Cary can take credit as I must have read every article written by them all!) we were able to get the critical path processing to around 8 minutes even on the busiest days without compromising data integrity, operational requirements or having to use any undocumented tricks.
Comment by Guy — December 15, 2006 @ 5:09 pm UTC Dec 15,2006 |
Just out of curiousity, was this a 4 processor system? Or 4 core system?
Comment by kevinclosson — December 15, 2006 @ 8:29 pm UTC Dec 15,2006 |
Kevin, I don’t know, I didn’t check – I assumed they meant 4 processor.
Comment by Jonathan Lewis — December 15, 2006 @ 9:39 pm UTC Dec 15,2006 |
Jonathan,
What do you mean “Critical Path” here ? I did not get that.
Regards,
Prem
Comment by Prem — December 21, 2006 @ 1:09 pm UTC Dec 21,2006 |
Prem, as an example of the concept: assume I have 10 tasks to complete in the night, but four of them have to run in the right order. It doesn’t really matter if I can make the other six jobs run in a few minutes – the time to complete the batch is the time it takes to run the four jobs one after the other. That sequence of jobs is the critical path.
Comment by Jonathan Lewis — December 21, 2006 @ 7:35 pm UTC Dec 21,2006 |
You mention using hints.
Is it safe, in these days of 10g, to say that a hint is, in reality, a comand that Oracle will obey; or, is there still a chance that it will be ignored?
Comment by John — May 10, 2007 @ 10:56 am UTC May 10,2007 |
John: a “hint” has always been a command that Oracle should obey.
However: hints are badly documented and very few people know how any of them really work; there are some real bugs that cause them to be lost; and Oracle has changed the syntax (without documentation) for some hints in 10g.
Bottom line – if you know what you are doing, hints will always be obeyed unless you happen to hit a bug.
Comment by Jonathan Lewis — May 10, 2007 @ 3:35 pm UTC May 10,2007 |
[...] parole, ad un articolo di Don che fra l’altro riporta ampi tratti di un’altro articolo (http://jonathanlewis.wordpress.com/statspack-distractions/) fra l’altro, cosa che mi lascia basito, senza inserire un link alla fonte (sembrerebbe che [...]
Pingback by Corretta informazione « Oracle and other — August 27, 2010 @ 8:46 am UTC Aug 27,2010 |
[...] final thought – it’s easy to fix the wrong thing when you’re looking at statspack reports. Although we can see some SQL that is [...]
Pingback by Analysing Statspack 13 « Oracle Scratchpad — January 7, 2013 @ 6:46 pm UTC Jan 7,2013 |