A recent (Jan 2013) post on the OTN database forum reported a performance problem on Oracle 9.2.0.6 (so no AWR), and posted a complete statspack report to one of the public file-sharing sites. It’s been some time since I did a quick run through the highlights of trouble-shooting with statspack, so I’ve picked out a few points from this one to comment on.
As usual, although this specific report is Statspack, the same analysis would have gone into looking at a more modern AWR report, although I will make a couple of comments at the end about the extra material that would have been available by default with the AWR report that would have helped us help the OP.
First, here are the Load Profile, and “Top 5″ parts of the report. To make sense of them, you also need to know that report covered a two-hour interval, and the machine had 12 CPUs; there’s also a little clue in the fact that this is 32-bit Oracle with a 600MB SGA (with the sga_max_size set to 1.6GB) running (alone) on a machine with 12GB of memory.
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 54,437.02 65,579.53
Logical reads: 2,053,336.52 2,473,626.15
Block changes: 383.39 461.86
Physical reads: 43,270.67 52,127.57
Physical writes: 46.63 56.18
User calls: 35.83 43.16
Parses: 25.53 30.76
Hard parses: 0.18 0.21
Sorts: 186.25 224.38
Logons: 0.08 0.10
Executes: 108.76 131.02
Transactions: 0.83
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time 74,717 84.80
latch free 181,788 8,101 9.19
db file scattered read 47,546,136 2,893 3.28
db file sequential read 23,092,208 2,006 2.28
buffer busy waits 1,812,920 193 .22
-------------------------------------------------------------
Working from the top down, the first thing we notice is that 2M logical I/Os per second is quite busy, and that 43, 000 blocks read per second is pretty amazing – especially since the number of user calls and executes isn’t particularly large.
A cross check to the Top 5 then shows that a huge fraction of the work done by the database is CPU. At 2 hours and 12 cores we have 12 * 2 * 3600 = 86,400 CPU seconds available, and we’ve used 74,700 of them – and that’s into Cary Millsap’s “knee of the curve” for queueing problems. Since this isn’t the AWR, of course, we don’t have the OS Stats available, so we can’t see if the rest of the machines CPU is being used up outside the database – but looking at the db file read figures I’ll bet it is. If you have extreme CPU usage in an Oracle system, then latch contention problems are very likely to show up – and that’s our next highest figure – AWR would probably tells us which latch but, again based on the db file read figures, I’ll guess it’s the cache buffers chains latches.
So we look at the db file read figures and do the arithmetic that the AWR would show us to work out that we’re seeing multiblock read times of 0.06 milliseconds, and single block read times of 0.08 milliseconds. These have to be coming from the file system cache to be this fast. (Even then I think we may also seeing a timing problem (perhaps a measurement bug) that’s managing to lose wait time.) First thoughts, then – are that there is some code that is so inefficient that it’s doing far more physical I/O than it should; on the plus side the excess memory (12GB) means we’re being protected from the worst effects of disk I/O, but the caching effect means we’re able to burn up CPU at an astonishing rate to do the work.
So let’s look at the SQL to see where the CPU and disk activity is likely to be going. Here’s extract from the SQL ordered by Gets, and SQL Ordered by Reads. (We don’t have SQL ordered by Time in Statspack, of course):
CPU Elapsd
Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
5,636,117,159 28 201,289,898.5 38.1 ######## ######### 835260576
Module: RAXTRX
INSERT INTO RA_INTERFACE_ERRORS (INTERFACE_LINE_ID, MESSAGE_
TEXT, INVALID_VALUE) SELECT INTERFACE_LINE_ID, :b_err_msg6, '
trx_number='||T.TRX_NUMBER||','||'customer_trx_id='||TL.CUSTOMER
_TRX_ID FROM RA_INTERFACE_LINES_GT IL, RA_CUSTOMER_TRX_LINES TL,
RA_CUSTOMER_TRX T WHERE IL.REQUEST_ID = :b1 AND IL.INTERFAC
511,626,129 20 25,581,306.5 3.5 8155.90 15045.56 205460129
Module: JDBC Thin Client
select TRANSACTION_BRANCH , BRANCH_ADDRESS_1 , BRANCH_ADDRESS_2
, BRANCH_PHONES , BRANCH_FAX , BRANCH_TIN , TRX_NUMBER , TRX_NU
MBER_DISP , TRX_DATE , TRX_DATE_DISP , BILL_TO_CUSTOMER_ID , BIL
L_TO_CUSTOMER_NAME , SHIP_DATE_ACTUAL_DISP , SHIP_TO_NAME , SHIP
_TO_ADDRESS1 , SHIP_TO_ADDRESS2 , SHIP_TO_ADDRESS3 , SHIP_TO_ADD
CPU Elapsd
Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
281,953,754 20 14,097,687.7 90.5 8155.90 15045.56 205460129
Module: JDBC Thin Client
select TRANSACTION_BRANCH , BRANCH_ADDRESS_1 , BRANCH_ADDRESS_2
, BRANCH_PHONES , BRANCH_FAX , BRANCH_TIN , TRX_NUMBER , TRX_NU
MBER_DISP , TRX_DATE , TRX_DATE_DISP , BILL_TO_CUSTOMER_ID , BIL
L_TO_CUSTOMER_NAME , SHIP_DATE_ACTUAL_DISP , SHIP_TO_NAME , SHIP
_TO_ADDRESS1 , SHIP_TO_ADDRESS2 , SHIP_TO_ADDRESS3 , SHIP_TO_ADD
94,281,089 8,550 11,027.0 30.3 2924.20 4550.22 3761220362
Module: JDBC Thin Client
SELECT DELV_ADDR.ADDRESS1 FROM APPS.OE_ORDER_HEADERS_ALL OE_HEAD
ERS , APPS.RA_SITE_USES_ALL DELV_SITE , APPS.RA_ADDRESSES_ALL DE
LV_ADDR WHERE TO_CHAR(ORDER_NUMBER) = :B1 AND DELV_SITE.SITE_USE
_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID AND DELV_ADDR.ADDRESS_ID(+
) = DELV_SITE.ADDRESS_ID
94,011,901 8,544 11,003.3 30.2 2919.47 4527.54 803969757
Module: JDBC Thin Client
SELECT COALESCE(DELV_ADDR.CITY, DELV_ADDR.PROVINCE) FROM APPS.OE
_ORDER_HEADERS_ALL OE_HEADERS , APPS.RA_SITE_USES_ALL DELV_SITE
, APPS.RA_ADDRESSES_ALL DELV_ADDR WHERE TO_CHAR(ORDER_NUMBER) =
:B1 AND DELV_SITE.SITE_USE_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID
AND DELV_ADDR.ADDRESS_ID(+) = DELV_SITE.ADDRESS_ID
93,981,254 8,551 10,990.7 30.2 2919.13 4533.69 4273350236
Module: JDBC Thin Client
SELECT DELV_ADDR.ADDRESS2 FROM APPS.OE_ORDER_HEADERS_ALL OE_HEAD
ERS , APPS.RA_SITE_USES_ALL DELV_SITE , APPS.RA_ADDRESSES_ALL DE
LV_ADDR WHERE TO_CHAR(ORDER_NUMBER) = :B1 AND DELV_SITE.SITE_USE
_ID(+) = OE_HEADERS.DELIVER_TO_ORG_ID AND DELV_ADDR.ADDRESS_ID(+
) = DELV_SITE.ADDRESS_ID
The first statement reported by Gets is responsible for 500 millions buffer gets in the interval (and even then that’s 40% of the total – so there may be a couple more big ones that have fallen out of the library cache in the interim, perhap an hourly report on the previous hour would catch them). Look at the table that the SQL is inserting into – RA_INTERFACT_ERRORS. We might hope that an error table wasn’t going to receive many errors – so perhaps this is just an example of a statement with an insanely bad execution plan; clearly one to look at. Trouble is, though, that unless you set statspack to run at level 6 you don’t capture the execution plans – unlike the AWR which capture them by default. The batch took a couple of days to complete, though, so it would have been possible either to change the snapshot level, or to pull the execution plans from memory.
Looking at the SQL ordered by Reads, we see another example of the top statement being much more expensive than anything following it – but there’s an odd little pattern that caught my eye as I scanned the list. The next three statements look very similar to each other, and have similar statistics, and if you add up their stats the answer is very close to the figures for the first statement. Moreover, the report says that the top statement is responsible for 90% of the work – I think the top statement is actually calling the next three, so we need to look at those statements to see how to make them efficient in order to fix the first statement.
You might note, by the way, the capitals and the bind variables (:B1) in the next three statements – I suspect that there is a PL/SQL function in the first statement that is being called too many times. (20 executions of the first statement, 8,500 of the other three). Perhaps a change in the execution plan of the main query has produced a change in table order which has resulted in some predicate – one with a pl/sql function in it – being tested much more frequently than usual.
A quick cross-reference to the instance activity is in order:
physical reads 311,462,252 43,270.7 52,127.6
prefetched blocks 240,396,452 33,397.7 40,233.7
table fetch by rowid 12,348,226,523 1,715,508.0 2,066,648.8
table fetch continued row 12,047,845,308 1,673,776.8 2,016,375.8
table scan blocks gotten 361,692,971 50,249.1 60,534.4
table scan rows gotten 7,787,418,493 1,081,886.4 1,303,333.6
table scans (long tables) 32,041 4.5 5.4
A huge fraction of the physical reads seem to be for prefetched blocks – that a good indication that they are for tablescans (or index fast full scans), and when we compare rowid access with tablescan access we see that there are a lot of “long” tablescans (maybe the expensive three queries are all doing tablescans of a long table, that would account for 25,500 of those reported), with with 360M blocks scanned. Apart from anything else, all those block scans will be using a big chunk of CPU. The AWR would show us which segments were being scanned, of course, as would statspack if we were running it at level 7.
I’ve included the table fetch by rowid and table fetch continued row. Normally we should be concerned by “continued fetches” – but in this case I’m prepared to overlook them temporarily: it’s possible that after the critical SQL is corrected we may still see that most of the fetches by rowid turn into continued fetches, and at that point it will be worth worrying; but at present, even though they’re probably responsible for a serious fraction of the CPU usage, those continued fetches may be for rows that we’re not going to be fetching once the SQL is doing the right thing.
The last little detail, just for confirmation, comes from the latch activity part of the report; I’ve just selected the busiest 4 latches:
Pct Avg Wait Pct
Get Get Slps Time NoWait NoWait
Latch Requests Miss /Miss (s) Requests Miss
------------------------ -------------- ------ ------ ------ ------------ ------
cache buffers chains ############## 0.0 0.0 0 577,860,644 0.1
cache buffers lru chain 157,333 0.3 0.0 0 687,799,033 0.8
multiblock read objects 114,602,621 0.6 0.0 12 0
simulator hash latch 811,998,194 0.0 0.0 2 0
Get Spin &
Latch Name Requests Misses Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
cache buffers chains ############## 12,262,452 180,671 0/0/0/0/0
multiblock read objects 114,602,621 719,477 115 719369/101/7/0/0
simulator hash latch 811,998,194 3,889 19 3872/15/2/0/
As you can see, and as we had guessed, they’re all about the buffer cache – doing multiblock reads and loading/dumping blocks from the cache; and almost all of the sleeps are on the cache buffers chains.
Next Steps
As a quick and dirty which might help a bit – put more of the sga_max_size into the buffer cache – it doesn’t happen automatically in 9.2 – and that may reduce the interaction between the O/S and Oracle and save a bit of latching and CPU.
Address the three statements doing all the physical I/O, and check the path for the statement that cause them to be executed. Fix the statement that has done 40% of the buffer gets. In all cases the problem may simply be an execution plan change – but perhaps the machine has been hiding an inherent performance problem and on this occasion a small change in data (a few more errors, perhaps) resulted in the excess work generating a massive increase in workload and contention.
We do know, however, that there is a lot of work done on buffer gets that didn’t show up in this report – so run off the report for the previous hour and see if we can find a couple more big hitters. Once we’ve sorted out the worst SQL, we might then want to see if the volume of “continued rows” is high – and find out why they’re appearing. It’s possible that it’s due to a table with more than 254 columns, of course, or a long column, but we’ll worry about that only when we need to.
One final thought – it’s easy to fix the wrong thing when you’re looking at statspack reports. Although we can see some SQL that is “obviously” overloading the machine, maybe the business critical task that has to complete promptly is simply hidden because it does (relatively) little work. Unless we can fix the big problems extremely quickly, we ought to take a little time to check with the business and the developers that what we’re looking at really is the code that’s running the critical business task.
Longer term, of course, upgrade to 9.2.0.8 at least and then start running Statspack at level 7 for better information. Then upgrade some more.