A recent thread on the Oracle newsgroup comp.databases.oracle.server started with the following tkprof summary from a trace file:
call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 110673 12.42 14.18 0 90 98 0 Execute 111732 39.95 51.05 18217 68627 22700 7215 Fetch 315496 439.51 498.31 83053 4060984 0 1433176 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 537901 491.88 563.55 101270 4129701 22798 1440391 Event waited on Times Max. Wait Tota Waited ---------------------------------------- Waited ---------- ------------ SQL*Net message from client 343929 54.32 87346.16 SQL*Net message to client 343905 0.00 0.75 SQL*Net more data to client 14559 0.01 1.14 db file sequential read 3491 0.44 32.84 log file sync 2159 0.98 23.78 latch free 1882 0.50 6.36 db file scattered read 6183 0.24 29.14 SQL*Net more data from client 52 0.00 0.00 buffer busy waits 2 0.01 0.01
The question that came with the output was as follows:
Some users complain about a Powercenter performance, I traced the sessions for an hour, please have a look at last section of the the output of tkprof below. It is a 32 processors Sun running 188.8.131.52, not dedicated to this application. Running the trace the box was in avg. 30% idle. I checked the plans and they seem to be OK, I know that these wait events *may* be idle. Do I have problems with the network in your opinion?
Now, it’s easy to be deceived by a summary from an entire hour – you get the usual problem of peaks being flattened out by averaging – and this example is even tougher because (according to the comments in the original post) it is proably a summary of an unknown number of sessions . But let’s pretend for the moment that this summary is representative of the entire period. What can we see:
Let’s take a first guess – since there are 87,000 seconds recorded for ‘SQL*Net message from client’ and only 3,600 seconds in an hour, let’s guess that this is the aggregate across about 24 sessions (87,000 / 3,600). And since the longest wait is only 54 seconds, these are busy little operators.
Most of the time is spent on CPU (492 seconds out of 563) – so, as far as the server is concerned, we should be looking for big wins on CPU somewhere. But that’s only 440 CPU seconds in one elapsed hour on a machine with 32 CPUs – which means about 4% of the total available CPU. (And the machine was only 30% idle – so if this is all the sessions then something else must be using the other 64%)
The number of blocks read from disk is reported at 101,270 – but the difference between CPU time and elapsed time is only 71 seconds, which makes the I/O speed look reasonably healthy.
The breakdown of disk read waits shows 3,491 single block reads in 32.84 seconds – a bit less than 1/100 secoond per read, which is perfectly reasonable – but the maximum wait time of 0.44 seconds is a little dramatic.
The breakdown of disk read waits also shows 6,183 multiblock reads in 29.14 seconds – a bit less than 5 milliseconds per read, and if you do the arithmetic you find that the average read size is 15.8 blocks: so that’s a pretty good time to read what is probably 128KB off disk – in other words there’s a big caching benefit coming in there somewhere. Again, the max wait time of 0.24 seconds is interesting.
We see 400 seconds of CPU but that’s quite a lot when you only have to deal with 4,000,000 logical I/Os – the CPU isn’t going into block acquisition, it’s going into row-processing – maybe there are some nasty pl/sql functions being called for every row in a tablescan.
Other time reported: 2,159 log file syncs for 23.78 seconds – if you pretend that each log file sync is mostly waiting for log writer to write with no extra time loss that just over 1/100 second per write, which is not unreasonable (in the absence of a hardware write cache). (At an estimated 24 users for an hour, that’s about two commits per minute per user – pretend one commit is one transaction, and that’s reasonably consistent with the longest SQL*Net wait time of 54 seconds if our users are all busy little bunnies). But again there is an anomaly implied by the max wait time of 0.98 seconds.
Just 6.36 seconds on 1,882 latch free waits – 3.4 milliseconds average – not really a nasty looking threat – and not entirely surprising given the 537,000 calls made to the database over the period. It might be useful to see some statistics about the latches to see if we had a latch with some extreme values for gets, misses and (particularly) spins. Yet again, though, there is an unusually high max wait time of 0.50 seconds.
The only other detail of note, possibly, is in the fetch statistics: 300,000 fetches, 1.4M rows fetched. There is some array fetching going on, but the average fetch size is quite small – maybe there’s a lot of single-row processing bringing the average down.
In agreement with the array fetching, we can see some waits for “SQL*Net more data to client”; this usually appears when there is more data to return in a fetch than will fit into a single packet (as set by the SDU – session data unit – in the SQL*Net configuration files). This usually happens with LOB, large row, or array, fetches – so perhaps we have a small number of large array fetches, and a lot of single row processing.
This brings us to the most important point in our guess work. We have (we think) about 24 users, doing one or two transactions per minute, and it takes 110,000 statement executions to do that work. We are seeing about 50 statements per transaction moving back and fore across a network. What we can see from this summary is how much time disappears between all those calls but not how it is distributed; and even if we go back to the trace file and look at every step of the file we still don’t know if that time is due to slow network traffic, or due to some strange behaviour out at the client machine.
So where is the problem ?
For the server, it’s probably somewhere in the SQL. Most of the time is spent in CPU – far more than required for the simple, efficient, handling of 4,000,000 blocks.
We managed to get through a lot of multiblock reads incredibly quickly, and one of the possible side-effects of multiblock reads is high CPU consumption to examine all that data. We also suspect that there might be some PL/SQL activity in the row-handling.
You might think that there’s a problem with the huge number of parse calls – but look at the time they record. (To be that quick most of them are probably coming through the session cursor cache, which makes them just as efficient as calls to held cursors and makes the count a totally spurious indicator of costs).
We have to take note of the extremes that do appear for wait times, though. It seems likely that the database is subject to intermittent bursts of interference from other activity on the same machine. Given the relatively small amount of work reported in the traec file, the extreme peaks of in the disk times suggest problems due to sharing a disk cabinet, and the peak in the latch wait time suggest intermittent CPU starvation exaggerating the effect of sleeps.
So is there a network problem ?
The summary can’t tell you; in fact the entire trace file couldn’t really tell you. But there is a tiny hint that there might be a network-related problem on the server side in the “SQL*Net more data to client” waits. We are trying to throw several data packets at the network adapter in response to a single fetch call, and there is a tiny lag. This might mean a problem with the adapter being overloaded – but it might simply be a problem of CPU starvation.
Given the way this application is probably doing something that is intermittently CPU aggressive, I’d guess on the latter at this point – but we have to remember that the machine is shared with other applications that may be doing much nastier things to the CPU and network.
We might decide that the client has an issue that needs to be addressed – the number of statements executed per transaction is a little high, and that might lead to a lot of CPU time ‘disappearing’ inside the client, but might could make the client seem very slow if the network really is busy – or the clients are at the opposite ends of the earth to the server.
This trace file summary really isn’t worth more than a ten minute inspection – despite the fact that I’ve just spent two hours typing up the things that crossed my mind in that ten minutes – since there is so much information we don’t have. I’ve only used it as a demonstration of the types of things you have to think of when presented with the initial information.
Bear in mind that we don’t know how many sessions have been collapsed into this trace file; and we don’t know if that was all the sessions active for the hour reported; moreover we don’t know if there were other sessions (i.e. not Powercenter) running against the same instance.
I also don’t know if Powercenter is client/server, or three-tier – and my comments are based on it being client/server. If it’s actually three tier, then the problem may lie somewhere between the application server and the ultimate end-user and the timing of incoming requests may be based on something that is completely invisible to Oracle.
Finally, remember the first words in the original posters question: “Some users complain…”; possibly some users target much more data and cause many more round-trips in the application code than others. All we have is a summary – we’ve lost sight of the extreme cases; and we don’t even have a quantitive statement of the problem. Are the users complaining about a five second response time, or a five minute response time, and what would be a “reasonable” response time.
Summaries like this are good for generating a few ideas very quickly – but once the ideas are in, don’t jump to conclusions; look for supporting evidence.