Oracle Scratchpad

April 27, 2017

Quiz Night

Filed under: Oracle,Performance,Troubleshooting — Jonathan Lewis @ 5:29 pm BST Apr 27,2017

If this is the closing section of the tkprof output from the trace file of a single end-user session that has a performance problem, what’s the most obvious deduction you can make about the probable cause of the problem, and what sort of action would you take next?


OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse      296      0.06       0.10          0        651          0           0
Execute    440      1.39       2.24          7       4664          0         146
Fetch      345     29.38      48.27          0    1709081          0         346
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     1081     30.84      50.63          7    1714396          0         492

Misses in library cache during parse: 5
Misses in library cache during execute: 7

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                     497        0.00          0.00
  SQL*Net message from client                   496       27.03         50.35
  direct path read                                1        0.00          0.00


OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse   128199      6.94      11.46          0       2740          0           0
Execute 2274845    371.25     605.60         30   10031162          0       68200
Fetch   2225314     10.94      18.17          5     879297          0      577755
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total   4628358    389.14     635.23         35   10913199          0      645955

Misses in library cache during parse: 9701
Misses in library cache during execute: 134

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  direct path read                                1        0.00          0.00
  latch: shared pool                              3        0.00          0.00

12666  user  SQL statements in session.
  495  internal SQL statements in session.
13161  SQL statements in session.
********************************************************************************
Trace file: {concealed file name}.trc
Trace file compatibility: 11.1.0.7
Sort options: default

       1  session in tracefile.
   12666  user  SQL statements in trace file.
     495  internal SQL statements in trace file.
   13161  SQL statements in trace file.
    5266  unique SQL statements in trace file.
 39046623  lines in trace file.
     742  elapsed seconds in trace file.


There’s no absolutely right answer to the last question, by the way – there are a couple of variations in approach that are likely to get to the same result in the same time, and some of those variations might have made looking at the tail-end data the second or third step in the sequence.

Suggested Answer

One of the skills of using your time effectively when troubleshooting is the way you balance the time you spend noting the little details that might be relevant against the time you then spend following up each detail to check for relevance and corroborating evidence. The amateur might throw themselves into in-depth research on the first detail they notice; or, having started with a short list of observations, simply spend too much time on the first before moving on to the second.

An exercise such as looking at this tail-end of a tkprof file and talking about what you spot and what you might do is just a little exercise in how to pace yourself as you tackle a problem. So, from my perspective, here are a couple of obvious starting points:

  • I said it was an end-user session complaining about performance – the last line of the file tells us that the elapsed time was 742 seconds, and the summary of recursive statements tells us there were 2.2 million executions. How can ANYTHING an end-user wants to do “quickly” require 3,000 executions per second for 12 minutes? I can ask that question because I know that “recursive” executions aren’t necessarily “sys-recursive”; SQL statements executed inside a PL/SQL block are also recursive. On top of that I can see, anyway, that there are 12,666 USER statements in the file – the user (or the user’s code) is doing something it almost certainly shouldn’t be doing. Those 2.2M executions are responsible for 10M buffer visits – which suggests a lot of single-row processing (three steps down an index, one to the table).
  • Associated with the 13,161 statements in the trace file there are 9,701 misses in the library cache during parse – that means “hard” parsing, and probably means the user is getting through about 800 “new” statements per minute – but there are only 5,266 unique statements, so part of the problem is that some recently used statements (or child cursors, at least) are being flushed from the library cache: again, that suggests they’re doing something wrong. (The sketch after this list shows the sort of coding pattern that produces exactly this combination of symptoms.)
  • As one of the commentators pointed out – a detail that I hadn’t noted initially – the number of “current” buffers is zero. This isn’t a mini-batch updating the database; it’s just a report (done badly).
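
Just to make that concrete – and this is purely an illustrative sketch with invented table and column names, not the code from the actual system – a loop that produces this combination of symptoms looks something like:

declare
        v_status        varchar2(10);
begin
        for r in (select order_id from orders) loop
                --
                -- Concatenating the id into the statement text gives Oracle
                -- a brand new statement for every row: one (often hard)
                -- parse, one recursive execution, and a handful of buffer
                -- gets per iteration.
                --
                execute immediate
                        'select status from orders where order_id = ' || r.order_id
                        into v_status;
                -- ... a little processing per row ...
        end loop;
end;
/

Run something like that against a few hundred thousand rows and you get millions of recursive executions, thousands of hard parses from the ever-changing literals, and cursors constantly being flushed from the library cache to make room for the new arrivals.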

What would I do next? If it wasn’t already immediately obvious I’d look for the statements that were responsible for the very large numbers of executions; a call to tkprof with sort=execnt as a parameter would push the high execution counts to the top of the file. (The worst one had over 900,000 executions, the next 600,000.) Then, having accounted for only 70% of the total 2.2M so far, I’d do a grep, sed, sort with uniq -c to find the statements whose texts had been generated by concatenation instead of using bind variables (there were 3 of them, responsible for a few thousand variants each). Then I’d go and find the owner of the code and sit down with them to work out how it should be re-engineered.
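
The grep/sed/sort pipeline works purely from the trace file; as a cross-check on a live system – a sketch, and only useful while the cursors are still in the shared pool – you could let Oracle do the “same text apart from the literals” matching for you, since statements that differ only in their literal values share a force_matching_signature:

rem     Statements that would share a cursor under cursor_sharing = force
rem     carry the same force_matching_signature, so a signature with many
rem     variants flags a statement built by concatenating literals.

select
        force_matching_signature,
        count(*)                variants,
        sum(executions)         total_execs,
        min(sql_id)             sample_sql_id
from    v$sql
where   force_matching_signature != 0
group by
        force_matching_signature
having  count(*) > 50
order by
        count(*) desc
;

The sample_sql_id gives you one representative text per family to take back to the owner of the code.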

 

15 Comments »

  1. I’ll hazard a guess. I saw recursive statements being where the process was spending most of its time, so I was initially thinking “triggers”… But, seeing that there’s zero in the “current” column, I would think that would rule DML out? So I might guess PL/SQL functions used in a select? (Totally taking a shot in the dark here.) Next steps would be to tear apart the functions and see where the bottlenecks are, and optionally see if the query could be written without functions.

    Comment by Kaley — April 27, 2017 @ 7:06 pm BST Apr 27,2017 | Reply

  2. My guess is a delete command with unindexed FKs, so it has to perform recursive selects against the child tables to validate integrity.

    Comment by Bertran Saragusti — April 27, 2017 @ 7:34 pm BST Apr 27,2017 | Reply

  3. Hi,
    1) SQL*Net message from client is very high –> the client is not processing data fast enough
    Solution:
    ????
    2) A lot of recursive calls –> the number of hard parses is 9,701 out of 128,199 (actually not so bad)
    Reason: a huge number of partitions might explain the huge number of recursive calls
    Solution:
    – gather stats (dictionary)

    I know the answer is wrong, but I did my best

    Thanks,
    Andjelko Miovcic

    Comment by andjelko miovcic — April 27, 2017 @ 11:03 pm BST Apr 27,2017 | Reply

    • Andjelko,

      Definitely worth noting the hard parse count is high in this context.

      It would be interesting to create a table with a huge number of partitions (100,000 say) and see if, starting from a cold library cache, we could get a very large number of recursive executions. But just starting with the “large number of executes” is enough to get going – the tkprof output can tell us what they were.
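
      A sketch of how the experiment might start (the names are invented, and flushing the shared pool is strictly a trick for a private test system):

      rem     Interval partitioning keeps the DDL short - each value of id
      rem     beyond the existing high partition creates a new partition.

      create table t_many_parts (
              id      number,
              padding varchar2(10)
      )
      partition by range (id) interval (1) (
              partition p0 values less than (1)
      );

      rem     Slow, but it only has to be done once: 100,000 partitions.
      insert into t_many_parts
      select rownum, 'x' from dual connect by level <= 100000;
      commit;

      alter system flush shared_pool;

      rem     Trace a first query against the now-cold library cache
      alter session set events '10046 trace name context forever, level 8';
      select count(*) from t_many_parts;
      alter session set events '10046 trace name context off';

      Run the resulting trace through tkprof and the recursive statement summary will show how much recursive dictionary work the partition metadata generates.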

      Comment by Jonathan Lewis — April 28, 2017 @ 2:14 pm BST Apr 28,2017 | Reply

    • Hi,
      Cursor handling is the problem:
      opening and closing the cursor in a loop, or
      preparing the statement inside the loop, or ….

      Thanks,
      Andjelko

      Comment by andjelkoblog — April 29, 2017 @ 12:37 am BST Apr 29,2017 | Reply

  4. >>most obvious deduction you can make about the cause of the problem
    The amount of work done by recursive sql statements, and the number of internal sql statements (495) executed alongside the 12,666 user sql statements (but I guess that is obvious…)

    I have a feeling there is some correlation between the number of internal sql statements (495) and the wait events reported for the user sql statistics (SQL*Net message to client 497 and SQL*Net message from client 496).
    I was under the impression that SQL*Net message from client is not considered in the calculation of elapsed time in sql statistics, so “Elapsed times include waiting on following events” appears to be misleading.
    Out of 742 elapsed seconds in trace file, about 685 seconds (635 for recursive and 50 for user) were spent on sql execution, which means 57 seconds spent on idle time, which accounts for “SQL*Net message from client” waits.
    Most of the work is being done (and time is being spent) during EXECUTE phase of recursive sql statements. That would make me think it is DML but then not sure why so many FETCH calls. Could it be a DML with RETURNING INTO that causes this?

    >>what sort of action would you take next
    Look at user sql statements to find the root cause for recursive sql statements??

    Comment by Narendra — April 28, 2017 @ 11:52 am BST Apr 28,2017 | Reply

    • Narendra,

      Good to do the (unaccounted time) sanity check when things get a little extreme – we might have found that we had to do something very subtle to find lost time if we had seen a large discrepancy between file time and activity time; but when the first thing that stands out is “very large numbers of very small bits of activity” it’s almost certain that we’ve found a design error in the coding that needs to be fixed before we worry about fine tuning.

      Comment by Jonathan Lewis — April 28, 2017 @ 2:18 pm BST Apr 28,2017 | Reply

      • Jonathan,
        Aah… I missed the wood for the trees… again.
        Thank you for your update. I also incorrectly assumed that recursive must be sys-recursive & missed out the fact that there are zero current mode gets.

        Comment by Narendra — April 29, 2017 @ 6:14 am BST Apr 29,2017 | Reply

  5. Ah… the lesser spotted trace file… rarely seen in the wild these days… beautiful plumage…
    – partly due to ASH
    – partly because it’s just too “difficult” or “undesirable”, particularly for large organisations, to give access to such information to those who want it
    – partly because siloed responsibilities mean that many developers don’t know how to interpret the trace files (or write performant code in the first place) and that many DBAs don’t know the application code, don’t really know SQL and PL/SQL, and can’t get past the first question on the tuning-by-checkbox list which says “is there a stale_stats = Y flag somewhere in the whole database? well, it must be stats then”… sigh…

    Comment by Dom Brooks — April 28, 2017 @ 5:58 pm BST Apr 28,2017 | Reply

    • Dom,
      I thought you would have been “institutionalised” by now…LOL
      When DBAs are allowed only a fraction of the time needed to be spent on a database out of hundreds of databases, and when stale stats turn out to be the root cause of the performance problem in 99 out of 100 databases, it is harsh to point a finger at DBAs.
      Management decisions drive this culture… maybe because it earns them big bonuses… especially in large organisations

      Comment by Narendra — April 29, 2017 @ 6:27 am BST Apr 29,2017 | Reply

      • Unfortunately your comment adds weight to my point.
        Unless your database applications are typically developed by developers who do not know much detail about Oracle, stale stats are unlikely to be the root cause of performance problems in 99 out of 100 databases.
        Even the built-in job is there to take that out of the equation most of the time.

        Even if the stale stats flag = ‘Y’, correlation does not mean causation.

        Even if you calc stats and the problem goes away, does not mean it was stale stats.

        The default definition of stale stats is 10% data changed. So, if your stats say 1,000,000 rows rather than 1,100,000 rows… big deal!!

        Stale stats are over-hyped as a solution; typically they are only a problem for
        – highly dynamic tables quickly going from few rows to many and then back down to few
        – out-of-range predicates
        – partition-level vs table-level stats.

        If you recalculate stats then you change one input into a complicated model which includes stats (and within stats “complicated” things like histograms), sampled data, dates, peeked binds etc.
        Even then, in any decent system, the probability is that it is much more likely to have been something like
        – ACS
        – cardinality feedback
        – uncontrolled histogram changes
        – just plain old different binds from when the binds were peeked.

        Only if you go through at least some of the mathematical input / output numbers can you prove whether it was stale stats.

        If in 99 out of 100 databases, stale stats really actually are the root cause for you, then perhaps you should consider changing default dynamic sampling levels.
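
        By way of illustration – a sketch, with a hypothetical schema name – you can put the stale flag next to the actual change volume before drawing any conclusion:

        rem     Force the in-memory DML monitoring counters to disk first
        execute dbms_stats.flush_database_monitoring_info

        select
                m.table_name,
                m.inserts, m.updates, m.deletes,
                s.num_rows, s.last_analyzed, s.stale_stats
        from
                dba_tab_modifications   m,
                dba_tab_statistics      s
        where
                m.table_owner    = 'MY_SCHEMA'          -- hypothetical schema
        and     s.owner          = m.table_owner
        and     s.table_name     = m.table_name
        and     m.partition_name is null
        and     s.partition_name is null
        order by
                m.inserts + m.updates + m.deletes desc
        ;

        A million-row table that has gained 100,001 rows and a near-empty table that has just loaded millions of rows both show stale_stats = ‘YES’, but only one of them is likely to be hurting your plans.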

        Comment by Dom Brooks — May 10, 2017 @ 9:30 am BST May 10,2017 | Reply

  6. “If in 99 out of 100 databases, stale stats really actually are the root cause for you, then perhaps you should consider changing default dynamic sampling levels.” — If I had a dollar for every time I had to correct someone’s notion that a “stale” stat necessarily lights the path to fixing a SQL problem – I’d be rich I tell you – rich, rich, rich. In my mind, the constant presence of “stale” statistics means that there is something wrong with the overall administration of that database.

    Comment by Robert Freeman — August 29, 2017 @ 9:22 pm BST Aug 29,2017 | Reply

  7. […] Quiz Night 32 (April 2017): what pointers do you get from the summary lines of this tkprof output? […]

    Pingback by Quiz Catalogue | Oracle Scratchpad — August 20, 2022 @ 3:13 pm BST Aug 20,2022 | Reply

