Oracle Scratchpad

December 9, 2014

Parse Time

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 11:53 pm GMT Dec 9,2014

Here’s a little query I wrote some time ago to see where my time went while running a query. It’s nothing sophisticated, just one of those simple things you can do with v$active_session_history (or dba_hist_active_sess_history, if you don’t get to the crime scene in time).


set null CPU

select
        sql_exec_id, in_parse, in_hard_parse, event, count(*)
from
        v$active_session_history
where
        sql_id = '{your choice here}'
group by
        sql_exec_id, in_parse, in_hard_parse, event
order by
        sql_exec_id, in_parse, in_hard_parse, count(*)
;

SQL_EXEC_ID I I EVENT                                            COUNT(*)
----------- - - ---------------------------------------------- ----------
   40341649 N N db file sequential read                                68
   40341649 N N CPU                                                    21
   40341649 N N db file scattered read                                  9
            Y Y CPU                                                     7

I had run the query that I was tracking exactly once, but my ASH query allows for, and separates, multiple executions of the same query by summing on sql_exec_id (the value that SQL Monitor also uses). The last row looks a little odd, though: it doesn’t have a value for sql_exec_id; that’s because those are ASH samples taken while the query was being optimized, not executed – note that I’ve reported the columns in_parse and in_hard_parse, and both are set to “Y” for that row.

So (statistically speaking) it’s probably taken about 7 CPU seconds for Oracle to optimise the statement, and from the rest of the results you can see that it’s taken about 21 CPU seconds to run, with 68 seconds spent on random I/Os and 9 seconds spent on multiblock reads – about 98 seconds of execution, or roughly 105 seconds elapsed including the parse.

Seven seconds sounds like quite a lot of time for parsing – but it was a fairly complex statement. However, the reason I’d been running the statement on a test system (a fairly good clone of production) was that I’d been seeing something stranger on production and I needed to get a baseline on the test system before I started trying to fix the problem. Here’s the equivalent ASH report for the same statement when it had run on production at a time that allowed me to capture its ASH samples.


SQL_EXEC_ID I I EVENT                                            COUNT(*)
----------- - - ---------------------------------------------- ----------
  514257929 N N CPU                                                    21
  514257929 Y Y latch: row cache objects                                1
            Y Y CPU                                                   119

Note the 119 CPU seconds spent parsing to run a 22-second query! But that wasn’t the worst of it – sometimes the results looked more like this:


SQL_EXEC_ID I I EVENT                                            COUNT(*)
----------- - - ---------------------------------------------- ----------
  523748347 N N db file sequential read                                 1
  523748347 N N CPU                                                    32
            Y Y resmgr:cpu quantum                                     11
            Y Y latch: row cache objects                               38
            Y Y CPU                                                   415

That’s a pretty catastrophic optimisation time – especially since the statement can be optimised in seven seconds in another environment. You might note the resource manager kicking in there: the session is exceeding the CPU limit set for its resource group – though not very often, given how infrequently it seems to be waiting on “resmgr:cpu quantum”. But there’s another important wrinkle to this report – which you can see when you compare v$active_session_history with v$sql.


select
        sql_id,
        round(cpu_time/1000000,2)       cpu,
        round(elapsed_time/1000000,2)   ela
from
        v$sql
where
        sql_text like '{some identifying text}'
and     sql_text not like '%v$sql%'
;

SQL_ID               CPU        ELA
------------- ---------- ----------
2atyuc3vtpswy     285.81     516.13

The figures from v$sql don’t match the summed results from ASH very well: ASH shows a total of 497 samples (i.e. roughly 497 seconds) of which 447 are CPU. I can live with a statistical error of about 4% ((516-497)/516) in a random sample for the total time, but how do you explain the 36% error in the CPU time?

The samples reporting “resmgr:cpu quantum” are a clue: the machine is overloaded; it’s trying to use far more CPU time than is available. As a result a process that gets pushed off the CPU by the operating system scheduler while it’s running can spend a long time in the run queue waiting to start running again. And if it’s an Oracle process that got pre-empted it doesn’t “know” that it’s not running, it didn’t put itself into a wait state so all it “knows” is that it’s not in a wait state.

So how do ASH and v$sql differ? The code that derives the cpu_time for v$sql issues a call to the O/S asking “how much CPU have I used?”. The code that takes an ASH sample asks: “is this session active? If so, is it in a wait state?” – and if it’s not in a wait state then it’s either on the CPU or in the run queue waiting for the CPU. So when we compare v$sql with ASH, the difference in CPU is (statistically speaking) time spent in the run queue. Of our 447 seconds of CPU recorded by ASH, then, we spent about 161 seconds in the run queue waiting for CPU.
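
That arithmetic can be packaged into a single query. This is an untested sketch, relying on the fact that ASH rows with a null event are the “on CPU or in the run queue” samples; the child_number filter is an assumption that the statement has a single child cursor:

select
        ash.cpu_samples,
        round(sq.cpu_time/1000000,2)                    cpu_secs,
        ash.cpu_samples - round(sq.cpu_time/1000000,2)  run_queue_est
from
        (
        select  count(*) cpu_samples
        from    v$active_session_history
        where   sql_id = '{your choice here}'
        and     event is null           -- on CPU or waiting for CPU
        )       ash,
        (
        select  cpu_time
        from    v$sql
        where   sql_id = '{your choice here}'
        and     child_number = 0        -- assumes a single child cursor
        )       sq
;

As with the original report, the result is only statistically meaningful for statements that accumulate a reasonable number of samples.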

We still have to account for the difference between the 7 CPU seconds on the test system and the variation between 119 CPU seconds and 415 CPU seconds of optimisation on production. In a word – concurrency. Apart from everything else going on at the time there were, in the worst case, 16 slightly different versions of the same statement being called at the same time (on a machine with 24 cores) – all 16 statements were competing violently for the same resources at the same time, and the escalating conflict as more sessions joined in produced an exponential growth in time spent competing for resources rather than doing the job. (I managed to demonstrate the effect quite nicely during the evening by limiting the batch to 4 concurrent executions – and got a typical parse time of 40 CPU seconds.)

I’ve often warned people about the problems of concurrency and encouraged them to think about how much time is being spent in competition rather than doing the job; I think this is the most extreme case I’ve seen in a production system. Given how dramatic the variation is, I can’t help wondering if the problem has been exaggerated by some corner case of sessions spinning for mutexes or latches; perhaps even an error in the code that allows resource management to put a session into “resmgr:cpu quantum” while it’s holding a latch or mutex. (I wasn’t able to emulate such an extreme display of the problem on a slightly newer version of Oracle, but I was able to construct a test that demonstrated the effect with a much smaller wastage of CPU.)

The solution (almost certainly): the statements are extremely similar, varying in just one predicate that is using a literal constant. It ought to be a relatively safe and simple change to make the query use a bind variable in that predicate. If the solution is adopted I’d expect to see the (once only) parse time on production drop back to about 7 seconds. Of course, if any other session tries to call the same statement at the same time it ought to end up reporting 7 seconds waiting on “cursor: pin S wait on X” before it starts executing – but that 7 seconds wait is a lot better than an extra 493 CPU seconds trying to optimise the “same” statement at the same time.
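
To illustrate the type of change (the real statement was far more complex; the table name and predicate here are invented):

-- Before: 16 near-identical statements, each optimised separately
select {column list} from orders where batch_id = 1234;
select {column list} from orders where batch_id = 1235;

-- After: one shared cursor, optimised once
variable b1 number
exec :b1 := 1234
select {column list} from orders where batch_id = :b1;

At the application level the equivalent change is to supply the value as a bind variable in a prepared statement rather than concatenating the literal into the SQL text.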

Footnote:

Running a minor variation on my ASH query to report the sql_plan_hash_value along with the sql_exec_id, I found that the “fix-up” code that updates older ASH rows with information that only becomes available later (e.g. the plan_hash_value when optimising, or a long time_waited for a single wait event) only goes back 255 rows – so when I queried ASH for the statements that took 500 seconds to optimise, only the most recent 255 of the in_parse rows showed the final sql_plan_hash_value.
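
The variation is just the extra column in the select list and group by – reconstructed here as an illustrative sketch:

select
        sql_exec_id, sql_plan_hash_value, in_parse, in_hard_parse, event, count(*)
from
        v$active_session_history
where
        sql_id = '{your choice here}'
group by
        sql_exec_id, sql_plan_hash_value, in_parse, in_hard_parse, event
order by
        sql_exec_id, sql_plan_hash_value, in_parse, in_hard_parse, count(*)
;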


11 Comments »

  1. Jonathan,

    Thanks for the blog and well documented example about the CPU usage.

    Thanks
    Aswath

    Comment by Aswath Rao — December 10, 2014 @ 2:11 am GMT Dec 10,2014 | Reply

  2. There is a small caveat to this SQL: if there are multiple child cursors for a statement, or multiple hard parses of the same statement with only a single child cursor, then there is no way to identify each hard parse. I think there is no way in ASH to identify each hard parse individually, although each execution can be uniquely identified by sql_exec_id.

    Regards,
    Vipan

    Comment by vipankumarsharma — December 10, 2014 @ 6:49 am GMT Dec 10,2014 | Reply

    • Vipan,

      Thanks for the comment.

      You’re right – given that samples where the statement is still being optimized don’t show an exec id (or, I’ll guess, an exec start time) my query could show multiple exec_ids for different executions of the statement but sum the parse-related samples across all the hard parses that had taken place.

      I knew that I didn’t have to worry about this problem, which is why I didn’t think about it when writing the post. That’s a nice demonstration of how hard it is to write generic articles based on specific experiences; no doubt if I had actually hit the problem you described I would have decided on the most appropriate workaround (possibly based on sample_time) and written a longer article – and still missed a variant that someone else might have seen.

      Footnote: I had an email chat on the Oak Table about another anomaly in ASH when you get a very long parse time (the plan_hash_value only gets copied back 255 rows into history), maybe the SQL_EXEC_ID needs to be copied back into the triggering parse call as well.

      Comment by Jonathan Lewis — December 11, 2014 @ 9:21 am GMT Dec 11,2014 | Reply

      • I noticed more discrepancies between v$sql and v$ASH. I ran the same SQL three times and v$sql is spot on with the information about hard parses and parses, but v$ASH is not.

        SQL> @sqlid 7mp1avc8vpxc9 %
        Show SQL text, child cursors and execution stats for SQLID 7mp1avc8vpxc9 child %
        
        HASH_VALUE PLAN_HASH_VALUE  CH# SQL_TEXT
        ---------- --------------- ---- ---------------------------------------------------------------------------------------
         297465225       405148644    0 select max(id) from t4 where status='P'
         297465225       405148644    1 select max(id) from t4 where status='P'
        
        
         CH#  PLAN_HASH     PARSES   H_PARSES FIRST_LOAD_TIME     LAST_LOAD_TIME      LAST_ACTIVE_TIME  EXECUTIONS    FETCHES ROWS_PROCESSED    ELA_SEC       LIOS       PIOS      SORTS USERS_EXECUTING
        ---- ---------- ---------- ---------- ------------------- ------------------- ----------------- ---------- ---------- -------------- ---------- ---------- ---------- ---------- ---------------
           0  405148644          1          1 2014-12-11/13:09:40 2014-12-11/13:09:40 12/11/14 13:10:20          1          1              1  40.153674    1201320     600890          0               0
           1  405148644          2          1 2014-12-11/13:09:40 2014-12-11/13:11:40 12/11/14 13:20:20          2          2              2  61.301935    2400306    1200416          0               0
        
        
        SQL> select sql_id,sql_exec_id,SQL_EXEC_START,sql_child_number,IN_PARSE,IN_HARD_PARSE,COUNT(1) from V$ACTIVE_SESSION_HISTORY where sql_id='7mp1avc8vpxc9' group by sql_id,sql_exec_id,SQL_EXEC_START,IN_PARSE,IN_HARD_PARSE,sql_child_number;
        
        SQL_ID        SQL_EXEC_ID SQL_EXEC_START     CH# I I   COUNT(1)
        ------------- ----------- ----------------- ---- - - ----------
        7mp1avc8vpxc9    16777217 12/11/14 13:11:40    1 N N         30
        7mp1avc8vpxc9    16777216 12/11/14 13:09:42    0 N N         38
        7mp1avc8vpxc9                                  0 Y Y          2
        7mp1avc8vpxc9    16777218 12/11/14 13:19:49    1 N N         31
        
        

        As you can see above, in v$ASH the parsing information is only associated with the SQL_ID, not with the SQL_EXEC_ID and SQL_EXEC_START. So we cannot determine which execution of the SQL spent more time on parsing if the SQL is executed more than once.

        Antony

        Comment by antony — December 11, 2014 @ 7:08 pm GMT Dec 11,2014 | Reply

        • Antony,

          Your closing comment seems to be restating Vipan’s comment.

          Two things about your particular set of results, though:

          The “in_parse” rows seem to be linked with the sql_id and child number, so it looks as if the first hard parse took around 2 seconds but the second was too fast to be captured in ASH at all. We could cross-check this by looking at the sample_time of the two rows, just to ensure that they were consecutive rather than being separated by the two minutes indicated by v$sql.last_load_time and v$active_session_history.sql_exec_start.
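
          A sketch of that cross-check (untested), using Antony’s sql_id:

          select  sample_time, sql_child_number, event
          from    v$active_session_history
          where   sql_id = '7mp1avc8vpxc9'
          and     in_parse = 'Y'
          order by
                  sample_time;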

          The time recorded in the two views is a good match: 40 seconds elapsed for child zero in v$sql against 38 + 2 samples in v$active_session_history; 61 seconds elapsed for child 1 in v$sql against 61 samples in v$active_session_history. (The figures also tend to corroborate the correctness of the allocation of in_parse rows to the right child.)

          Comment by Jonathan Lewis — December 12, 2014 @ 11:43 am GMT Dec 12,2014

  3. I always use SQL_EXEC_START as well as SQL_EXEC_ID because in the past I have found SQL_EXEC_ID to be non-unique/unreliable on its own, for reasons unknown/uninvestigated.

    The documentation seems to back this up:
    https://docs.oracle.com/cd/E18283_01/server.112/e17110/dynviews_3052.htm

    Together, the three columns SQL_ID, SQL_EXEC_START, and SQL_EXEC_ID represent the execution key. The execution key is used to uniquely identify one execution of the SQL statement.
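
    Following that advice, the grouping in the query from the post could use the full execution key – an untested variant:

    select
            sql_exec_start, sql_exec_id, in_parse, in_hard_parse, event, count(*)
    from
            v$active_session_history
    where
            sql_id = '{your choice here}'
    group by
            sql_exec_start, sql_exec_id, in_parse, in_hard_parse, event
    order by
            sql_exec_start, sql_exec_id, in_parse, in_hard_parse, count(*)
    ;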

    Comment by Dom Brooks — December 10, 2014 @ 10:54 am GMT Dec 10,2014 | Reply

    • Yes, it happens across DB bounces.

      Comment by vipankumarsharma — December 10, 2014 @ 12:15 pm GMT Dec 10,2014 | Reply

    • Dom,

      I’ll run up a test when I get a little time – but I’d guess that since the sql_exec_id isn’t copied back to the optimisation rows the sql_exec_start wouldn’t be either.

      Comment by Jonathan Lewis — December 11, 2014 @ 9:23 am GMT Dec 11,2014 | Reply

      • I wasn’t referring to the anomalies you and Vipan have discussed above, just to generic querying/grouping by sql_exec_id – it may well be only because of the restarts Vipan has mentioned, I’m not sure – but in that case it’s most relevant to the DBA_HIST table.

        Comment by Dom Brooks — December 11, 2014 @ 10:08 am GMT Dec 11,2014 | Reply

  4. Hello,
    I can’t understand how one can relate, for a given sql ID, the result of a count on v$ash where in_[hard_]parse = ‘Y’ to the values of any column coming from v$sql[stats]… I thought there was no way to get at any parse-related stats of a statement through the v$ash view (if only because v$ash is historicised data, whereas v$sql is not).

    Over here a simple test to check these differences on a few sql IDs showed things like:

    select  parse_calls, round(cpu_time/1000000,2), round(elapsed_time/1000000,2)
    from    v$sqlstats
    where   sql_id = '2bur9vvtvqkkc'
    union all
    select  count(*), 0, 0
    from    v$active_session_history
    where   sql_id = '2bur9vvtvqkkc'
    and     (in_parse = 'Y' or in_hard_parse = 'Y')
    ;
    
    PARSE_CALLS ROUND(CPU_TIME/1000000,2) ROUND(ELAPSED_TIME/1000000,2)
    ----------- ------------------------- -----------------------------
          92036                    465.84                       2459.07
              2                         0                             0
    

    or

    PARSE_CALLS ROUND(CPU_TIME/1000000,2) ROUND(ELAPSED_TIME/1000000,2)
    ----------- ------------------------- -----------------------------
            647                    450.23                         628.7
              2                         0                             0
    

    Are you sure it’s relevant to try to compare these values?

    Thanks.
    Seb

    Comment by Seb — February 3, 2015 @ 3:42 pm GMT Feb 3,2015 | Reply

    • Seb,

      Since ASH has a sample rate of once per second it wouldn’t make sense to try using it to examine the details of a query that averages less than a second – so your first example is right out. Your second example is borderline in terms of the information it might give; but if the first_load_time for the query was inside the limit of the ASH history, and if the number of samples for in_parse was a large fraction of the number of parse calls, I might suspect that a suspicious fraction of the total run time was spent in parsing – and I might then examine the ASH data in detail to see whether I could confirm that suspicion and determine how, why, and when it was happening.

      As you can see from the original note and follow-on examples in the comments, though, this type of analysis is really aimed at statements that aren’t run frequently but take a long time when they do run.

      Comment by Jonathan Lewis — February 4, 2015 @ 1:55 pm GMT Feb 4,2015 | Reply

