Oracle Scratchpad

February 1, 2011

Bug

Filed under: Troubleshooting — Jonathan Lewis @ 8:48 pm BST Feb 1,2011

There’s a note on the OTN database forum that’s worth reading if you’re puzzled by some of the Statspack or AWR results in 11g. The author is talking about 11.2.0.2 and walks through a couple of inconsistent reports from Statspack – but this prompted me to take a look at some underlying details using version 11.1.0.6 – and I think there’s a bug (or change) in the way that enqueues (locks) are handled that can make both statspack and AWR reports “lose time”.

Essentially the following sequence of events is possible:

    Session A starts waiting on a TM lock
    after a few seconds v$system_event records a wait for “enq: TM – contention”.
    AWR (or statspack) takes a snapshot (which includes v$system_event).
    Some time later session A acquires the TM lock – the total_waits for this event does not get incremented in v$system_event.
    AWR (or statspack) takes the next snapshot

When you run the AWR/Statspack report across the interval, the “Top N Timed Events” and the other sections on wait events will show time waited only if the total_waits for the event has changed in the interval. But we incremented the count before the first snapshot, and haven’t incremented it since – so the time we spent waiting in this interval disappears. (The time we spent waiting in the previous interval will be in the report for the previous interval, of course.)

Note – this is not the same as the inconsistencies that used to appear in statspack with statement-level errors, where the total resource usage across multiple periods would be reported against the snapshot where the statement completed. In those cases we finally saw the total resource usage even though it was summed against one interval; in this case we lose track of any time spent waiting after the first snapshot.

Of course the problem is not a total whitewash. The time may appear in other parts of the report – for example it will show up in the “DB Time” total – and the results will only be seriously deceptive for waits that are (a) long and (b) are so unluckily synchronised that an interval with a lot of wait time in it doesn’t have a wait event start in the interval and increment the total_waits. The problem may, therefore, only appear as a slight inconsistency in most systems.

Footnote: I haven’t spent much time looking at this, so I don’t know which versions are affected, nor which enqueues the new mechanism applies to (it may only be “TM” enqueue, as other waits tend to time out and restart every few seconds). This note is just an early warning to help you recognise the situation if you ever come across it.

2 Comments »

  1. [...] This post was mentioned on Twitter by blogs of the world, SQL Manager. SQL Manager said: OraScratchpad. Bug: There’s a note on the OTN database forum that’s worth reading if you’re puzzled by some of the… http://dlvr.it/FbN52 [...]

    Pingback by Tweets that mention Bug « Oracle Scratchpad -- Topsy.com — February 4, 2011 @ 4:39 am BST Feb 4,2011 | Reply

  2. [...] Lewis once again identified that there’s a bug (or change) in the way that enqueues (locks) are handled that can make both [...]

    Pingback by Log Buffer #213, A Carnival of the Vanities for DBAs | The Pythian Blog — February 4, 2011 @ 4:54 pm BST Feb 4,2011 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Theme: Rubric. Get a free blog at WordPress.com

Follow

Get every new post delivered to your Inbox.

Join 4,173 other followers