Oracle Scratchpad

January 10, 2007

Event Histograms

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 8:44 pm BST Jan 10,2007

[Forward to Part 2]

One of my favourite “little additions” in 10g is the v$event_histogram view. From a very short report it produces an extremely useful addition to the information about lost time. In v$system_event, we can get a report about the average wait time for an event – but when you condense a lot of data into a single number you always lose critical information. v$event_histogram addresses that problem.

The view is defined as:

Name                    Null?    Type
----------------------- -------- ----------------
EVENT#                           NUMBER
EVENT                            VARCHAR2(64)
WAIT_TIME_MILLI                  NUMBER
WAIT_COUNT                       NUMBER        

And here’s about the simplest query you could use against it, with a couple of result sets:


select
        *
from	v$event_histogram
where	event = 'db file sequential read'
order by
        wait_time_milli
;            

    EVENT# EVENT                     WAIT_TIME_MILLI WAIT_COUNT
---------- ------------------------- --------------- ----------
       115 db file sequential read                 1     230721
       115 db file sequential read                 2     161702
       115 db file sequential read                 4    4254007
       115 db file sequential read                 8   14293447
       115 db file sequential read                16    4148671
       115 db file sequential read                32    1683639
       115 db file sequential read                64     571850
       115 db file sequential read               128      40053
       115 db file sequential read               256       1114
       115 db file sequential read               512         70
       115 db file sequential read              1024         23
       115 db file sequential read              2048         20            

    EVENT# EVENT                     WAIT_TIME_MILLI WAIT_COUNT
---------- ------------------------- --------------- ----------
       115 db file sequential read                 1      11643
       115 db file sequential read                 2        144
       115 db file sequential read                 4        290
       115 db file sequential read                 8        779
       115 db file sequential read                16       3162
       115 db file sequential read                32       2229
       115 db file sequential read                64       1445
       115 db file sequential read               128       1001
       115 db file sequential read               256        382
       115 db file sequential read               512         38
       115 db file sequential read              1024         15
       115 db file sequential read              2048          5            

As you can see from the results, the wait_time_milli column keeps doubling: each row records the number of waits that took no more than that many milliseconds, but more than the previous row’s limit.

The first sample is telling me that: 230,721 reads took less than 1 millisecond; 161,702 reads took between 1 and 2 milliseconds; 4,254,007 reads took between 2 and 4 milliseconds, and so on. Rather than having just one average read time, I now have a small spread of figures that can give me a much better idea of what’s going on.
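If you want that spread as percentages rather than raw counts, a simple variation of the query (a sketch, using the ratio_to_report analytic function) does the arithmetic for you:

```sql
select
        wait_time_milli,
        wait_count,
        round(100 * ratio_to_report(wait_count) over (), 2)  pct_of_waits
from    v$event_histogram
where   event = 'db file sequential read'
order by
        wait_time_milli
;
```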

The database that produced the first output had been up for about 8 hours, and the time range is really too large for you to make a good guess about what’s going on – but a huge fraction of the reads are in the 4 – 16 millisecond band, which suggests that the disks are doing a lot of work but are coping reasonably well. 

However, you do see a very long tail with some very slow reads (up to 2 whole seconds). If you had a few snapshots at finer granularity, you would see the disks running close to capacity and every few minutes a small peak in the load pushing them into a brief catastrophe. This site is operating in the knee of Cary Millsap’s (in)famous response-time curve.
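You don’t need Statspack to get those finer-grained snapshots; a quick and dirty approach (just a sketch – the snapshot table name is my own invention) is to copy the current figures into a table, wait a few minutes, then report the difference:

```sql
create table event_hist_snap as
select  sysdate snap_time, event, wait_time_milli, wait_count
from    v$event_histogram
where   event = 'db file sequential read'
;

--      ... a few minutes later:

select
        cur.wait_time_milli,
        cur.wait_count - prev.wait_count     delta_waits
from
        v$event_histogram  cur,
        event_hist_snap    prev
where
        cur.event            = 'db file sequential read'
and     prev.event           = cur.event
and     prev.wait_time_milli = cur.wait_time_milli
order by
        cur.wait_time_milli
;
```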

In the second data set – an “accident” I engineered on my laptop a few minutes after start up – you can see a pattern which shows the real power of the histogram information.

On average, the disk time (which I collected from v$system_event) didn’t seem to be too bad – it was a little slow at 16 milliseconds. But when you look at the histogram you can see that a large fraction of the reads are taking much more than 16 milliseconds, and some are (again) tailing all the way out to 2 seconds (but 5 reads out of 21,000 is a lot more significant than the 20 out of 28,000,000 that we saw in the first set of results).
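As a cross-check, you can approximate that average from the histogram itself. All we know about each wait is that it fell between half the bucket limit and the limit itself, so taking three-quarters of wait_time_milli as a representative value for each bucket (a rough assumption) gives a figure to compare with v$system_event:

```sql
select
        round(
                sum(0.75 * wait_time_milli * wait_count) / sum(wait_count),
                2
        )  approx_avg_ms
from    v$event_histogram
where   event = 'db file sequential read'
;
```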

You never get perfect information – but it’s amazing how ten or fifteen lines of data about a wait event can give you so much more than just that one crunched number in v$system_event.

What you have here is a lot of reads that are putting the I/O subsystem under stress, but the file-system buffer – or some other clever caching component in the stack down to the spinning discs – is hiding the problem by serving up a lot of the reads from memory.

This emulates a scenario I have seen quite often. Some very inefficient SQL (often over-heavy on index usage) is recycling blocks very quickly, damaging the Oracle cache for everyone else but managing to revisit blocks just fast enough to recover them from a secondary buffer.

Sometimes one person is killing the system, and everyone else is complaining, when “on average” the disk drives look quite healthy. Now you can produce a summary report that explains the apparent contradiction.

[Forward to Part 2]

10 Comments

  1. Yes, this is one of the best features of 10g. I always look for that “double-hump” in the histogram to indicate possible problems. Interpreting it is sometimes a tricky business of course.

    I’ve seen a triple-hump effect on direct path reads, with the cause being too-high a default degree of parallelism on a generally quiet system. The low wait times were queries running as the only active session on the system, the middle one was normal operations, and the highest one being a system swamped with too many PQ processes leading to a drastic reduction in throughput.

    Comment by David Aldridge — January 11, 2007 @ 3:20 pm BST Jan 11,2007 | Reply

  2. Can you explain the double-hump and triple-hump effects, and how they indicate a possible problem?
    Thanks !

    Comment by Rita — February 6, 2007 @ 4:22 am BST Feb 6,2007 | Reply

  3. A Statspack report reports data from v$event_histogram while AWR does not on 10g.

    Comment by Dude — February 8, 2008 @ 4:10 pm BST Feb 8,2008 | Reply

  4. great google hit! I was really looking for information on column histograms; but, this was so much more interesting. With 64 different events, it reports a LOT of information. I’m only sorry I found it AFTER I rebooted the host yesterday. Had I only known :(

    Thanks for posting!

    Comment by David Russell — May 3, 2008 @ 12:13 am BST May 3,2008 | Reply

  5. David,
    Thanks for the comment – amongst other things it’s brought to my attention that I didn’t answer Rita’s question in comment #2.

    There is an example of this in the second posting about event histograms, where we see a large volume of data centred around the 8 ms mark, and a second large volume of data with a much slower response time centred somewhere near the 1 second mark. This resolved into two different SANs with different performance patterns. That’s what I mean by a “double hump”.

    Comment by Jonathan Lewis — May 3, 2008 @ 10:06 am BST May 3,2008 | Reply

  6. [...] Filed under: Statspack, Troubleshooting — Jonathan Lewis @ 8:38 pm UTC Jan 31,2007 In an earlier blog, I mentioned the 10g view v$event_histogram and described how useful it could [...]

    Pingback by Event Histograms - 2 « Oracle Scratchpad — June 16, 2008 @ 9:10 am BST Jun 16,2008 | Reply

  7. An update on this on 11gR2

    Now we have an additional column called last_update_time which is very useful info

    santhosh

    Comment by Santhosh Nair — April 27, 2010 @ 6:30 pm BST Apr 27,2010 | Reply

  8. [...] You could get more detailed information thanks to the gv$event_histogram view (benefit of this view is described in this post). [...]

    Pingback by Measure oracle real-time I/O performance « bdt's oracle blog — November 20, 2012 @ 8:30 am BST Nov 20,2012 | Reply

  9. […] although the figures are close, a quick check on the v$event_histogram ranges ending 4096, 8192, 16384 and 32767 don’t quite match the precision figures given in […]

    Pingback by 12c I/O debug | Oracle Scratchpad — July 2, 2013 @ 6:06 pm BST Jul 2,2013 | Reply

