One of my favourite “little additions” in 10g is the v$event_histogram view. From a very short report it produces an extremely useful addition to the information about lost time. In v$system_event, we can get a report about the average wait time for an event – but when you condense a lot of data into a single number you always lose critical information. v$event_histogram addresses that problem.
The view is defined as:
Name Null? Type
----------------------- -------- ----------------
EVENT# NUMBER
EVENT VARCHAR2(64)
WAIT_TIME_MILLI NUMBER
WAIT_COUNT NUMBER
And here’s an example of the simplest query, with a couple of result sets, that you could use against it:
select
*
from v$event_histogram
where event = 'db file sequential read'
order by
wait_time_milli
;
EVENT# EVENT WAIT_TIME_MILLI WAIT_COUNT
---------- ------------------------- --------------- ----------
115 db file sequential read 1 230721
115 db file sequential read 2 161702
115 db file sequential read 4 4254007
115 db file sequential read 8 14293447
115 db file sequential read 16 4148671
115 db file sequential read 32 1683639
115 db file sequential read 64 571850
115 db file sequential read 128 40053
115 db file sequential read 256 1114
115 db file sequential read 512 70
115 db file sequential read 1024 23
115 db file sequential read 2048 20
EVENT# EVENT WAIT_TIME_MILLI WAIT_COUNT
---------- ------------------------- --------------- ----------
115 db file sequential read 1 11643
115 db file sequential read 2 144
115 db file sequential read 4 290
115 db file sequential read 8 779
115 db file sequential read 16 3162
115 db file sequential read 32 2229
115 db file sequential read 64 1445
115 db file sequential read 128 1001
115 db file sequential read 256 382
115 db file sequential read 512 38
115 db file sequential read 1024 15
115 db file sequential read 2048 5
As you can see from the results, the wait_time_milli column keeps doubling up.
The first sample is telling me that: 230,721 reads took less than 1 millisecond; 161,702 reads took between 1 and 2 milliseconds; 4,254,007 reads took between 2 and 4 milliseconds, and so on. Rather than having just one average read time, I now have a small spread of figures that can give me a much better idea of what’s going on.
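A simple way to see that spread at a glance, assuming the view definition above, is to report each bucket as a percentage of the total waits for the event – a sketch using the ratio_to_report() analytic function:

rem  Report each wait bucket as a fraction of all waits for the event.
rem  ratio_to_report() saves a separate query against the total.

select
        wait_time_milli,
        wait_count,
        round(100 * ratio_to_report(wait_count) over (), 2)   pct
from
        v$event_histogram
where
        event = 'db file sequential read'
order by
        wait_time_milli
;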
The database that produced the first output had been up for about 8 hours, and the time range is really too large for you to make a good guess about what’s going on – but a huge fraction of the reads are in the 4 – 16 millisecond band, which suggests that the disks are doing a lot of work but are coping reasonably well.
However, you do see a very long tail with some very slow reads (up to 2 whole seconds). If you had a few snapshots at finer granularity, you would see the disks running close to capacity and every few minutes a small peak in the load pushing them into a brief catastrophe. This site is operating in the knee of Cary Millsap’s (in)famous response-time curve.
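One way to get those finer-grained snapshots is to take a baseline copy of the view and report the deltas a few minutes later. A rough sketch – the table name hist_snap is my own invention:

rem  Take a baseline copy of the histogram ...

create table hist_snap as
select
        sysdate         snap_time,
        event,
        wait_time_milli,
        wait_count
from    v$event_histogram
where   event = 'db file sequential read'
;

rem  ... wait a few minutes, then subtract the baseline counts
rem  to see only the waits that happened in the interval.

select
        h.wait_time_milli,
        h.wait_count - s.wait_count     delta_count
from
        v$event_histogram       h,
        hist_snap               s
where
        h.event           = s.event
and     h.wait_time_milli = s.wait_time_milli
and     h.event           = 'db file sequential read'
order by
        h.wait_time_milli
;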
In the second data set – an “accident” I engineered on my laptop a few minutes after start up – you can see a pattern which shows the real power of the histogram information.
On average, the disk time (which I collected from v$system_event) didn’t seem to be too bad – it was a little slow at 16 milliseconds. But when you look at the histogram you can see that a large fraction of the reads are taking much more than 16 milliseconds, and some are (again) tailing all the way out to 2 seconds (but 5 reads out of 21,000 is a lot more significant than the 20 out of 28,000,000 that we saw in the first set of results).
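If you want to see the single “crunched” number next to the histogram, this is the sort of query I used against v$system_event – remembering that in 10g the average_wait column is reported in centiseconds, hence the multiplication by 10:

rem  The one "crunched" number for the event.
rem  average_wait is in centiseconds in 10g, so * 10 gives ms.

select
        event,
        total_waits,
        average_wait * 10       avg_wait_ms
from
        v$system_event
where
        event = 'db file sequential read'
;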
You never get perfect information – but it’s amazing how ten or fifteen lines of data about a wait event can give you so much more than just that one crunched number in v$system_event.
What you have here is a lot of reads that are putting the I/O subsystem under stress, but the file-system buffer – or some other clever caching component in the stack down to the spinning discs – is hiding the problem by serving up a lot of the reads from memory.
This emulates a scenario I have seen quite often. Some very inefficient SQL (often over-heavy on index usage) is recycling blocks very quickly, damaging the Oracle cache for everyone else but managing to revisit blocks just fast enough to recover them from a secondary buffer.
Sometimes one person is killing the system, and everyone else is complaining, when “on average” the disk drives look quite healthy. Now you can produce a summary report that explains the apparent contradiction.
Yes, this is one of the best features of 10g. I always look for that “double-hump” in the histogram to indicate possible problems. Interpreting it is sometimes a tricky business of course.
I’ve seen a triple-hump effect on direct path reads, with the cause being too high a default degree of parallelism on a generally quiet system. The low wait times came from queries running as the only active session on the system, the middle hump was normal operations, and the highest was a system swamped with too many PQ processes, leading to a drastic reduction in throughput.
Comment by David Aldridge — January 11, 2007 @ 3:20 pm GMT Jan 11,2007 |
Can you explain the double-hump and triple-hump effects, and how they indicate a possible problem?
Thanks !
Comment by Rita — February 6, 2007 @ 4:22 am GMT Feb 6,2007 |
A Statspack report includes data from v$event_histogram, but an AWR report does not (in 10g).
Comment by Dude — February 8, 2008 @ 4:10 pm GMT Feb 8,2008 |
Dude,
Correct – see https://jonathanlewis.wordpress.com/2007/01/31/event-histograms-2/
Comment by Jonathan Lewis — February 8, 2008 @ 5:18 pm GMT Feb 8,2008 |
Great Google hit! I was really looking for information on column histograms; but this was so much more interesting. With 64 different events, it reports a LOT of information. I’m only sorry I found it AFTER I rebooted the host yesterday. Had I only known :(
Thanks for posting!
Comment by David Russell — May 3, 2008 @ 12:13 am BST May 3,2008 |
David,
Thanks for the comment – amongst other things it’s brought to my attention that I didn’t answer Rita’s question in comment #2.
There is an example of this in the second posting about event histograms, where we see a large volume of data centred around the 8 ms mark, and a second large volume of data with a much slower response time centred somewhere near the 1 second mark. This resolved into two different SANs with different performance patterns. That’s what I mean by a “double hump”.
Comment by Jonathan Lewis — May 3, 2008 @ 10:06 am BST May 3,2008 |
[…] Filed under: Statspack, Troubleshooting — Jonathan Lewis @ 8:38 pm UTC Jan 31,2007 In an earlier blog, I mentioned the 10g view v$event_histogram and described how useful it could […]
Pingback by Event Histograms - 2 « Oracle Scratchpad — June 16, 2008 @ 9:10 am BST Jun 16,2008 |
An update on this in 11gR2:
We now have an additional column called last_update_time, which is very useful information.
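For anyone on 11.2 or later, a sketch of how that extra column can be used – last_update_time shows when each bucket was last incremented, so you can tell at a glance whether (for example) the slow outlier buckets have been touched recently or are hours old:

rem  11gR2 onwards: last_update_time shows when each bucket
rem  last changed, so stale buckets stand out.

select
        wait_time_milli,
        wait_count,
        to_char(last_update_time, 'dd-mon hh24:mi:ss')  last_updated
from
        v$event_histogram
where
        event = 'db file sequential read'
order by
        wait_time_milli
;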
santhosh
Comment by Santhosh Nair — April 27, 2010 @ 6:30 pm BST Apr 27,2010 |
[…] You could get more detailed information thanks to the gv$event_histogram view (benefit of this view is described in this post). […]
Pingback by Measure oracle real-time I/O performance « bdt's oracle blog — November 20, 2012 @ 8:30 am GMT Nov 20,2012 |
[…] although the figures are close, a quick check on the v$event_histogram ranges ending 4096, 8192, 16384 and 32767 don’t quite match the precision figures given in […]
Pingback by 12c I/O debug | Oracle Scratchpad — July 2, 2013 @ 6:06 pm BST Jul 2,2013 |
[…] it provides some idea about I/O sub-system the database is running upon. Read Jonathan’s https://jonathanlewis.wordpress.com/2007/01/10/event-histograms/ and its subsequent parts to find more about the […]
Pingback by Reproduce Wait Event Histogram report as in AWR/Statspack | Let's Connect by Oracle… — June 26, 2015 @ 4:13 pm BST Jun 26,2015 |