Oracle Scratchpad

August 9, 2007

Volatility

Filed under: Performance,Troubleshooting — Jonathan Lewis @ 9:12 pm BST Aug 9,2007

This blog entry started life as a reference to a page on my website [ed: now on the Wayback machine] describing some simple code to take session-based snapshots of the work done and time lost. I’ve finally managed to find a bit of time to explain why I needed the code.

One of my clients calls me in from time to time to review the way things are going and to assist with new issues. On a recent visit I was introduced to a group who were seeing an enormous degree of variation in the run times of some of their reports.

To get some sort of measure of how bad the problem was they had scheduled a specific report to run every hour on the hour for a week. It was an absolutely fixed report, no bind variables, no sysdate references, nothing that would cause an execution path to change (if the stats were kept consistent with the data). After a week they had run times varying between 30 seconds and 16 minutes. It was a terrific way to demonstrate their point.

I often hear questions like “why does this query run so slowly?”, or “why is the performance so bad?”. You can’t even begin to answer questions like these without at least some raw data – such as execution plans, wait states or workload statistics. So that was why I hacked my snapshot package together – we knew that we had to find the extra time but we had no information about where to look for it.

If you’ve read the web page you will know that all the package does is take a snapshot of the v$session_event and v$mystat data for the session and write it to a couple of tables for subsequent analysis.
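If you can’t get at the archived page, the following is a minimal sketch of the idea – the table, procedure and parameter names here are my invention for illustration, not the ones used in the original package, and you would need direct grants on the underlying v_$ views for the procedure to compile:

create table snap_stats as
select  0 snap_id, sn.name, st.value
from    v$mystat st, v$statname sn
where   st.statistic# = sn.statistic#
and     1 = 0
;

create table snap_events as
select  0 snap_id, event, total_waits, time_waited, max_wait
from    v$session_event
where   1 = 0
;

create or replace procedure take_snapshot(i_snap_id in number) as
begin
        -- session statistics: v$mystat is already restricted
        -- to the current session
        insert into snap_stats
        select  i_snap_id, sn.name, st.value
        from    v$mystat st, v$statname sn
        where   st.statistic# = sn.statistic#;

        -- wait events for this session only
        insert into snap_events
        select  i_snap_id, event, total_waits, time_waited, max_wait
        from    v$session_event
        where   sid = sys_context('userenv','sid');

        commit;
end;
/

So all I had to do was add a couple of calls to the client’s report – one snapshot before the work and one after – and then sit back for a few hours, with the following results.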

Note – these are not actual results from the production system. I wasn’t able to copy them down from the machine; however, to highlight the issue, I have recreated reasonably accurate approximations of the most significant figures from memory.

In a good run, the workload (which was from a system running 10.2.0.2) looked like this:

NAME                                    VALUE
----------------------------------- ---------
session logical reads                 151,302
CPU used by this session                2,744
DB time                                 3,054
physical reads                              3       

NAME                         TOTAL_WAITS   WAIT_SEC     AVG_CS     MAX_CS
---------------------------- ----------- ---------- ---------- ----------
db file sequential read                3       0.02        .57          5       

The logical I/O was pretty flat at around 151,000 for every single run, and most of the time in the good runs was spent on pure CPU with most of the activity against cached data blocks. (Remember that ‘CPU used by this session’ and ‘DB time’ are reported in centiseconds, so a figure of 2,744 means roughly 27.4 seconds of CPU.) The rather high CPU time was partly due to optimisation time and partly due to sorting. There was also a fairly consistent ‘lost-time’ component of about 3 seconds – the difference between the CPU time recorded and the DB time recorded.

Compare this with the results from one of the slower runs.

NAME                                    VALUE
----------------------------------- ---------
session logical reads                 151,302
CPU used by this session                3,038
DB time                                19,924
physical reads                         79,184	       

NAME                         TOTAL_WAITS   WAIT_SEC     AVG_CS     MAX_CS
---------------------------- ----------- ---------- ---------- ----------
db file sequential read           79,184     166.29        .21          5       

Note how a lot of the session logical I/O has suddenly turned into (single block) physical reads. Almost all of the difference in time was due to the 166 seconds spent waiting for I/O. There were, however, a few extra seconds of CPU time in this run, which is not surprising since reading blocks into Oracle’s cache requires a lot of latch activity to find free buffers, followed by the work needed to re-arrange the LRU lists and cache buffers chains.

And finally a very slow run.

NAME                                    VALUE
----------------------------------- ---------
session logical reads                 151,302
CPU used by this session                3,401
DB time                                88,858
physical reads                        139,623       

NAME                         TOTAL_WAITS   WAIT_SEC     AVG_CS     MAX_CS
---------------------------- ----------- ---------- ---------- ----------
db file sequential read          139,623     851.70        .61          5       

Again almost all of the extra time was I/O time with a little CPU thrown in for the buffer management. But look at the average wait times in the last two reports. Not only are we performing more physical read requests in the slowest run but we also see that the average wait for each read is nearly three times as long.
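As an aside – given the illustrative snap_events table from the sketch above, the wait report is nothing more complicated than the delta between two snapshots (time_waited and max_wait are reported in centiseconds in v$session_event):

select
        e2.event,
        e2.total_waits - e1.total_waits                 total_waits,
        (e2.time_waited - e1.time_waited) / 100         wait_sec,
        round(
                (e2.time_waited - e1.time_waited) /
                nullif(e2.total_waits - e1.total_waits, 0),
                2
        )                                               avg_cs,
        e2.max_wait                                     max_cs
from
        snap_events e1,
        snap_events e2
where
        e1.snap_id      = 1
and     e2.snap_id      = 2
and     e2.event        = e1.event
and     e2.total_waits  > e1.total_waits
order by
        wait_sec desc
;

A production version would need an outer join to catch events that first appear between the two snapshots, but this is enough to show where the figures above come from.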

The performance of this report depends very strongly on the total system activity in the recent past.

If there has been a little conflicting activity in the recent past a fair amount of data can get kicked out of Oracle’s buffer cache; but with a little luck quite a lot of that data is still in the SAN cache – hence the average read times of 2.1 milliseconds (0.21 centiseconds) for the physical reads.

If there has been a lot of conflicting activity in the recent past, and if it’s still going on, then a lot of the required data has to be read from disc. But not only is the data missing from Oracle’s cache, it’s also missing from the SAN cache, and the SAN is busy doing work for other tasks. So more of the reads are real disk reads and they are going into a queue at the SAN – hence the average read times of 6.1 milliseconds (0.61 centiseconds).

On this site much of the day is spent reporting from a particular portion of recent data. But every now and again a batch load of new data appears. The batches are applied to a different portion of the data set and the batches may be small, medium or huge.

The performance of this report depends to an extraordinary degree on the size and timing of the most recent batch load.

In cases like this you may get lucky – especially if you are using partitioning – and be able to take advantage of multiple buffer pools. If your reporting requirement is about data set A and your loading requirement is about data set B it’s possible that creating a keep (or recycle) pool for data set B will allow you to isolate the worst of the damage.

It’s a bit of a long shot and there are side-effects to consider. Bear in mind that you are not using the extra pool “properly”, you’re only using it to split the buffer cache. Remember, also, that if you split the buffer cache into multiple pools your “reporting cache” has just lost a chunk of its memory. So think carefully about the possible impact before you try it.

On the plus side, it’s (usually) easy to implement and back out; you need only apply a couple of resize commands to the buffer cache, and “alter … storage()” commands to the chosen objects. Do test the commands on a development system though – there are a couple of oddities with things like partitioned LOBs (for example) which may make the attempt to change the storage() clause very slow, or even impossible, unless you can find exactly the right syntax.
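Purely as an illustrative sketch – the object names and sizes are invented, and this assumes manual cache sizing (no sga_target) on 10g with a default cache that starts at 1,600MB:

rem  shrink the default cache, then create a 400MB recycle pool
alter system set db_cache_size = 1200m;
alter system set db_recycle_cache_size = 400m;

rem  point the batch-load objects at the recycle pool
rem  (blocks already cached stay where they are until they age out)
alter table batch_load_data storage (buffer_pool recycle);
alter index batch_load_pk   storage (buffer_pool recycle);

rem  and to back it all out again
alter table batch_load_data storage (buffer_pool default);
alter index batch_load_pk   storage (buffer_pool default);
alter system set db_recycle_cache_size = 0;
alter system set db_cache_size = 1600m;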

Footnote:

I wrote an article a few months ago describing a possible threat when you enable both a keep and a recycle pool.

6 Comments »

  1. […] Tools, Performance, Cool stuff, Administration — tanelp @ 1:05 am Jonathan Lewis’s post inspired me to finally complete my version of the Oracle session performance snapper script, which […]

    Pingback by Oracle Session Snapper - real-time session-level performance stats for DBAs « Tanel Poder’s blog: Core IT for geeks and pros — August 19, 2007 @ 5:19 pm BST Aug 19,2007 | Reply

  2. One major consideration is the size of the table itself.
    I have some queries which vary in execution times similarly, but I cannot cache (KEEP) the table that is more than 4GB.

    Comment by Hemant K Chitale — September 16, 2007 @ 5:36 am BST Sep 16,2007 | Reply

  3. Hemant,

    One option for stopping the performance from varying is to put the big table into a “relatively small” RECYCLE pool, so that most block requests for the table are physical read requests.

    This may slow things down – on the other hand, if the table is generally responsible for a lot of the I/O anyway, it might not slow things down much for those queries; moreover, this strategy may allow other processes to find their data in the buffer more frequently, hence allowing them to do less physical I/O – which may result in the average I/O times decreasing, with the net effect that even the critical queries go faster.

    It’s counter-intuitive, but just one of the options to consider when faced with a problem like this.

    Comment by Jonathan Lewis — September 17, 2007 @ 11:05 am BST Sep 17,2007 | Reply

  4. Hmm. Interesting …

    Another, related question. How many sites actually use DEFAULT, KEEP and RECYCLE pools?

    The KEEP and RECYCLE pools cannot be dynamically added after the Instance is started?

    So, in your experience, do DBAs that use all 3 pools have some general guidelines about when and how to begin with all 3 together?

    For a very small number of databases, I have added a KEEP pool “so that I can use it, without a restart, if I need it”.

    Comment by Hemant K Chitale — September 17, 2007 @ 2:36 pm BST Sep 17,2007 | Reply

  5. […] version called the session-snapper. He wrote the tool because of an interesting blog about the lack of session level information. You can download the script here: http://www.tanelpoder.com/files/scripts/snapper.sql. He’s […]

    Pingback by Hotsos Symposium 2008 Day Two « Seven Seconds — March 5, 2008 @ 4:20 am GMT Mar 5,2008 | Reply

  6. […] Variable Performance (Aug 2007): a strategy for investigating if variations in performance are due to overloaded hardware. […]

    Pingback by Troubleshooting catalogue | Oracle Scratchpad — February 1, 2024 @ 9:01 pm GMT Feb 1,2024 | Reply

