Oracle Scratchpad

November 10, 2007

Analysing Statspack(8)

Filed under: Statspack,Troubleshooting — Jonathan Lewis @ 2:31 am GMT Nov 10,2007

[Further Reading on Statspack]

In an earlier article on Statspack I quoted some figures taken from an article by Don Burleson:

Here is an example of an Oracle 10g database with an undersized log buffer, in this example 512k:

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                   % Total
Event                            Waits    Time (s)   DB Time Wait Class
log file parallel write          9,670         291     55.67 System I/O
log file sync                    9,293         278     53.12 Commit
CPU time                                       225     43.12
db file parallel write           4,922         201     38.53 System I/O
control file parallel write      1,282          65     12.42 System I/O          

I have since found on Burleson’s website the report from which this extract was taken.

The report is a Word document (dated 2nd June 2005) at this URL and it is referenced from this thread (dated 15th June 2005 – so the document was published quite deliberately and has been on display for well over two years) on the Burleson Consulting Oracle Forum; it would be a good idea to read both before carrying on with this blog item, which is going to discuss various statistics in the report and suggest some preliminary observations you could draw from it. For cross-reference purposes, you might also find it useful to have the report available in a separate window as you read this blog item.

[Updated 14th Nov 2007: The document has been removed from the website although the forum thread that originally directed me to it is currently still available for review. The current version of the forum thread generates a 7 page pdf file, however, while the original version – currently still cached on Google – generates a 12 page pdf file – so just a tiny bit of history being rewritten there].

First of all we see (from the forum thread) that this is a Linux system with 4GB of RAM and a RAID-5 disk configuration. There is also a comment that this is a dedicated server – which in this context means that nothing else is running on the machine. It’s always useful to be aware of this type of information when reading through a Statspack report, as it gives us some extra hints about when some statistics are reasonable and when they might be suspect.

The report then tells us that the machine has 2 CPUs and the snapshot duration was one hour – which allows us to put the wait information into context. But before we worry about load profiles and waits, just take a look at the size of the buffer cache (124MB) and the shared pool (136MB) – that leaves us with about 3.7GB of “unused” memory. A quick check in the parameter list for the pga_aggregate_target shows that Oracle has been told to behave as if there is a (soft) limit of 92MB for process memory – and while we’re looking at the parameters we note that the sga_target has been set at 280MB. You then notice that the forum note says the database is only 300MB.
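
If you have the live system to hand rather than just the report, those memory parameters are easy to cross-check; the following is just a sketch against the standard parameter view:

rem     Sketch: cross-check the memory parameters quoted above.
rem     Values are in bytes, except processes.
select  name, value
from    v$parameter
where   name in (
            'sga_target', 'db_cache_size', 'shared_pool_size',
            'pga_aggregate_target', 'processes'
        )
order by name;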

So – with 4GB of RAM, 300MB of database, and nothing running on the machine – why isn’t the buffer cache set to a value that makes the entire database operate in memory ? (This is a question I jotted down while reading through the AWR report – it’s something to follow up later. It won’t necessarily solve any performance problems but it looks like such an obvious configuration strategy that the question has to be asked).

So let’s take a look at the load profile: 7KB of redo per second, 464 logical I/Os per second, 52 block changes per second, 1 physical read per second, 2 physical writes per second, 2.66 transactions per second. On average this database is almost idle. Take a look at the Top 5 timed events again in the light of a one hour snapshot (which, at 2 CPUs, means 7,200 CPU seconds – of which only 226 have been used) and again you can see that on average virtually nothing is happening.
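
As a back-of-envelope check on that arithmetic (just the sums, not Statspack output):

rem     2 CPUs for a one hour snapshot = 7,200 CPU seconds available;
rem     about 225 CPU seconds used - an average of roughly 3% busy.
select  round(100 * 225 / (2 * 3600), 1) as pct_cpu_busy
from    dual;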

But that’s one of the drawbacks of Statspack – you lose the detail in the average. Perhaps the problem with this system is that the activity comes in short bursts and when the system is trying to do something it does it too slowly. The report, however, was labelled as a health check – so it’s possible that the owner has simply asked for a general quality review without having any specific performance complaints. So, since we have no specific leads to follow, let’s just look for inefficiencies.

There’s not a lot to say really as we scan through the load profile and wait events. We’ve already picked up the fact that there isn’t much going on and that the average I/O times are very poor. (Probably something to do with that RAID-5 configuration).

As we hit page 6 (Service Statistics) we note that the DBA has set the system up to use database services – a nice touch with 10g – and a large fraction of the work is coming through the appropriate service.

Then we’re on to the system statistics – and nothing exciting stands out until we hit “index fast full scans (fast)”. If this is an OLTP system that’s a slightly unusual and resource-intensive execution path – and we have done 2,941 of them in an hour.
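
If you had the live system available you could go looking for the statements responsible in the library cache; a rough sketch (using the 10g columns of v$sql_plan):

rem     Sketch: cached statements whose plans include an index fast full scan.
select  distinct s.sql_id, s.executions, s.sql_text
from    v$sql_plan p, v$sql s
where   p.operation = 'INDEX'
and     p.options like '%FAST FULL SCAN%'
and     s.sql_id       = p.sql_id
and     s.child_number = p.child_number;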

We take a brief pause at the parse count (total) – 72,000 in the hour, 20 per second – which is not a disaster (and only 627 required optimisation) but possibly some of them could be avoided. So we glance down the page to the figures for “session cursor cache hits” – and find there are none; so we look on down to the parameter list and find that the session_cached_cursors parameter is set to zero. That’s probably a mistake, and pretty safe to set to a value like 50. It will use a bit of memory (of which we have lots) and eliminate some of the CPU spent on parse calls.
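
The change itself is a one-liner – though session_cached_cursors is not dynamically modifiable at the system level in 10g, so it goes into the spfile for the next restart (the 50 is just the starting value suggested above):

rem     Check the current value ...
select  value from v$parameter where name = 'session_cached_cursors';

rem     ... then set it for the next restart (or test it with alter session).
alter system set session_cached_cursors = 50 scope = spfile;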

The physical reads we already know are few and fast – but we do note that we have some “physical reads cache prefetch”. This is usually about pre-emptive readahead for tablescans and index fast full scans – so we want to look out for those (and, of course, we’ve already seen the index fast full scans).

Working quickly through the redo figures: redo size + redo wastage is about 31M, and redo blocks written is 63,698 – so the redo block size is the very common 512 bytes. “redo synch writes” (the users’ calls to the log writer) is, at 9,280, very similar to “redo writes” at 9,678 (and both are similar to “user commits” later on) – so nothing of note going on there – and with an average of about 7 blocks per synch this looks like an OLTP type of system with quite small transactions.
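
The same arithmetic can be done directly against the instance statistics on a live system; a minimal sketch (it uses the cumulative figures since startup rather than a snapshot delta, but the ratio still comes out at the block size):

rem     Approximate redo block size:
rem     (redo size + redo wastage) / redo blocks written
rem     For this report: roughly 31M / 63,698 blocks - close to 512 bytes.
select  round((siz.value + was.value) / blk.value) as redo_block_size
from    v$sysstat siz, v$sysstat was, v$sysstat blk
where   siz.name = 'redo size'
and     was.name = 'redo wastage'
and     blk.name = 'redo blocks written';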

It’s not until we get to the table scan section that there’s anything particularly striking – and there we see 138,713 rows fetched by rowid, and 75 million fetched by table scan (with 1.17 million blocks fetched by table scan). Most of the tablescans were short – so were probably constantly buffered – with a couple of “long” tablescans. (Ignoring special cases, a “short” table is anything up to 2% of the buffer cache – which means about 2.6MB, or roughly 330 blocks).

If there’s anything in this report (apart from the slow disc configuration) that the client really needs to know it’s that they probably need to review their data and code to check that they have appropriate indexes to meet their requirements. And if the indexes are the correct indexes they need to look at the statistics to discover why Oracle isn’t using them.
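
Again, given access to the live system, the tablescan-heavy SQL is usually easy to find in the library cache; a rough sketch:

rem     Sketch: statements doing full tablescans, biggest logical I/O first.
select  s.buffer_gets, s.executions, s.sql_id,
        substr(s.sql_text, 1, 60) sql_text
from    v$sql s
where   s.sql_id in (
            select  sql_id
            from    v$sql_plan
            where   operation = 'TABLE ACCESS'
            and     options   = 'FULL'
        )
order by s.buffer_gets desc;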

The little bit about Instance Activity Stats on page 11 (combined with the redo size + redo wastage figures from earlier on) tells us that the redo log files are 10M – which is a little small for most systems. But maybe it’s going to stay a small system and the figure of 3 log switches per hour is in most people’s comfort zone. Possibly (glancing at the parameter list) we might mention the archive_lag_target parameter to the client, and explain how you can set that to ensure a log switch every N seconds.
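
Setting the parameter is trivial – the 1,800 seconds here is purely illustrative:

rem     Force a log switch at least every 30 minutes (value in seconds).
alter system set archive_lag_target = 1800 scope = both;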

The Buffer Pool Advisory on page 14 shows an interesting pattern – a very distinct step down in physical reads as the buffer cache changes from 60M to 72M. If you were just the wrong side of that step you’d almost certainly want to change the cache size. (Of course, we have such a large machine memory that it probably won’t make much difference to the performance since most of our reads are likely to be from local memory anyway).
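
The figures behind that section come from an advisory view you can also query directly; a sketch (assuming the default pool and an 8KB block size):

rem     Sketch: the data behind the Buffer Pool Advisory
rem     (db_cache_advice needs to be enabled).
select  size_for_estimate, size_factor, estd_physical_reads
from    v$db_cache_advice
where   name       = 'DEFAULT'
and     block_size = 8192
order by size_for_estimate;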

The PGA Aggr Histogram is always a useful little report – though in this case rather small. The figures we see here might prompt us to give a little warning about sorts/hashes that are a little large for a “real” OLTP system. But possibly the 1M – 2M workareas were for small reports rather than for the OLTP side of the activity. We note, in passing, that the pga_aggregate_target is only 96M, which is a bit small given that processes is set to 150. A comfortable figure for an OLTP system would be in the 2M – 4M range per process. The advisories are basically then suggesting giving a bit more memory to the pga and sga – which is something we can easily afford to do.
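
The sizing arithmetic is simple enough – 150 processes at 2M to 4M each suggests a target somewhere around 300M to 600M – and the matching advisory can also be queried directly; a sketch:

rem     Sketch: the data behind the PGA advisory.
rem     Rule of thumb: 2M - 4M per process; with processes = 150
rem     that suggests a pga_aggregate_target in the 300M - 600M range.
select  pga_target_for_estimate, pga_target_factor, estd_overalloc_count
from    v$pga_target_advice
order by pga_target_for_estimate;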

The next interesting spot appears in the Undo Segment Summary (page 19) – do we really trust that “Max Qry Len” of 1,665 seconds? Better make a little comment about it, just in case. It might be indicative of an important report; it might be a bad piece of code.

The latch activity is too low key to be interesting. The only slight anomaly is the relatively high number of gets on the “row cache objects” latch relative to the “library cache” latch. This can be indicative of excessive optimization (a.k.a. “hard” parsing) – but since we aren’t doing much hard parsing we carry on down to the Dictionary Cache Stats and notice that the higher numbers come from the dc_global_oids, dc_object_ids, and dc_users caches. This may be the slightly undesirable side effect you get from using the cast() operator with object table types, perhaps in the guise of pipelined functions. For a high performance, highly concurrent system that’s a bit of Oracle’s technology that I’m not too keen to see in action.

The Library Cache Activity shows a small number of reloads – possibly an indication that we should increase the size of the shared pool – possibly a hint that there’s some literal string code arriving that’s kicking more useful information out a bit too enthusiastically. But at the current work rates it doesn’t seem to be much of a threat. A quick check of the parameter list also shows that cursor_sharing is set to force, so we might want to check if some of the code has mixed binds with literals – which is a code style for which this directive is inapplicable in later versions of Oracle. (It’s also a directive that, strategically, should be eliminated).
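
One quick way of checking for literal-string SQL is to group the library cache by the “force matching” signature – although the check is most revealing if you can run for a while with cursor_sharing set back to exact; a sketch (the signature column appeared in 10gR2):

rem     Sketch: statements that become identical once literals are replaced.
rem     Large counts point at literal-string SQL that cursor_sharing = force
rem     is currently papering over.
select  force_matching_signature, count(*) cnt
from    v$sql
where   force_matching_signature > 0
group by force_matching_signature
having  count(*) > 10
order by cnt desc;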

Looking at the parameters – we may want to get rid of the db_file_multiblock_read_count as this is 10g, and ensure that the db_cache_size is always at least as large as 1M per process.

Summary

If you want a quick healthcheck on this database because you feel a little uneasy about the way it’s being developed then:

The disk write times are bad – possibly because it’s RAID-5.

Some of the queries are probably very inefficient because of indexing errors, or unsuitable statistics or code. You might as well use the machine memory you’ve got in the best place possible – which means putting more of it into the SGA and PGA. At present this probably won’t make much difference to performance because of the resources wasted in tablescans.

The session_cached_cursors parameter should probably be set to a non-zero value for a small win in CPU.

It would be a strategic target to make sure the coding strategies are correct, and then get rid of cursor_sharing = force.

Be a little cautious about using SQL that includes structures like cast(multiset(…) as user_defined_table_type)

That’s just a 90-minute review (including typing time) – so if you think there are other important things worth mentioning, or think I’ve mentioned something that’s irrelevant, then that’s what the comment section is for.

[Further Reading on Statspack]

Footnote: Any advice about reading statspack reports is almost always relevant when reading AWR reports (and vice versa).

21 Comments

  1. Beautifully explained – key areas to work on.

    Comment by Zaffer Khan — November 10, 2007 @ 12:26 pm GMT Nov 10,2007 | Reply

  2. Dear Jonathan,

    What is a usual figure for index fast full scans on an OLTP system in an hour?

    Why don’t the physical reads in the load profile include the other physical read figures, e.g. physical reads cache, physical reads cache prefetch, etc.?

    Best Regards,
    Joe

    Comment by Bunditj — November 10, 2007 @ 7:22 pm GMT Nov 10,2007 | Reply

  3. Joe, there is no “usual” figure for index fast full scans on an OLTP system. Like tablescans, index fast full scans aren’t automatically bad – but you don’t really expect to see many of them. Given that it’s a low-activity system (only 50 sessions live at start and end of snap, not much work being done) it’s just a little odd to see such a regular occurrence. (It could be one per minute per session – which might mean something interesting; but it might simply be a couple of sessions that wake up every few seconds to run a test loop of some sort – with this being the most efficient option).

    Your question on physical reads has two interpretations – so I’ll address both.

    If you mean “why doesn’t the load profile have extra lines for other read events” then I think the answer is ‘to keep it short’.

    If you mean “why doesn’t the figure include subtotals for the other kinds of reads” – it already does.

    In fact, the name is very misleading – it should be more like “blocks read from I/O subsystem” rather than “physical reads”, as the latter suggests it is counting read requests and not making any allowance for the size of the request.

    To make sense of “physical reads” you have to cross check all the statistics relating to physical reads, deciding which ones include which others; then check the wait events for anything to do with reads; and in some cases you have to finish off with a sanity check against the file I/O stats.
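
    A starting point for that cross-check is simply to list everything in sight; a sketch:

    rem     List the read-related statistics side by side, then work out
    rem     which figures are subsets of which.
    select  name, value
    from    v$sysstat
    where   name like 'physical read%'
    order by name;

    rem     ... and the corresponding wait events.
    select  event, total_waits, time_waited
    from    v$system_event
    where   event like '%read%'
    order by time_waited desc;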

    Comment by Jonathan Lewis — November 11, 2007 @ 3:30 pm GMT Nov 11,2007 | Reply

  4. “Looking at the parameters – we may want to get rid of the db_file_mulitblock_read_count as this is 10g, and ensure that the db_cache_size is always at least as large as 1M per process.”

    I hadn’t realised that this interpretation of DBMRC had changed in 10G. Your minor comment led me to re-read the DB reference:

    http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/initparams047.htm#sthref197

    If I read this documentation correctly, the RDBMS now defaults this value to the largest possible on the platform, but may also modify (dynamically?) the value according to the number of sessions on the system. The doco implies that the optimizer knows whether the setting is the default, or else a manually defined large value, and treats each differently.

    Can you maybe share a few words on this change in 10G?

    Regards,

    Comment by Mathew Butler — November 12, 2007 @ 6:09 pm GMT Nov 12,2007 | Reply

  5. Mathew,
    Strategically it makes sense to eliminate the db_file_multiblock_read_count from your parameter file in 10g. If it is not set, then Oracle uses two other (hidden) parameters in its place. One is the “run time” read count, the other is the “optimizer” read count.

    The run time read count (_db_file_exec_read_count) is set, as the manual says, at the largest value supported by the operating system with the proviso that Oracle “wants” every process (not session) to have a fair share of the buffer cache for tablescans and index fast full scans, so there is a limiting factor of “total cache size” / processes (the parameter) as the maximum size of a multiblock read. This is then expressed as a number of blocks using the default block size. It seems to be “dynamic” only in the sense that it is derived at startup; it doesn’t change from moment to moment as processes start and stop.

    The optimizer read count (_db_file_optimizer_read_count) is used by the optimizer to substitute for the MBRC if you are using “noworkload” system statistics. It seems to be fixed at 8 – a relatively low value which helps to reduce the risk of the optimizer choosing an insane tablescan/hash join path when an index-driven path would be more efficient.
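
    If you want to see the derived values, the usual (unsupported) x$ query does the job – it has to be run as SYS, so treat it strictly as a sketch:

    rem     Unsupported, SYS-only: the two hidden read-count parameters.
    select  i.ksppinm  name, v.ksppstvl  value
    from    x$ksppi i, x$ksppcv v
    where   i.indx = v.indx
    and     i.ksppinm in (
                '_db_file_exec_read_count',
                '_db_file_optimizer_read_count'
            );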

    Comment by Jonathan Lewis — November 16, 2007 @ 9:00 am GMT Nov 16,2007 | Reply

  6. >> I have since found on Burleson’s website the report from which this extract was taken.

    You have a lot of nerve copying my personal content onto your web site (Although I understand it, since you have none of your own to share). GET A LIFE.

    This content was clearly labelled “BEI CONFIDENTIAL”, it was not for you to steal, and only a really unprofessional person would disclose it.

    I have half a mind to tell the client who owns the report. Trust me, it would be very, very unpleasant.

    Effective today, you will remove ALL my property from your website IMMEDIATELY.

    Comment by Donald K. Burleson — November 17, 2007 @ 3:36 pm GMT Nov 17,2007 | Reply

  7. Don,

    If you look closely, you will see that I have not copied the report – I have supplied two URLs: one directs the reader to the report, the other directs the reader to the posting in your forum where you link to the report and invite the world at large to read it (Edited for emphasis, see comment 9 – jpl). So does that make you a really unprofessional person for disclosing the report as well ?

    … “the client who owns the report” … “remove ALL my property”..

    Make your mind up: is the report your property or is it your client’s ? If it’s yours what difference does it make if you tell the client; and if it’s your client’s, who’s going to be on the end of the unpleasantness when you tell them that I created a link to a report that you published for public review ?

    I note, in passing, that your original log buffer article has changed again – where you once had the comment “(This is the database as I found it, and there were lots of other issues too):” you now have the comment “(This is the database as I found it, and there was a serious data buffer shortage causing excessive disk I/O):”.

    I’m not surprised you’re cross that I’ve drawn attention to the report, given that it shows 1.03 physical reads per second and 2.80 physical writes per second – which hardly seems an excessive amount of disk I/O.

    Funnily enough, one of your responses to my original post was that I couldn’t possibly guess what the problem might be on such a small amount of evidence. On the other hand, the diagnosis in your forum article reads: “even though I have identified the issue as a disk update bottleneck (RAID 5)…”. So it looks like we got to the same diagnosis – even though I had so little to work on.

    Comment by Jonathan Lewis — November 17, 2007 @ 6:02 pm GMT Nov 17,2007 | Reply

  8. >> So does that make you a really unprofessional person for disclosing the report as well?

    Yes, it does. I would never reproduce an un-linked Word doc that was marked CONFIDENTIAL on every page.

    If you’re such an in-demand tuning expert, how come you need to pinch my data?

    Comment by Donald K. Burleson — November 17, 2007 @ 9:52 pm GMT Nov 17,2007 | Reply

  9. >> I’m not surprised you’re cross that I’ve drawn attention to the report

    Oh my, you are quite full of yourself, I see. Does this sound familiar?

    http://www.dba-oracle.com/t_narcissistic_personality_disorders.htm

    I’m not surprised that you are forced to lift other people’s reports, since you don’t have any STATSPACK reports of your own to talk about.

    Comment by Donald K. Burleson — November 17, 2007 @ 10:00 pm GMT Nov 17,2007 | Reply

  10. >> So it looks like we got to the same diagnosis – even though I had so little to work on.

    There is a BIG difference. My diagnosis was based on lots more information. You just guessed, that’s reckless.

    Will there be a Jonathan Lewis Psychic Hotline soon?

    Comment by Donald K. Burleson — November 17, 2007 @ 10:12 pm GMT Nov 17,2007 | Reply

  11. Jonathan,

    Thanks greatly for the reply. This is really useful information and a very clear explanation of how it works. Now I just need to find the time to test it!

    I need to give some thought to the tuning implications here for systems that have a small number of components that need to do large FTS, but where memory may be limited, so setting a large enough db_cache_size may not be feasible for this set-up… of course, DBFMRC is session-settable so it could always be overridden at the session level to get the required behaviour.

    If such a session override is sensible it reminds me of the similar approach for large sorts/hash joins when the 5% of PGA_AGG_TARGET allocated per session just isn’t enough and manually setting SORT_AREA_SIZE etc. is required.

    I note that this was documented on Oracle-l quite some time ago – mid 2006 – via a post from you (and others).

    You should be commended for your ongoing contributions to the Oracle community.

    Best Regards,

    Mat.

    Comment by Mathew Butler — November 18, 2007 @ 10:52 pm GMT Nov 18,2007 | Reply

  12. Just a brief update to say that the original word report has been removed from Don’s site (though not yet the links to it from various of his own web sites) meaning that your excellent advice at the start of the article is now not possible to follow.

    Comment by Niall Litchfield — November 20, 2007 @ 4:02 pm GMT Nov 20,2007 | Reply

  13. If we eliminate the db_file_multiblock_read_count isn’t there a discrepancy between the number of disk accesses the optimizer thinks it will have to do (based on the default 8 blocks of _db_file_optimizer_read_count) and the real number of accesses it does when executing the query?

    Best Regards,
    Davide

    Comment by Davide — December 24, 2007 @ 2:11 pm GMT Dec 24,2007 | Reply

  14. Davide,

    Unfortunately this is true; and the difference between the setting and the actual average read size has been a problem for the optimizer for many years. (If you set a multiblock read size of 128, say, the optimizer would “adjust” this to a value around 40 for the purposes of doing the arithmetic – but who knows what might really happen at run-time ?).

    The effect of the defaults for the exec and optimizer read counts will cause the optimizer to increase the relative costs of tablescans. If you view this as Oracle Corp’s attempt to minimise the number of queries that do the wrong thing, it’s probably a good strategy. Moreover, if a query uses an indexed access path when it should be doing a tablescan, putting in hints to do tablescans is probably going to be a less fragile fix than trying to hint a query that does tablescans when it should be using indexes.

    So it’s a compromise – but I think it’s probably a statistically wise compromise.

    Comment by Jonathan Lewis — December 26, 2007 @ 11:44 am GMT Dec 26,2007 | Reply

  15. I am curious if the report was clearly marked confidential when you started working on it? If so, most interpretations of data privacy laws, including those the criminal courts have accepted, would say that you have broken the law by accessing it when you are not authorized to do so. I shudder to think what your civil liability would be for publishing the analysis.

    Comment by gocats — December 30, 2007 @ 11:47 pm GMT Dec 30,2007 | Reply

  16. Gocats,

    You only have to read this sentence from the blog to recognise the flaw in your argument:

    The report is a Word document (dated 2nd June 2005) at this URL, and it is referenced from this thread (dated 15th June 2005) on the Burleson Consulting Oracle Forum;

    Comment by Jonathan Lewis — December 31, 2007 @ 7:04 pm GMT Dec 31,2007 | Reply

  17. I suggest you read up on the law. The fact that the document was available does not make it fair game. That’s the whole point of the anti-hacking laws. Just because someone can access data does not mean it is legal for that person to access the data. The way the law has been interpreted, or at least the way I read the opinions, is: if it was the clear intent of Burleson for it to remain private (and marking it confidential would make that intent clear), it is not legal for you to read and comment on the document.

    Just because someone leaves the front door open doesn’t mean it’s legal for you to go in and sleep on the couch.

    Comment by gocats — December 31, 2007 @ 11:09 pm GMT Dec 31,2007 | Reply

  18. Gocats,

    Looks like you still haven’t read the blog. Here’s a quote from Don Burleson’s own entry on his own forum (from 15th June 2005 until at least 4th Nov 2007):

    “Now I haven’t had time to pour-over this AWR report in-detail, but I’m inclinded to agree in this case, that manual assignment of db_cache_size is appropriate.

    Here is the report:
    http://www.dbaoracle.com/reports/awr_report1.doc”

    I don’t think many people would read that and say that the intent was for the report to remain private. (The link no longer works – as I pointed out when I updated my blog entry)

    By 11th Nov 2007 that comment had been edited out from Burleson’s posting, and from a posting by another user who had quoted that part of Burleson’s post.

    Comment by Jonathan Lewis — December 31, 2007 @ 11:42 pm GMT Dec 31,2007 | Reply

  19. Hello Jonathan,

    I have a database with following behaviour.

    Only one SELECT statement is using 60% of the CPU; all the others look normal, using 3-5%. I dug into that SELECT statement and found it has a SELECT DISTINCT… UNION SELECT DISTINCT… kind of format. Looking at the execution plan, I can see that all steps use CPU normally except the last one (in order of execution): the SORT UNIQUE accounts for 85% of the CPU used by that statement.

    There are some suggestions, like using UNION ALL instead of UNION, and not using DISTINCT.

    The second thing is, I wanted to check how that statement is parsed and executed. The statement was executed 4,000 times: hard parsed once and soft parsed 3,999 times. That means it is soft parsing every time it executes, so the question is how to reduce that. I think it is causing some library cache contention, as I see that among the top 5 events in the AWR report.

    Then I looked at SESSION_CACHED_CURSORS, which is set to 150. In the AWR report, “session cursor cache hits” shows 67,350 in total (142.01 per second, 8.04 per transaction) – this is based on an 8 minute AWR snapshot.

    Now, how do I find out whether SESSION_CACHED_CURSORS is set appropriately or needs to be increased? Please provide some direction about how to decide on/tune SESSION_CACHED_CURSORS.

    We cannot change the application because it is a third-party application, so how do we tune this?

    Please shed some light on this topic.

    Thanks

    Comment by Keyur — January 22, 2008 @ 7:34 pm GMT Jan 22,2008 | Reply

  20. Keyur, please note that I don’t do requests – this isn’t AskTom.

    However – the UNION / DISTINCT: if you take out the DISTINCT and turn the UNION into a UNION ALL, you will have changed the logic of the query so, unless the need for uniqueness was an error, you can’t make those changes.

    Note – you haven’t mentioned the version of Oracle but most versions will postpone the DISTINCT operator by doing a UNION ALL and then applying the SORT UNIQUE.

    As far as setting session_cached_cursors is concerned, a simple guideline is to increase it until the statistics about cached cursors suggest that you will get minimal extra benefit from increasing the cache further. One way of viewing this is to see if lots of sessions have reached the limit and are still doing lots of parses which are not “parse count (hard)”.
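
    The figures to watch are the session-level statistics; a sketch of the kind of check I mean:

    rem     Per session: how full the cursor cache is, how often it hits,
    rem     and how much parsing is still going on.
    select  ss.sid, sn.name, ss.value
    from    v$sesstat ss, v$statname sn
    where   sn.name in (
                'session cursor cache count',
                'session cursor cache hits',
                'parse count (total)',
                'parse count (hard)'
            )
    and     ss.statistic# = sn.statistic#
    order by ss.sid, sn.name;

    If most sessions show a “session cursor cache count” stuck at the parameter value while the soft parse figures keep climbing, a bigger cache is worth testing.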

    Comment by Jonathan Lewis — January 22, 2008 @ 8:49 pm GMT Jan 22,2008 | Reply

  21. […] Original article: https://jonathanlewis.wordpress.com/2007/11/10/analysing-statspack8/ […]

    Pingback by xpchild » analysing statspack 8 — December 16, 2011 @ 8:15 am GMT Dec 16,2011 | Reply

