Oracle Scratchpad

May 30, 2008

Nothing Changed

Filed under: CBO,Performance,Statistics,Tuning — Jonathan Lewis @ 10:42 am BST May 30,2008

The following request appeared on the Oracle Forum a few days ago:

I have a select query

select col1,col2,col3 from table1 order by col1,col3

This table initially contained 4.5 million records and the select was returning records in less than 2 minutes.

This query has been running for the last year without any issues, but since last week it has been taking more than 15 minutes to complete. No change in the database or other components.

How can I find out the root cause of this issue? Any specific area I need to check?

There are two important points in this request – first the statement that “nothing changed”, and second the need to discover “what was happening in the past”.

Obviously something did change. It may have been the data, it may have been the statistics, or it may have been both. (And it may have been the index definitions that changed, or an index may have been rebuilt, or some parameter may have been adjusted.)

Since the query is now persistently slow, the most innocent explanation is that the data has changed; the second most innocent explanation is that the statistics have changed. The former could change the execution time, the latter could change the execution plan – which could be a poor plan and change the execution time.

The query (as supplied) is not selective. It’s got to acquire 4.5 million rows and supply them in order. There are only a few execution paths for the query, and these depend on the available indexes and constraints:

  1. Full tablescan with sort
  2. Index fast full scan with sort (if an index holding all three columns exists)
  3. Index full scan with no visit to table and no sort if the index in (2) starts (col1, col3)
  4. Index full scan with table access by rowid (and no sort) if an index starts (col1, col3) but doesn’t hold col2
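To make the four options concrete, here is a sketch of the index definitions that would enable each path. The table and index names are illustrative only – nothing in the original question tells us what indexes actually exist.

```sql
-- Hypothetical indexes on table1(col1, col2, col3); names are invented.

-- Enables option 2 (index fast full scan + sort): the index holds all
-- three columns, but not in an order that matches the ORDER BY clause.
create index t1_all_cols on table1(col2, col1, col3);

-- Enables option 3 (index full scan, no table visit, no sort): the
-- leading columns match the ORDER BY and col2 is carried in the index.
create index t1_ordered on table1(col1, col3, col2);

-- Enables option 4 (index full scan + table access by rowid, no sort):
-- correctly ordered, but col2 has to be fetched from the table.
create index t1_partial on table1(col1, col3);
```

With no index at all, option 1 (full tablescan with sort) is the only path available.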

Increase the amount of data and the same plan could take a lot longer: a sort might spill to disk instead of completing in memory; a full scan with or without table access might lose a caching benefit if the index (option 3) or table (option 4) cannot be kept in memory while the scan progresses.

So how do you find out what was happening when things went well? Possibly the documentation includes a statement of the expected execution plan. Possibly you’ve been running statspack or AWR and this was a query that was captured with its execution plan. Possibly the only thing you can do is guess.

Work out all the possible execution plans for the query, then use hints to check the performance for each execution plan.  If one of the plans drops back to 2 minutes, then maybe that’s the plan you used to have and you work out the root cause by comparing the old plan, the new plan, and the costs.
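A sketch of what that hint-driven testing might look like, assuming the hypothetical index names used above (hints are requests to the optimizer, so confirm with DBMS_XPLAN that each run really produced the intended plan):

```sql
-- Run each candidate and compare elapsed times; index names are invented.
set timing on

select /*+ full(t) */ col1, col2, col3              -- option 1: tablescan + sort
from   table1 t
order by col1, col3;

select /*+ index_ffs(t t1_all_cols) */ col1, col2, col3  -- option 2: fast full scan + sort
from   table1 t
order by col1, col3;

select /*+ index(t t1_ordered) */ col1, col2, col3  -- option 3: full scan, no sort
from   table1 t
order by col1, col3;

-- Check which plan actually ran:
select * from table(dbms_xplan.display_cursor);
```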

If none of the other plans gets you close to the original execution time, then something has changed on your system. (Although, possibly, you aren’t doing the tests fairly).

13 Comments

  1. This is not a direct comment on the post; rather, I have a question.

    These kinds of experiences, where something that used to perform well suddenly performs badly, have become quite common, and many people complain about them.

    Do you suggest any design strategy that anticipates these issues? I am not sure if this works, but could the statistics be captured (after analyzing) on a schedule, say every week, perhaps along with the execution plan? Would that help? When performance starts to degrade, we could then compare the current statistics with the history.

    If this is a viable option, how would you suggest designing the collection mechanism?

    Thanks!

    Comment by Karteek — May 31, 2008 @ 9:12 am BST May 31,2008 | Reply

  2. Reminds me of the time our sys admins were so proud that they managed to migrate us from fast expensive storage to slow cheap storage without

    a) an outage
    b) any notification
    c) any clue of the ramifications that subsequently followed…

    :-)

    Comment by Connor McDonald — June 2, 2008 @ 5:11 am BST Jun 2,2008 | Reply

  3. The phrase “This query has been running for the last year with out any issues, but from last week onwards this query is taking more than 15 min to complete” looks interesting!

    And if you read “No change in database and other components” as “no new tables, indexes, or any other modification to the server”, you can cut out some possibilities.

    It looks like a data problem, i.e. the table used to be a “little table” but since last week it has become a “big table”, so it no longer fits in the cache…

    Another cause could be the indexes: perhaps they need a rebuild, but in that case the table has grown and there has been some huge update activity.

    Bye.

    Comment by lascoltodelvenerdi — June 3, 2008 @ 10:28 am BST Jun 3,2008 | Reply

  4. I think the words “nothing has changed” in a “why did performance tank?” question, should be added to a list of famous last words (if there is such a thing). It always makes me think of the last words attributed to composer Ludwig van Beethoven: “Friends applaud, the comedy is finished.” So, now every time someone says “nothing changed”, I simply want to applaud and laugh. But, given that you can’t really do that in a client or colleague’s face without some pretty serious possible consequences, I’d say Jonathan is spot on about just having to work your way through the possibilities where change could have occurred until you find the answer.

    Comment by Karen Morton — June 4, 2008 @ 3:58 am BST Jun 4,2008 | Reply

    “So how do you find out what was happening when things went well?”
    But even if this is possible, is it actually the best approach to resolving the problem?
    Saying “nothing changed” really means “I didn’t change anything”, so there’s a good chance you can’t undo whatever did change (e.g. which blocks Oracle chose to sample in DBMS_STATS, or the order or nature of fresh incoming data).
    Or you find out the plan changed, but it leaves you none the wiser about why a different plan is being chosen. Or the plan changed because the clustering factor changed, but you don’t know what caused that.
    And even if you know the root-cause change (e.g. an installed patchset), you may still not be able to go back.

    On the other hand, you can take a ‘here and now’ tuning approach and look at what the SQL is doing now, consider what alternatives are available and how they might be better. You don’t have different approaches for brand-new SQL (where you don’t have a history), SQL that has changed (where there was a history for the old SQL but you may not be able to find it), SQL that hasn’t changed where the plan has, or the plan is the same but there’s more IO waits, or the plan and IO is the same but there’s more idle waits…

    PS. For the OP, I wonder where the 4.5 million rows of result set were going. Maybe the holdup was in the network, writing to a destination file….

    Comment by Gary — June 4, 2008 @ 5:40 am BST Jun 4,2008 | Reply

  6. @Gary

    “where the 4.5 million rows of result set were going”

    You pointed out a good one!

    Many times we work in a multi-tier environment and stuck with the DB while the problem is elsewhere.

    Comment by lascoltodelvenerdi — June 4, 2008 @ 6:48 am BST Jun 4,2008 | Reply

  7. “No change in database and other components”
    Rule No 1 – The only constant is change.

    Comment by Slater — June 4, 2008 @ 11:09 am BST Jun 4,2008 | Reply

  8. Karteek,

    Certainly one of the pre-emptive measures is to ensure that any time you collect statistics, you back up the old statistics. See https://jonathanlewis.wordpress.com/2006/12/03/saving-statistics/ for starters. Another pre-emptive strategy is to accumulate a library of execution plans for every query – but since this is very labour-intensive you may have to start with the more resource-intensive queries and work downwards. One low-cost, but limited, strategy is to collect the sql_id (or hash_value) and plan_hash_value for statements in the library cache on a fairly regular basis – and check on statements that show intermittent changes in plans.
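    A sketch of both strategies follows; the owner, table, and capture-table names are placeholders, not anything from the post:

```sql
-- (a) Preserve the current statistics before regathering them,
--     using the date as the statid so old versions are easy to find.
begin
  dbms_stats.create_stat_table(ownname => 'SCOTT', stattab => 'STATS_HISTORY');
  dbms_stats.export_table_stats(
    ownname => 'SCOTT',
    tabname => 'TABLE1',
    stattab => 'STATS_HISTORY',
    statid  => 'D20080610'          -- date-based label for this snapshot
  );
end;
/

-- (b) Capture sql_id and plan_hash_value regularly into your own
--     history table (plan_capture is hypothetical), then look for
--     statements whose plan_hash_value varies between snapshots.
insert into plan_capture (snap_time, sql_id, plan_hash_value)
select sysdate, sql_id, plan_hash_value
from   v$sql;
```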

    lascoltodelvenerdi,
    The “short table” that became a “long table” is a nice example – it only takes one row for the high water mark to move, although you would still have to update the statistics if they had been previously collected. See this as an example: https://jonathanlewis.wordpress.com/2007/07/29/nls/

    Gary,
    Obviously I’m not going to disagree with the “fix the problem that’s here now” approach – but what we’re talking about is really a minor change in emphasis. You may want to work out an efficient execution plan from scratch – but if I find the documentation that describes what the execution path should be, or the listing of the last known good execution plan, then I have the chance of fixing the immediate problem much faster than someone working out a good plan from first principles.

    Either way – I think we would both compare the resulting good plan with the bad plan to work out why the good plan didn’t appear automatically since the change may indicate an inherent instability that needs to be addressed more carefully.

    lascoltodelvenerdi,
    I remember working at a company where one person wheeled a trolley loaded with computer print-out into the trading room at the start of each day, and another person wheeled it out to the shredder at the end of each day – and no-one looked at a single sheet of it during the entire day. There was a very easy option to improve the performance of the overnight batch run.

    Comment by Jonathan Lewis — June 10, 2008 @ 9:59 am BST Jun 10,2008 | Reply

  9. Thanks Jonathan…it helped me! Choosing date as statid was a good idea.

    Comment by Karteek — June 13, 2008 @ 2:17 am BST Jun 13,2008 | Reply

  10. Wondering whether another thing that could have changed is the ‘number of users’. Since this involves a significant sort, is it not possible that an increase in user count could affect the PGA size, thereby potentially moving from in-memory to n-pass sorts?

    Comment by Hans — July 20, 2008 @ 3:49 am BST Jul 20,2008 | Reply

  11. Hans,

    A valid point. A larger number of users would result in a global reduction of the PGA memory available for workarea operations.

    This wouldn’t change the plan but, as you point out, could mean that at run-time a query might change from in-memory to one-pass, or from one-pass to multipass because of a global limit imposed by the target.
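    One way to check for this, sketched here against the dynamic performance views (the sql_id is a placeholder you would substitute):

```sql
-- How often did this statement's sort complete in memory (optimal),
-- spill once (one-pass), or spill repeatedly (multi-pass)?
select operation_type,
       policy,
       optimal_executions,
       onepass_executions,
       multipasses_executions,
       last_memory_used
from   v$sql_workarea
where  sql_id = '&sql_id'
and    operation_type like 'SORT%';
```

    A non-zero multipasses_executions count on a query whose plan hasn’t changed is a strong hint that memory pressure, not the optimizer, is the problem.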

    Comment by Jonathan Lewis — July 20, 2008 @ 8:56 am BST Jul 20,2008 | Reply

  12. […] moments to let you know about a client I visited a few weeks ago who didn’t give me the usual “nothing’s changed” […]

    Pingback by Something Changed « Oracle Scratchpad — March 19, 2010 @ 9:50 pm GMT Mar 19,2010 | Reply

  13. […] I’ll have to go back a couple of versions of Oracle to repeat these tests – maybe this behaviour with user_indexes.distinct_keys in place is relatively recent, but it’s another reason why execution plans may change suddenly and badly as time passes when “nothing changed”. […]

    Pingback by Index out of range | Oracle Scratchpad — March 27, 2017 @ 8:43 am BST Mar 27,2017 | Reply

