Oracle Scratchpad

December 23, 2013

Random slowdown

Filed under: Oracle, Troubleshooting — Jonathan Lewis @ 6:48 pm BST Dec 23, 2013

One of the old, old questions came up on OTN a few days ago: “What reasons are there for a piece of (batch) SQL to run very slowly occasionally when it usually runs very quickly?” Then a similar one appeared a couple of days later. There are plenty of reasons, so I thought I’d start listing a few and see how many other (realistic) reasons other people wanted to add.

  1. A change in execution plan due to unlucky bind variable peeking with histograms (see the sketch below this list)
  2. A change in execution plan because the statistics have been slowly going out of sync with the data/query
    1. Special case: highly volatile tables, where the data/query/stats can go badly out of sync very quickly
    2. Special case: dynamic sampling used to bypass volatility problems – but sometimes the sample is an “unlucky” one.
  3. A change in execution plan because new (and unlucky) statistics have been generated since the last time the query ran.
  4. A change in execution plan because a few new, empty, partitions have been added to a critical table
  5. An unlucky change in execution plan because a small change in actual data volume (with correct stats in place) can have a dramatic impact on the shape of the plan
    1. Variant – the plan stays the same, but the execution mechanism changes because a table has grown and pushes Oracle into direct path reads
    2. Variant – as 5.1, but the change in mechanism is not due to the size of the table, but the current degree of caching of the table (see also 10)
  6. A change in execution plan because you had an incomplete set of hints in your SQL and your luck just ran out
  7. An unlucky change in data order that has a big impact on the success of subquery caching
    1. Special case – highly volatile tables being used as the target of existence subqueries (or “first_row” optimisation)
      1. Can be due to the actual placement of the “useful” data in the table compared to the optimizer’s “average” model.
      2. Can be due to stats and data not matching very well
  8. A small slippage in timing that results in the SQL suffering a direct data collision (locking / read-consistency) with some other process.
    1. Special case – a small slowdown resulting in the process being restarted, possibly many times, and colliding with copies of itself
  9. A small slippage in timing that leaves the SQL running concurrently with something that is processing unrelated data but competing for the raw (disc / CPU / PX Slave) resources.
    1. Interesting example – competition for the queue on an HBA, especially when the queue depth is small
    2. Special variation – timing problems due to month-end/year-end increasing volume to be processed and adding extra jobs
    3. Common variant – even without a time slip – you may simply be competing with some other system that’s sharing the same physical resources
  10. A small slippage in timing that means some other process cleared everything you wanted from the buffer cache before your SQL started running.
    1. RAC variant – the data is cached in another node because e.g. some other process ran on a different node
  11. A small slippage in timing that means a process that normally follows your SQL preceded it, and you’re spending all your time on delayed block cleanout.
  12. Someone dropped some indexes (or made them invalid)
  13. Someone rebuilt one or more indexes since the last time the SQL executed
  14. Someone has done a “shrink space” on a critical table since the last time you ran the SQL
    1. this could be due to a change in execution plan due to a change in index clustering factors
    2. it could be due to a change in physical I/O requirements because of a change in data placement – even with the same execution plan
    3. it could be due to the subquery-caching effect due to a change in data access order – even with the same execution plan
  15. A space management, or similar, problem suspended processing for a while – e.g. archiver stuck, temp space full for resumable session.
  16. You’re operating over a database link, and the problem is either in the connection, or at the far end of the link
    1. The problem may also be about execution plan changes that have a more extreme side effect because the plan includes a remote database
  17. Problems with SQL Baselines or profiles (profiles going out of date, or a new Baseline being accepted – perhaps by the 12c automatic job – that blocks a previously stable plan)
  18. It’s not the SQL, it’s whatever is calling the SQL (surrounding pl/sql, network to client, client data handling)
  19. The job was running in a resource group and a change in workload of other processes made the resource manager de-schedule this process even though its workload hadn’t changed
  20. The SQL hasn’t changed plan, or volume of data accessed, but it calls a pl/sql function that executes some SQL that has changed its behaviour for some reason.

I’ll add more as I (or commentators) think of them.
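
To illustrate reason 1, here’s a minimal sketch – all object names invented – of how bind variable peeking plus a histogram can make one statement flip between plans depending on which run happened to hard parse first:

    rem  Hypothetical demo.  One value is very common, one very rare;
    rem  the histogram lets the optimizer see the skew, so the plan
    rem  depends on which bind value is peeked at the first hard parse.

    create table t1
    as
    select
            case when rownum <= 99000 then 1 else 2 end     skewed_col,
            rpad('x',100)                                   padding
    from    dual
    connect by level <= 100000
    ;

    create index t1_i1 on t1(skewed_col);

    begin
            dbms_stats.gather_table_stats(
                    ownname    => user,
                    tabname    => 'T1',
                    method_opt => 'for columns skewed_col size 254'
            );
    end;
    /

    variable b1 number

    exec :b1 := 2
    select count(padding) from t1 where skewed_col = :b1;  -- rare value peeked: index range scan

    exec :b1 := 1
    select count(padding) from t1 where skewed_col = :b1;  -- may reuse the "rare value" plan to visit 99% of
                                                           -- the table (adaptive cursor sharing in 11g+ may
                                                           -- eventually correct this)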

43 Comments

  1. Hi Jonathan,

    If you’ve been doing Oracle tuning long enough, each of these has probably hit you at one time or another.

    I’m interested in some advice on the technique you used to troubleshoot them. For example, the first 3 are notoriously bad. If you have a Java or .Net application sending in thousands of SQLs, how do you capture which one went off the deep end without turning on SQL Trace for them all and combing through gigabytes of trace data?

    If I have access to the source code, then I have a few more options. When you’re brought in as a consultant, you usually don’t have that option. What do you do?
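
    The closest I’ve got so far is a periodic sweep of the shared pool for statements whose average elapsed time per execution looks out of line – a rough sketch (the one-second threshold is arbitrary):

        -- Statements currently averaging more than an (arbitrary) second
        -- per execution.  With a Diagnostics Pack licence you could use
        -- dba_hist_sqlstat to compare against history instead of a
        -- single point in time.
        select
                sql_id,
                executions,
                round(elapsed_time / executions / 1e6, 2)  avg_seconds,
                buffer_gets,
                substr(sql_text, 1, 60)                    sql_text
        from    v$sql
        where   executions > 0
        and     elapsed_time / executions > 1e6
        order by
                elapsed_time / executions desc;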

    Regards,
    -Peter

    Comment by Peter Tran — December 23, 2013 @ 7:07 pm BST Dec 23,2013 | Reply

  2. A pseudo-random one to add to your very fine list (and this IS one of those things that unfortunately seems to come down to a laundry list of things to check, at which one may fail even if only because a new reason crops up): archiver stuck. Sometimes this will slap your DBA in the face, but if someone has automated adding more space to the archive destination, or relocation to a new destination, it can end up as something that is “handled” and therefore not on the routine reports, even though it may have held things up for an arbitrary period if the solution required some provisioning and asynchronous intervention.

    As for troubleshooting solutions, for me the start is usually a quick look at things that change plans, or verifying that the same plan is in fact in use, then going on to what is consuming service time and wait time in the new, slower run.

    Of course it was a big advance when it became possible to “unstick” the archiver (quite some time ago), but it did inject this new event into the equation, regarding “stuck time” you do not usually investigate.

    Comment by rsiz — December 23, 2013 @ 7:30 pm BST Dec 23,2013 | Reply

  3. Jonathan,
    another one encountered: the service has moved to another node in RAC, and the data is no longer “local” (related to 10)
    Regards,
    Eric

    Comment by Eric Grancher — December 23, 2013 @ 8:13 pm BST Dec 23,2013 | Reply

  4. For repetitively running SQL: whatever set it off originally is now compounded by multiple simultaneous runs. (I had your #8 in mind when I posted in that second thread, before I saw this post.)

    Comment by jgarry — December 23, 2013 @ 8:33 pm BST Dec 23,2013 | Reply

  5. Someone pressed the Turbo Button on the server…

    http://en.wikipedia.org/wiki/Turbo_button

    Comment by Reinhard Hillefeld — December 23, 2013 @ 8:48 pm BST Dec 23,2013 | Reply

  6. We’ve recently started hitting #11 a lot. Is there a tuning document someone can point me at that describes how we can have a background process handle the delayed block cleanouts, so that the second statement doesn’t incur the cost of performing them?

    Comment by Roger — December 23, 2013 @ 9:52 pm BST Dec 23,2013 | Reply

    • Roger,

      I don’t think I’ve seen a current document that would be sure to help.

      The “classic” solution is to know which bits of your application are likely to generate a lot of blocks that will need delayed block cleanout and include some code to revisit the dirty blocks after the process is complete. Unfortunately the latest versions of Oracle seem to have got a bit clever about caching transaction commit times and can avoid doing cleanout in some circumstances which I haven’t yet classified.
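
      As a minimal sketch of that approach (table and column names invented), the batch job that made the big change commits and then immediately re-scans the blocks it dirtied, so it pays the cleanout cost itself rather than leaving it to the next, possibly time-critical, reader:

        -- Hypothetical batch step: big conventional update, commit, then
        -- a forced full scan by the same session so that this session
        -- does the delayed block cleanout.
        update batch_target set processed_flag = 'Y' where processed_flag = 'N';
        commit;

        select /*+ full(t) */ count(*) from batch_target t;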

      Comment by Jonathan Lewis — December 31, 2013 @ 3:12 pm BST Dec 31,2013 | Reply

  7. If the database server is sensitive to queue length changes (e.g. a really low maximum queue depth on the LUNs), then the size of the tables in the query can become really important. If the size jumps over a threshold such that the SGA database cache insertion point for the blocks changes, and the blocks are used by many queries, many of those queries may get considerably lower cache hit ratios and the total read operations can skyrocket. Along the same lines, if the table crosses the small table threshold and the execution engine chooses direct path read full table scans of a table used by many concurrent queries, the repetitive reads of the same blocks can send the total read operations skyrocketing and overwhelm the queue depth.
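
    A quick way to get a feel for whether a system has tipped over that threshold is to watch the relevant scan statistics over time – a sketch:

        -- Relative growth of these statistics hints at full scans
        -- switching between buffered and direct path reads.
        select  name, value
        from    v$sysstat
        where   name in (
                        'physical reads direct',
                        'table scans (direct read)',
                        'table scans (long tables)',
                        'table scans (short tables)'
                );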

    Comment by lonnyniederstadt — December 23, 2013 @ 11:36 pm BST Dec 23,2013 | Reply

    • lonnyniederstadt

      I’ve added 5.1/5.2 for your comment about direct path reads, and 9.1 for the queue depth one (which I have seen causing an astonishing delay in the past).

      Comment by Jonathan Lewis — December 31, 2013 @ 3:18 pm BST Dec 31,2013 | Reply

  8. How about: A change was made to one of the many parameters that control parallel operations.
    parallel_max_servers, parallel_degree_policy, parallel_force_local (RAC), etc.
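
    A quick sanity check when you suspect this – a sketch:

        -- Which parallel parameters are set, and which have been
        -- changed away from their defaults?
        select  name, value, isdefault, ismodified
        from    v$parameter
        where   name like 'parallel%'
        order by
                name;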

    Comment by jkstill — December 24, 2013 @ 12:13 am BST Dec 24,2013 | Reply

  9. then of course…there’s all those external things that you can at least see…

    - the backup runs into the batch run cycle, or commences too early
    - something OTHER than oracle on the machine is sucking up all the resources

    then the really external things that you cannot see

    - some other VM is hosing the same physical gear
    - someone has diddled with your VM
    - someone has diddled with your storage
    - someone has diddled with your path to storage

    the list goes on and on :-)

    Comment by connormcdonald — December 24, 2013 @ 12:20 am BST Dec 24,2013 | Reply

    • Connor,

      I think I’ll count that lot under 9.

      Having someone mess up the machine configuration behind your back probably doesn’t fall into the category of “random slowdown” though (I hope) – I think some of your suggestions fall into: “the system’s wrecked – who’s been messing with it”.

      Comment by Jonathan Lewis — December 31, 2013 @ 3:23 pm BST Dec 31,2013 | Reply

  10. How about the simple things first? As in: “It’s the EOM/EOY, and data has significantly increased in volume”.
    Sometimes it pays to check the easy stuff first.

    Comment by Nuno Pinto do Souto — December 24, 2013 @ 12:45 am BST Dec 24,2013 | Reply

    • Nuno,

      I think this probably falls into 9 (unrelated competition) – but it’s a good one to raise as a special (easy to check) case, so I’ve added it at 9.2

      Comment by Jonathan Lewis — December 31, 2013 @ 3:30 pm BST Dec 31,2013 | Reply

  11. Hmmm, OK: “Something (or many things) changed, we just need to find out which changes are responsible for these issues.”
    Back to square 1.

    Comment by jkstill — December 24, 2013 @ 1:34 am BST Dec 24,2013 | Reply

  12. The list is close to being exhaustive, but I would add some structure to it, perhaps grouping everything into categories like:
    1. More I/O (because of less efficient plan)
    2. More I/O (because of more data or different data)
    3. Same type of I/O takes longer (e.g. I/O saturation or a hardware failure)
    4. Drop in caching efficiency (on database, OS or storage layer)
    5. Direct concurrency
    6. Indirect concurrency (read consistency overhead)

    Comment by savvinov — December 24, 2013 @ 7:55 am BST Dec 24,2013 | Reply

    • Nikolay,

      I think there are two directions to approach the problem from – one is “what is the problem?”, the other is “what are the symptoms?”. The list is aimed at the first approach. We can look at symptoms later and talk about how to map between symptoms and causes. One of the problems with discussing symptoms is that it’s too easy to oscillate between the helicopter view and the microscopic view while failing to connect with the user experience.

      Comment by Jonathan Lewis — December 31, 2013 @ 3:39 pm BST Dec 31,2013 | Reply

  13. 1. Someone has changed one or two optimizer parameters at a system/session level so that the query execution plan changes for the worse
    2. Someone has changed optimizer_use_sql_plan_baselines to FALSE so that a new plan is used instead of a known stable SPM execution plan.
    3. Someone has dropped/modified an object so that a used SPM baseline ceases to be reproducible and the optimizer goes for a new sub-optimal plan.
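
    A quick check for all three – a sketch, with a hypothetical text filter:

        -- Are the baselines for the suspect statement still enabled,
        -- accepted, and reproducible?
        select  sql_handle, plan_name, enabled, accepted, fixed,
                reproduced, last_executed
        from    dba_sql_plan_baselines
        where   sql_text like '%some distinctive fragment%'  -- hypothetical
        order by
                sql_handle, plan_name;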

    Mohamed Houri

    Comment by hourim — December 24, 2013 @ 8:33 am BST Dec 24,2013 | Reply

    • Mohamed,

      I’m inclined to ignore that “changes in parameter” class in the context of “random changes” – although the first time a catastrophic performance problem appears it’s probably one of the first things to check (along with Connor’s list of how to wreck the system).

      The potential for a change due to “losing” or gaining an SQL Baseline (or profile) is one I’ve added as 17

      Comment by Jonathan Lewis — December 31, 2013 @ 3:49 pm BST Dec 31,2013 | Reply

  14. 1) parameter optimizer_mode is changed.
    2) parameter optimizer_index_cost_adj is changed.

    Comment by vijay sehgal — December 24, 2013 @ 10:10 am BST Dec 24,2013 | Reply

  15. Here are a few I can think of:

    - the query is over a database link and stats/data/indexes/etc. have changed on the remote rather than local system
    - the query is over a database link and the driving site has flipped from local to remote or vice-versa
    - the query is over a database link and there are issues with the network between the databases
    - an SQL profile has been created (which gives worse performance)
    - the underlying storage is shared with some other system which occasionally makes large I/O requests at the same time as your SQL
    - the SQL drives some row-by-row processing and it’s the processing that takes longer, not the SQL itself
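
    For the driving-site flip, a sketch of the hint that pins it (link and table names invented):

        -- Hypothetical distributed join: the hint fixes execution at
        -- the remote site so the (large) remote table isn't pulled
        -- across the network on every run.
        select  /*+ driving_site(r) */
                l.order_id, r.detail_text
        from    local_orders              l,
                remote_details@other_db   r
        where   r.order_id = l.order_id;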

    Comment by Chris Saxon — December 24, 2013 @ 10:54 am BST Dec 24,2013 | Reply

    • Chris,

      I like the database link comments – added in at 16. I’ve also included broad-brush comments about profiles and SQL Baselines at 17, external systems at 9.3, and the “not the SQL” one as 18.

      Comment by Jonathan Lewis — December 31, 2013 @ 3:56 pm BST Dec 31,2013 | Reply

  16. Tables containing volatile data such as AQ tables. The plan may stay constant but the underlying segment could move from 0 rows to 1000s of rows causing performance to fluctuate and then improve as data volumes drop again.

    Comment by Neil Johnson — December 24, 2013 @ 12:14 pm BST Dec 24,2013 | Reply

  17. “Random slowdown” is of course the reason for this kind of question – though the effect could also be a random speedup, but that’s seldom a reason for complaint, I guess.

    The threads also show another typical aspect of such general questions without much specific information: the OP does not reappear after asking his question.

    Comment by Martin Preiss — December 24, 2013 @ 9:43 pm BST Dec 24,2013 | Reply

  18. Here are a few that I can remember coming across in the past.

    a. Competing parallel sessions downgraded to serial, or to fewer parallel servers, because the maximum number of parallel servers was exceeded.
    b. The Resource Manager group that the job is assigned to ran out of the resources allocated to it.
    c. A flood of logons/logoffs caused by the application can degrade database performance.
    d. Caused by SGA component re-sizing where Automatic Shared Memory Management is enabled.
    e. In a RAC environment, queries affected by some rolling upgrade/patches.

    Comment by Natik Ameen — December 25, 2013 @ 12:44 am BST Dec 25,2013 | Reply

    • Natik,
      I think (e) is outside the scope of “random changes” – and most of the others are covered by “resource competition” – but I like (b): the SQL didn’t do any work differently, the resource manager was just stopping it from running – added at 19.

      Comment by Jonathan Lewis — December 31, 2013 @ 4:01 pm BST Dec 31,2013 | Reply

  19. File this under “just when I thought I knew a little bit about how Oracle worked…” – but how could a shrink space against a table cause a performance problem? If you hadn’t already called out changes in execution plans earlier, I would assume this meant that a plan that was once efficient, because of how the blocks were situated, is rendered suboptimal after a shrink space has compacted the blocks.

    Comment by Alex — December 26, 2013 @ 8:04 pm BST Dec 26,2013 | Reply

    • When you shrink a table you are re-arranging data from the end of the table to the beginning of the table. This, obviously, will change the clustering factor (CF) of the attached indexes. A change in this index property might damage the desirability of an index, and might make the CBO choose a full table scan, or a less precise index range scan, consequently altering the performance of the query.
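
      A sketch of how you might see it (names invented): capture the clustering factor before and after the shrink and compare:

        -- Hypothetical before/after check of the clustering factor.
        select index_name, clustering_factor
        from   user_indexes
        where  table_name = 'T1';

        alter table t1 enable row movement;
        alter table t1 shrink space;

        exec dbms_stats.gather_table_stats(user, 'T1', cascade => true)

        select index_name, clustering_factor
        from   user_indexes
        where  table_name = 'T1';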

      Comment by hourim — December 27, 2013 @ 2:46 pm BST Dec 27,2013 | Reply

    • Alex,

      There is some overlap with other reasons – so I’ve added in 14.1, 2, and 3

      Comment by Jonathan Lewis — December 31, 2013 @ 4:07 pm BST Dec 31,2013 | Reply

  20. A large contiguous chunk of empty blocks at the beginning of a table that is the subject of a “where exists” predicate executed as a scan with a stopkey. This can come and go like Jekyll and Hyde with the mere insertion or deletion of a row near the beginning of the table. Similarly, this can affect things like intentional stopkeys used to process some number of rows at a time, when fewer/more of the requisite number of rows appear before the large empty chunk. It is often associated with appending new rows (direct path) into transaction tables while slowly cleaning out older rows (emptying the front over time). Then onesy-twosey insertions made by on-line processes may temporarily fill in some of the low blocks as they come up as insertion candidates. (So knowing that Oracle does not maintain a low water mark, that direct appends take place above the high water mark, and that you cannot control the order of the free block list, all become design issues.)
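
    A sketch of the vulnerable shape (names invented) – the existence probe stops at the first matching row, so its cost depends entirely on how far into the table that row happens to be:

        -- Hypothetical existence check.  If the first 'PENDING' row is
        -- in the first block this is instant; if a purge has emptied
        -- the front of the table it may crawl through thousands of
        -- empty blocks before it can stop.
        select  count(*)
        from    dual
        where   exists (
                        select  null
                        from    transaction_queue
                        where   status = 'PENDING'
                );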

    Comment by rsiz — December 27, 2013 @ 2:16 pm BST Dec 27,2013 | Reply

  21. How about:
    - Due to a varying number of sessions, the PGA given to (especially) a batch process is much less than usual. This forces much more use of TEMP for multi-pass sorts or hash joins.
    - The query has the same plan as always, but a query within a PL/SQL function called by the outer query has a different plan.
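
    A sketch of the second case (all names invented) – the outer statement’s plan never changes, but the real work hides inside the function:

        -- Hypothetical function: the statement inside it has its own
        -- cursor and its own plan, which can change independently of
        -- the outer query.
        create or replace function lookup_rate(p_code varchar2)
        return number
        is
                l_rate number;
        begin
                select  rate
                into    l_rate
                from    rate_table
                where   code = p_code;
                return  l_rate;
        end;
        /

        select  order_id, amount * lookup_rate(currency_code)
        from    orders;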

    Comment by Mike Tefft — January 2, 2014 @ 7:12 pm BST Jan 2,2014 | Reply

    • Mike,

      I think the first one falls into the general “competition for resources” category.
      The second one is a nice one – added in at 20.

      Comment by Jonathan Lewis — January 2, 2014 @ 7:44 pm BST Jan 2,2014 | Reply

  22. How about a previously cached result (result cache) getting invalidated?

    Comment by Joaquin Gonzalez — January 7, 2014 @ 4:24 pm BST Jan 7,2014 | Reply

    • Joaquin,

      Result Cache invalidation can introduce a lot of instability – but it tends to be a more persistent problem, i.e. the query doesn’t tend to be stable for a long time with occasional extreme changes; invalidation tends (at least in my experience) to happen relatively often.
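
      A quick sketch to get a feel for the rate on a given system:

        -- How many cached results are still usable, and how many have
        -- been invalidated by DML against their dependencies?
        select  status, count(*)
        from    v$result_cache_objects
        where   type = 'Result'
        group by
                status;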

      Comment by Jonathan Lewis — January 10, 2014 @ 9:40 am BST Jan 10,2014 | Reply

  23. I recently saw this quite a lot on 11.2.0.3 in a fairly concurrent environment (20-ish parallel sessions doing batch processing): dynamic sampling at level 2 arriving at anything from 500K to 32M(!) rows on a 2M-row temp table in a two-table join, leading to the wrong order for a hash join, which took a few times longer to finish and consumed a lot more additional temp space. Arguably this could be added as a 2.2 (i.e. fluctuation of dynamic sampling results on temp tables).
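
    One workaround sketch (names invented, and note that cardinality is an undocumented hint) – pin the estimate so the join order can’t flip with the luck of the sample:

        -- Hypothetical fix: replace the unstable dynamic sample with
        -- an explicit cardinality estimate for the temp table, and pin
        -- the join order and method to match.
        select  /*+ cardinality(tmp 2000000) leading(tmp big) use_hash(big) */
                tmp.id, big.payload
        from    load_tmp   tmp,
                big_table  big
        where   big.id = tmp.id;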

    Comment by Alex — January 10, 2014 @ 9:29 am BST Jan 10,2014 | Reply

