Oracle Scratchpad

November 12, 2009

No change

Filed under: Troubleshooting — Jonathan Lewis @ 7:50 pm BST Nov 12,2009

Every now and again I see a note on the OTN database forum, or Metalink, or the newsgroup asking why the time taken to run a query can vary so much between executions … even when the plan didn’t change (and the bind variables used as inputs didn’t change).

Just off the top of my head, here are a few possibilities:

  • Sometimes the required data is perfectly cached, sometimes it isn’t
  • Sometimes other processes are using the CPU or disks very aggressively
  • The volume of data can change significantly over short periods of time
  • A long-running transaction may sometimes require you to do more work generating read-consistent data
  • A small change in data can cause a huge variation in the number of times a subquery runs
  • A plan involving a hash join or sort may execute optimally on one occasion and multi-pass on another
  • A parallel plan may get all the PX slaves it wants on one occasion and none of them on another

Any further suggestions are welcome – remember, though, we are assuming that (a) the plan really didn’t change, (b) the input bind values really didn’t change, (c) your environment didn’t change and (d) there isn’t a mad scientist (or novice DBA) in the background doing unusual things to a production system.
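One way to start distinguishing between these possibilities is to measure the work a given execution does rather than just its elapsed time. A minimal sketch in Oracle SQL (the exact statistic names vary slightly between versions) that snapshots the current session’s statistics so that two executions of the same query can be compared:

```sql
-- Snapshot the session's work-related statistics; run this immediately
-- before and after the query, then diff the two sets of values.
select sn.name, ms.value
from   v$mystat ms, v$statname sn
where  sn.statistic# = ms.statistic#
and    sn.name in (
         'session logical reads',
         'physical reads',
         'consistent gets',
         'consistent changes',        -- undo records applied for read-consistency
         'CPU used by this session'
       );
```

A jump in “physical reads” between a fast run and a slow run points at caching effects; a jump in “consistent changes” points at read-consistency work caused by concurrent transactions.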

23 Comments »

  1. +
    * A query might be affected by delayed block clean out
    * A parallel query may be affected by a segment-level checkpoint
    * An update statement may restart (potentially more than once)

    Comment by Alex Fatkulin — November 12, 2009 @ 8:23 pm BST Nov 12,2009 | Reply

    • Hi Alex,

      >* A parallel query may be affected by a segment-level checkpoint
      Or even worse – tablespace level in versions prior to 10gR2

      Comment by Timur Akhmadeev — November 12, 2009 @ 8:53 pm BST Nov 12,2009 | Reply

      • Timur,

        The object-level checkpoint for parallel query appeared in 8.0 (according to an old note I have on my website). I guess it’s possible that it changed to a tablespace checkpoint between 8i and 10g somewhere and then changed back, though.

        Comment by Jonathan Lewis — November 12, 2009 @ 9:47 pm BST Nov 12,2009 | Reply

        • Jonathan,

          that’s surely possible. I’ve checked an 8.1.7 instance and there’s no such statistic in v$sysstat:

          SQL> select name from v$sysstat where lower(name) like '%flush%';
           
          NAME
          ----------------------------------------------------------------
          global cache cr block flush time
           
          SQL> select * from v$version where rownum = 1;
           
          BANNER
          ----------------------------------------------------------------
          Oracle8i Enterprise Edition Release 8.1.7.0.0 - Production

          I found a mention of “a new feature” in this white paper:
          “Prior to Oracle Database 10g Release 2, the checkpoint request is handled by issuing a checkpoint for the tablespace the object belongs to, writing out all the dirty buffers for the entire tablespace. Since a large number of objects may reside in the same tablespace, this implementation may cause large number of unnecessary disk writes. With the new release, a checkpoint request for a target object will only write out the dirty buffers of that object, without incurring any additional writes for the dirty buffers of other objects.”

          Comment by Timur Akhmadeev — November 13, 2009 @ 8:19 am BST Nov 13,2009

        • Timur,

          You’re right – although I wouldn’t assume the feature has disappeared simply because a given statistic name has disappeared. I did a quick check on an old 8.1.7.4 system (single instance), and the resulting checkpoint was clearly a tablespace checkpoint.

          With hindsight, it’s possible that 8.0 never had an “object checkpoint” – the thing may have been in a cross-instance call that said: “checkpoint object X”, but it’s possible that the receiving database then did a tablespace checkpoint of the appropriate tablespace – with the same behaviour appearing even when running single instance (as in my website example).

          The 10gR2 feature that allows for fast segment checkpoints is one that may have been introduced to allow fast dropping or truncating of objects (which could have clean blocks in the buffer cache) – when I first noticed it, it didn’t cross my mind that it might be a clue that I was wrong about segment checkpoints.

          Comment by Jonathan Lewis — November 13, 2009 @ 9:18 am BST Nov 13,2009

  2. * Bugs (aka sh*t happens)! :-) For example, your query is executed in a different schema with the same objects but different data – there are such bugs and I really hate them.
    * The plan didn’t change, but it was a huge query, the plan was scavenged and it took some time to parse the query again
    * Other sessions perform lots of work with protected resources (CBC latches, shared cursors) and your session wants to use another tiny thing of the same structure too
    * There’s a fault (in memory, CPUs, disks, data – in server, network or client) and it is smoothly handled by infrastructure and OS (though this situation might be considered the same as different processes using resources heavily). Not sure if it fits “environment didn’t change”.
    * Addition to the PX: a parallel plan may get all the PX slaves on different nodes in RAC instead of the current one (and everything runs just fine in the latter case)
    * A measurement was incorrect – let’s say, DST issues kick in :)

    PS. caching is the #1, of course.

    Comment by Timur Akhmadeev — November 12, 2009 @ 8:50 pm BST Nov 12,2009 | Reply

  3. I’m dealing with this problem just now!

    Got three systems, all 10g: you run the query now, 15 seconds. Rerun it after a while, 10 seconds. Run it now: after 10 minutes the query is still running.

    After a bit of investigation, I saw that in the 10-minute scenario a lot of message passing between consumer and producer was happening.

    I know what, but not why…for now!

    Comment by lascoltodelvenerdi — November 12, 2009 @ 9:08 pm BST Nov 12,2009 | Reply

    • lascoltodelvenerdi,

      consumer/producer – sounds like you’re talking about running parallel query. A couple of possibilities:

        The query involves a “ranger” step to work out a range-based distribution and the range picks an unlucky sample and distributes the data badly so that one slave does most of the work.

        A small change in the data size makes the granule distribution change slightly, and most of the interesting data ends up going to one of the slaves by simple bad luck.

      You can check up on the data and work distribution if you can query v$pq_tqstat after running the query – another link from Feb 1999

      Comment by Jonathan Lewis — November 12, 2009 @ 9:52 pm BST Nov 12,2009 | Reply
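For anyone wanting to check the data distribution Jonathan mentions, a typical query against v$pq_tqstat (it has to be run from the same session, immediately after the parallel query completes) might look like this – badly skewed num_rows across processes of the same server_type is the tell-tale sign of unlucky distribution:

```sql
-- Table queue statistics for the last parallel query run by this session.
select dfo_number, tq_id, server_type, process, num_rows, bytes
from   v$pq_tqstat
order by dfo_number, tq_id, server_type desc, process;
```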

  4. * Heavy reliance on the operating system file buffer (buffered I/O) – sometimes a lot of memory is available to be used for the operating system’s file buffer cache (which is also available to cache Oracle blocks), and other times little memory is available for the operating system’s file buffer to cache Oracle blocks.
    * Occasionally exceeding the physical memory in the server which occasionally causes excessive usage of the swap file. What Oracle reports as an in-memory block read may actually require reading the block from the swap file.
    * Changes in network performance between the server and client computer. This could be a symptom of failing network equipment (or a bad routing table entry). This could also be a symptom of the network administrator setting up a switch with a port specified as a network monitoring port, thus causing all network packets on the switch to be duplicated to one of the switch’s ports (required to detect network intrusions) – if the total bandwidth of data in the switch pushed to that port exceeds the attached device’s ability to quickly read the packets, or exceeds the maximum speed of the monitoring switch port, all network traffic on the switch will be constrained (this could be an intermittent problem).
    * Occasional contention in the SAN caused by another instance, or activity unrelated to the Oracle database (such as also using the SAN for file sharing, a file backup destination, etc.).
    * A statistics gathering process forces other blocks used by the query from the Oracle buffer cache (this might not be as much of a problem on recent releases).
    * Changing the buffer pool specification (DEFAULT, KEEP, RECYCLE) for one or more of the objects involved in the query.

    Comment by Charles Hooper — November 12, 2009 @ 9:11 pm BST Nov 12,2009 | Reply

  5. Too many sessions are running the same query/statement. I have seen 20 sessions running the same query – 1 session reading the data from the files while the other 19 wait on “read by other session”.

    Comment by Daniel Fink — November 12, 2009 @ 9:56 pm BST Nov 12,2009 | Reply
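Daniel’s scenario is visible directly in the wait interface: from 10g onwards the time spent queueing behind another session’s read is reported as a separate wait event, so a quick (version-dependent) check across sessions is:

```sql
-- Sessions waiting for a block that another session is already reading
-- from disk show up under the 'read by other session' event (10g+).
select sid, total_waits, time_waited
from   v$session_event
where  event = 'read by other session'
order by time_waited desc;
```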

  6. Self-tuning db decides to tune itself?

    Comment by joel garry — November 12, 2009 @ 11:00 pm BST Nov 12,2009 | Reply

  7. Jonathan,

    I have been executing the same query with the same bind variables for several days (the query returns, and always will return, a fixed number of records – 33). Sometimes it is fast and sometimes it is slow.
    The following points can be made about the possible reasons for this random response time:
    (3) The volume of data can change significantly over short periods of time
    It is not the case for me because all over the day there are only selects in this particular database

    (4) A long-running transaction may require you to do more work generating read-consistent data sometimes.
    It is not the case for me because all over the day there are only selects in this particular database (no updates, no deletes, no inserts)

    (5) A small change in data can cause a huge variation in the number of times a subquery runs
    It is not the case for me because all over the day there are only selects in this particular database (note that during one day the query went from slow to quick and vice-versa)

    (1) Sometimes the required data is perfectly cached, sometimes it isn’t
    I am not able to say anything about this in my case

    (6) A plan involving a hash join or sort may execute optimally on one occasion and multi-pass on another
    As I said in the OTN thread, when the query is running very slow, a change of the order by makes it run very fast. So I can conclude that when the query decides to be slow, it is because of something during its sort process. The order by is dynamic and involves a select on a partitioned table that has been altered, together with its locally partitioned index, to be parallel 4 in order to improve a “select/insert where not exists” in an overnight batch process.

    (7) A parallel plan may get all the PX slaves it wants on one occasion and none of them on another
    Maybe it is the case here, due to this select on the partitioned table altered to be parallel 4, which is involved in the dynamic order by clause.

    Actually the query has been running fast since yesterday at around 13:00. I will post something in the OTN thread when it decides to be slow!

    Mohamed Houri

    Comment by Mohamed Houri — November 13, 2009 @ 10:03 am BST Nov 13,2009 | Reply

    • Mohamed,

      It sounds like the best thing you can do to investigate is to adopt the strategy I used for the case which showed that a large part of the variation was due to caching effects, and recent (or concurrent) activity, viz: wrap the query in a script that checks v$mystat and v$session_event for the work done and time lost. In your case you might also want to include a call to “select * from table(dbms_xplan.display_cursor);” immediately after you run the query so that you can capture the actual plan used in the execution, and a query against v$pq_tqstat to check what went on between the parallel query slaves.

      Comment by Jonathan Lewis — November 14, 2009 @ 8:48 pm BST Nov 14,2009 | Reply
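A sketch of the kind of wrapper Jonathan describes – my_query.sql is a hypothetical script name holding the query, and the exact columns available vary by version:

```sql
-- 1) Run the query itself (hypothetical script name).
@my_query.sql

-- 2) Capture the plan actually used by the last statement executed.
select * from table(dbms_xplan.display_cursor);

-- 3) See where this session lost its time.
select event, total_waits, time_waited
from   v$session_event
where  sid = (select sid from v$mystat where rownum = 1)
order by time_waited desc;
```

Comparing the output of a fast run with that of a slow run should show whether the time went on I/O, PX messaging, or something else entirely.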

  8. Jonathan,

    one thing about this kind of checkpoint I’d like to point out: it is really a “segment-level checkpoint”, not “object-level” as one might conclude after reading the Oracle white paper. I’ve verified that in 11.2.0.1.

    Comment by Timur Akhmadeev — November 13, 2009 @ 8:52 pm BST Nov 13,2009 | Reply

  9. Timur,

    It’s unfortunate that the documentation (and Metalink, and even some of the white papers) often have little errors of this type in them. In many cases it comes from talking about the basic cases, and overlooking the more subtle cases.

    There are several cases where object != segment, but to many people they are synonymous and it’s easy to forget things like clusters, partitioned tables, IOTs and so on.

    Technically the linked list is based on the data_object_id – following which your comment is a natural consequence.

    Comment by Jonathan Lewis — November 14, 2009 @ 3:36 pm BST Nov 14,2009 | Reply

  10. (late to the party, as always)
    1) RAC – swapping from being allocated to the node hosting and processing a large percentage of the data your query uses to another node (or the other processing code swapping to a new node).
    or – your RAC nodes are not equal in power (a bad situation to have, but it happens)
    2) Some other major non-Oracle user of resource is swapping on and off your server, your network or your storage. (You mention this right at the start Jonathan, but I think it’s worth repeating, as the other resource user may have nothing to do with your database or even Oracle, and it is so easy to miss that.)

    Comment by mwidlake — November 15, 2009 @ 12:17 pm BST Nov 15,2009 | Reply

  11. My vote is for ‘obscure interactions’.

    User defined functions (typically in the SELECT list expressions or in the WHERE clause) introduce a whole separate level of interaction. If they happen to involve queries of their own and are run for each row, the issues previously mentioned may be magnified.

    VPD, query rewrite, and other ‘policy’ or ‘trigger’ interactions may have a hidden impact that again exacerbates the standard issues.

    (And, with RAC, one question becomes ‘are you on the node that has the data’?)

    Comment by Hans Forbrich — November 15, 2009 @ 4:29 pm BST Nov 15,2009 | Reply

  12. No change is a relative term. In this case, the question you asked has not changed (the query is not changed) and the way Oracle finds the answer has not changed (the execution plan is not changed). Everything else is changing. The demand for resources is changing because your session is competing with others for CPU, memory, I/O, network, latches and locks. The data needed for the answer is changing with the DML from other sessions. The physical location of the data you need is changing (data block, index block, undo segment, control file; disk read or memory read). Even if your session is the only one in the database, background processes are still running. Maybe your server is shared. Most likely your SAN is shared. Even if you have a dedicated system and network just to run this query, your server backup is running, memory management is going on (including paging), CPU performance is not constant across multiple CPUs each with millions and millions of transistors, the power supply voltage goes up or down, the data center air temperature is not constant, the nearby microwave is leaking radiation… Well, you get the picture.

    Comment by Wayne Lang — November 15, 2009 @ 5:20 pm BST Nov 15,2009 | Reply

    • Wayne,

      There’s one particularly interesting point in your list that may not be commonly known:

        “The physical location of the data you need is changing (data block, index block, undo segment, control file; disk read or memory read).”

      If you’re running on WAFL (Write Anywhere File Layout) – the technology used by NetApp – or a similar copy-on-write filesystem such as ZFS, then when you update an Oracle block it won’t be written back to the same location on the disk.

      In unlucky circumstances this could introduce a surprising difference to the performance of the same query visiting the same basic data set if that data was “slowly changing”.

      Comment by Jonathan Lewis — November 19, 2009 @ 6:15 pm BST Nov 19,2009 | Reply

  13. I’m surprised nobody has suggested this one yet:

    * Oracle bug

    Comment by mdavis — November 16, 2009 @ 5:58 am BST Nov 16,2009 | Reply

  14. Good Topic. This can turn into a reference-list of “things to look into”.

    Did anyone mention the “front-end” yet ?
    Is it possible that the calling process (java, sqlplus, dotnet, whatever) may simply take its time peeling off the data record by record, whereby network latency and front-end sluggishness kick in?

    Comment by PdV — November 21, 2009 @ 1:55 pm BST Nov 21,2009 | Reply

  15. I had a recent situation where network latency was affected for a given server doing inserts of data collected from external devices elsewhere in the network. Everything was the same locally, and it took some time to track down the fact that network latency was causing the problem. Another problem of this nature was network throughput on the same system being affected by a load – in this case from network backups.

    Things can get really interesting when queries are done across a network, and folks are not that forthcoming about looking into these problems first before complaining about database performance.

    Comment by Michael Fontana — November 23, 2009 @ 9:45 pm BST Nov 23,2009 | Reply

  16. Networks can so easily be forgotten. I’ve been to several sites that had to be told why a report that ran in seconds in London was taking minutes in New York. (After all, if people keep telling you that “SQL*Net message from client” is an idle wait … sometimes it’s the most important thing going on.)

    Comment by Jonathan Lewis — November 25, 2009 @ 1:22 pm BST Nov 25,2009 | Reply

