Oracle Scratchpad

February 11, 2013

Optimisation ?

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 2:05 pm GMT Feb 11,2013

I was at a client site recently where one of the end-users seemed to have discovered a cunning strategy for optimising a critical SQL statement. His problem was that his query screen times out after two minutes, so any query he runs has to complete in less than two minutes or he doesn’t see the results. Unfortunately he had a particular query which took nearly 32 minutes from cold to complete – partly because it’s a seven-table join using ANSI OUTER joins, against tables ranging through the 10s of millions of rows and gigabytes of data – the (necessary) tablescan of the table that had to be first in the join order took 70 seconds alone.

But our intrepid user seems to have made an important discovery and engineered a solution to his performance problem. I think he’s noticed that when you run a query twice in a row the second execution is often faster than the first. I can’t think of any other reason why the same person would run the same query roughly every four minutes between 8:00 and 9:00 am every morning (and then do the same again around 5:00 in the afternoon).

Looking at the SQL Monitoring screen around 10:00 the first day I was on-site I noticed this query with a very pretty graphic effect of gradually shrinking blue bars as 32 minutes of I/O turned into 2 minutes of CPU over the course of 8 consecutive executions which reported run times something like:  32 minutes, 25 minutes, 18 minutes, 12 minutes, 6 minutes, 4 minutes, 2.1 minutes, 2 minutes.
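
(For anyone who wants to watch for the same pattern, a query along the following lines against v$sql_monitor gives a rough, text-only version of what the monitoring screen was showing; it’s only a sketch – the sql_id is a placeholder, and the timing columns are reported in microseconds:)

    select
            sql_exec_id, sql_exec_start, status,
            round(elapsed_time/1e6)      elapsed_secs,
            round(cpu_time/1e6)          cpu_secs,
            round(user_io_wait_time/1e6) io_secs,
            buffer_gets, disk_reads
    from    v$sql_monitor
    where   sql_id = '0abcd1234efgh'     -- placeholder sql_id
    order by
            sql_exec_start
    ;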

It’s lucky (for that user) that the db_cache_size is 60GB. On the other hand this machine is one of those Solaris boxes that likes to pretend that it’s got 128 CPUs when really it’s only 16 cores with 8 lightweight threads per core – you don’t want anyone running a query that uses 2 solid CPU minutes on one of those boxes because it’s taking out 1/16th of your CPU availability, while reporting a load of 1/128 of your CPUs.

Footnote: the query can be optimised (properly) – it accessed roughly 100M rows of data to return roughly 300 rows (with no aggregation), so we just need to do a little bit of work on precise access paths.
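
(The obvious starting point for that bit of work is to see where in the plan the 100M rows appear. As a minimal sketch of the sort of thing I mean: run the query once with rowsource execution statistics enabled, then compare estimated and actual row counts step by step:)

    alter session set statistics_level = all;

    -- run the problem query once in this session, then report the actuals:
    select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));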

14 Comments »

  1. Hmmm, coincidence perhaps, but we were thinking along similar lines here last week. We have a specific query that usually takes a long time when run the first time as it loads up/accesses the data (50k-plus rows) (Hibernate plus ANSI left outer joins, need I say more?). But in our case we were thinking of running that specific query only once every time we restart the apps, which is about once a week. That query is used a few times a day, and incidentally our SGA is about 64G. The decision hasn’t been made yet, but we were discussing it just last week. And in our case, no, it isn’t Solaris.

    Comment by Raj jamadagni — February 11, 2013 @ 2:59 pm GMT Feb 11,2013 | Reply

  2. I had done almost the same just before you published this article, but my aim was not to put a noticeable workload on the customer database.
    Ideas are in the air.

    Comment by Yuri — February 11, 2013 @ 3:49 pm GMT Feb 11,2013 | Reply

  3. Hmm … very nice optimization trick :-) I think I will schedule all our application queries once per minute :)))))))))

    Comment by olivier bernhard — February 11, 2013 @ 5:08 pm GMT Feb 11,2013 | Reply

  4. It’s a reminder of the days when any DBA worth his salt would be coding to run full table scans of particular tables to get them cached on instance startup following the Sunday night cold backup.

    Good times …

    What’s the significance of the ANSI Outer Join, Jonathan? This isn’t one of those 9i can’t-swap-driving-tables-on-outer-join things is it?

    Comment by David Aldridge — February 11, 2013 @ 7:01 pm GMT Feb 11,2013 | Reply

    • David,

      “Cold backup” – long time since I’ve done one of those ;)
      Still, no matter how bizarre an experiment I come up with, someone always comes up with an example of why it might make sense occasionally.

      No great significance in slipping in the ANSI – it just happened to be the case, and ANSI SQL does make it that little bit harder to generate the necessary SQL Baseline.
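
      (For anyone who hasn’t done it before, the basic call to capture a plan from the cursor cache into a baseline looks something like the following; the sql_id and plan_hash_value here are just placeholders for the values you’d pick out of v$sql:)

      declare
              n  number;
      begin
              n := dbms_spm.load_plans_from_cursor_cache(
                      sql_id          => '0abcd1234efgh',   -- placeholder
                      plan_hash_value => 1234567890,        -- placeholder
                      fixed           => 'NO',
                      enabled         => 'YES'
              );
              dbms_output.put_line('Plans loaded: ' || n);
      end;
      /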

      Comment by Jonathan Lewis — February 13, 2013 @ 5:01 pm GMT Feb 13,2013 | Reply

      • Ah, cold backup is still with us in some circles, and it’s extraordinary to see the lack of competence that is still “out there”. A couple of years ago I saw “hot backups” in NOARCHIVELOG mode in a system about to go into production for a critical client, with a decision taken not to “take the risk” of moving to ARCHIVELOG mode before the switchover.

        In fact every sort of monstrosity and perversion that Oracle experts have been railing against for the past 15 years still has its set of “Old Believers”, sad to say.

        Comment by David Aldridge — February 14, 2013 @ 8:50 am GMT Feb 14,2013 | Reply

        • David,

          The problem with the “Old Believers” is that old beliefs don’t get unpublished on the internet, and there are plenty of people who “tune by google”, find them, and then republish them just to have something in their blogs.

          Comment by Jonathan Lewis — February 16, 2013 @ 9:12 am GMT Feb 16,2013

  5. It’s not that complex. The typical person treats computers like doors: keep banging at the door and sooner or later it will open :)

    Comment by Peter Shankey — February 12, 2013 @ 11:53 am GMT Feb 12,2013 | Reply

    • or, you know, the DBA team does the monitoring and interrupts the query in good time to stop anyone noticing what you’re up to :)

      Comment by Yuri — February 12, 2013 @ 12:00 pm GMT Feb 12,2013 | Reply

  6. >>On the other hand this machine is one of those Solaris boxes that likes to pretend that it’s got 128 CPUs when really it’s only 16 cores with 8 lightweight threads per core – you don’t want anyone running a query that uses 2 solid CPU minutes on one of those boxes because it’s taking out 1/16th of your CPU availability, while reporting a load of 1/128 of your CPUs.

    Jonathan –
    Apologies for asking a naive system-side question: I’m not a systems guy and I don’t understand the difference between a core and a thread.
    Can you throw some light on the above statement?

    (Dealing with a similar situation: more than one DB running on a Solaris box. The overall load is always reported as low, but from experience I have noticed that performance on one database varies depending on what is run on the other database. I understand it’s all using the same physical resources, but I am trying to get some pointers; it *is* a concern even if the system is not running at maximum resource utilization.)

    Thanks

    Comment by Oraboy — February 16, 2013 @ 5:59 pm GMT Feb 16,2013 | Reply

    • Oraboy,

      This isn’t my specialist subject either – when I’m stuck on CPUs etc. I talk to people like Kevin Closson and James Morle to help me out.

      Basic sketch: a single chip – plugged into a single socket – can contain multiple CPU Cores; a single CPU Core may be able to operate in a multi-threaded fashion (2 threads being fairly common nowadays) – and Oracle typically reports each possible thread as a CPU. So my laptop (for example) is reported by Oracle as 1 socket, 4 Cores and 8 CPUs (because it runs two threads per core).
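
      (You can check how your own instance counts these things with a quick query against v$osstat; a sketch only, since the exact set of statistics reported varies a little with version and platform:)

      select  stat_name, value
      from    v$osstat
      where   stat_name in ('NUM_CPUS', 'NUM_CPU_CORES', 'NUM_CPU_SOCKETS')
      order by
              stat_name
      ;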

      A multi-threaded Core is, however, only capable of doing one thing at a time. Just as your operating system can appear to be doing many things at once because of time-slicing and scheduling, your core is switching between threads to give the impression of multiple threads running simultaneously.

      In the case of the core, though, the switches only take place (I believe, though I may be out of date) when a thread “stalls” – by analogy, think of an Oracle session running for a while, then having to wait for a “db file sequential read”; the session “stalls”, and the O/S switches to running another session. In the case of the thread stalling, an equivalent task would be loading a page of main memory into the processor cache – the thread stalls, so a different thread starts. A task that avoids stalling a thread can effectively take over the core.

      As far as your shared hardware is concerned – if by “load” you mean CPU consumption, then it is perfectly feasible that one of your databases could be using a far larger percentage of the available CORE time than the CPU statistics suggest, leading to the behaviour you’re seeing.

      Comment by Jonathan Lewis — February 18, 2013 @ 7:37 am GMT Feb 18,2013 | Reply

  7. Jonathan –
    Thanks much.

    >>if by “load” you mean CPU consumption
    Not just CPU consumption: I am looking at CPU utilization (%usr and %sys) and CPU load (load average and queue length in vmstat).

    Nothing tells me I am anywhere close to reaching the maximum limit (in other words, there’s no pattern where CPU utilization runs beyond 80%, and no huge pile-up on the run-queue either).

    But there are occasions when some operations take longer and, being a simple DBA (scope strictly limited to the database), I cannot seem to find an explanation.

    Perhaps I can take your lead on this (“A task that avoids stalling a thread can effectively take over the core”) and see if I can simulate some operation within a DB that, on the system side, would run continuously on a thread without letting the core be shared by anyone else.
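
    Something as crude as the following might be enough to keep one thread busy on pure CPU (just a rough idea – the loop count is arbitrary and it isn’t something I’d run on a production box):

    declare
            x  number := 0;
    begin
            -- pure CPU work: no I/O and no waits, so the thread should rarely stall
            for i in 1 .. 100000000 loop
                    x := x + mod(i, 7);
            end loop;
    end;
    /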

    Thanks again for your time.

    Comment by Oraboy — February 19, 2013 @ 6:46 pm GMT Feb 19,2013 | Reply

