Oracle Scratchpad

September 3, 2009

Queue Time

Filed under: Performance,Troubleshooting — Jonathan Lewis @ 6:44 pm BST Sep 3,2009

Since Richard Foote has started to encroach on my territory (by writing about the CBO), I’ve decided to respond by moving briefly into Cary Millsap’s speciality (by writing about queueing). I don’t intend to get very technical, though; I just want to give an example of how queue theory relates to Oracle by answering a question I got from a client a few weeks ago:


    “How can they be complaining that response times are worse, the throughput is up by 5%?”

The unfortunate answer, of course, is that the response time might be worse because the throughput is better; and all I want to do in this note is give you a carefully constructed example to show how this can happen:

  • Assume you have a machine with one CPU.
  • Assume you have two processes that wake up periodically to do some work.
  • Process 1 runs a job that uses 0.1 CPU seconds (and no other resources), produces N units of output per execution and wakes up once per second to run.
  • Process 2 runs a job that uses 0.5 CPU seconds (and no other resources), produces 5N units of output per execution and wakes up once every five seconds to run.

In any ten seconds, each process (running alone on the machine) uses 10% of the CPU and produces 10N units of output, and response time matches CPU time. But what happens when both processes are started up at roughly the same time?

If you’re lucky, process 2 will always start running its job shortly after process 1 has just finished a run, and finish its job shortly before process 1 starts its next run.

If you’re unlucky, both processes will start a run simultaneously – and only one of them will get the CPU. At this point, a typical machine will be using time-slicing to make it appear that the two processes are actually running concurrently, so the two jobs will start switching on and off the CPU every 0.01 seconds (say). The (approximate) effect of this is that process 1 completes its job after 0.2 seconds, having spent 0.1 seconds working and 0.1 seconds waiting, and process 2 completes its job after 0.6 seconds, having spent 0.5 seconds working and 0.1 seconds waiting.

Response times are worse – dramatically so in the case of process 1. We have plenty of spare capacity (80%, in fact) on the CPU, but the timing of the arrival of jobs makes a big difference to the response time for each job.
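As a rough check on the arithmetic of the unlucky case, here is a minimal round-robin sketch in Python. It assumes both jobs arrive at exactly the same instant, a time slice of 0.01 seconds, no context-switching cost, and that process 1 happens to get the first slice. The function name round_robin and the use of integer "ticks" are illustrative choices only, not a model of any real operating system scheduler.

```python
from collections import deque


def round_robin(jobs, quantum=1):
    """Simulate jobs that all arrive at time 0 on a single CPU.

    jobs maps a job name to its CPU demand in ticks (here 1 tick = 0.01 s).
    Returns a dict of job name -> completion time in ticks.
    """
    queue = deque(jobs.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)   # run for one slice, or less if nearly done
        clock += run
        remaining -= run
        if remaining:
            queue.append((name, remaining))  # back to the end of the run queue
        else:
            finished[name] = clock
    return finished


# Assumed workload: 0.1 s and 0.5 s of CPU, expressed in 0.01 s ticks.
done = round_robin({"process 1": 10, "process 2": 50})
for name, ticks in done.items():
    print(f"{name}: response time {ticks / 100:.2f} s")
```

Under those assumptions process 1 finishes after 0.19 seconds instead of 0.1, and process 2 after 0.60 seconds instead of 0.5, which agrees with the approximate figures above.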

In the case of my client, we had a lot more processes like process 1 running, and used some of the spare capacity on the machine to push through a lot more of the 0.1 second tasks – so our throughput went up; but as we increased the number of tasks, we increased the chances of them colliding with the 0.5 second job (and with each other) so individual response times got worse.

To move from my trivial example to a more realistic model of the world you need Queue Theory. I made my example as simple as possible with a fixed arrival rate for two tasks of fixed length arriving at regular intervals. To model the real world you need to think about tasks of variable length arriving at randomly distributed intervals – and the mathematics gets a bit harder.

But you don’t need to follow the details of the mathematics to understand the critical consequences: response times can vary significantly because of arrival time even when the machine is far from fully loaded, and response time can get worse even when (or possibly because) throughput is improving.
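To give a feel for how the mathematics behaves, the sketch below uses the simplest textbook queueing model, usually called M/M/1: one server, randomly timed arrivals and exponentially distributed service times. In that model the mean response time is S / (1 - U), where S is the mean service time and U is the utilisation. The choice of model, the 0.1 second service time and the function name mm1_response_time are assumptions made for illustration; this is not a model of the client's system.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time (queueing + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return service_time / (1 - utilization)


SERVICE_TIME = 0.1  # assumed mean CPU seconds per job

for utilization in (0.2, 0.5, 0.8, 0.9):
    throughput = utilization / SERVICE_TIME  # jobs completed per second
    response = mm1_response_time(SERVICE_TIME, utilization)
    print(f"utilization {utilization:.0%}: "
          f"{throughput:.0f} jobs/s, mean response {response:.3f} s")
```

As the loop pushes more work through the same CPU, throughput rises in step with utilisation, while the mean response time rises faster and faster: in this model it is already 25% above the bare service time at 20% utilisation, and double it at 50%.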

For more comments on the response time/throughput dilemma, see this item by Doug Burns.

19 Comments »

  1. Oh, where do I begin? At first I was going to start with

    started to encroach on my territory

    You chaps have *territories*? I really must keep up.

    Next, as I read more, I was thinking ‘tell me about it. This is what I’m spending quite a lot of my time at the moment trying to convince people of’.

    Then, at the end, you posted a link. Thanks ;-)

    Comment by Doug Burns — September 3, 2009 @ 9:12 pm BST Sep 3,2009 | Reply

    • Oh, where do I begin? At first I was going to start with
      started to encroach on my territory
      You chaps have *territories*? I really must keep up.

      Irony – is that like steely only less shiny ?

      Comment by Jonathan Lewis — September 4, 2009 @ 9:46 am BST Sep 4,2009 | Reply

      • Noooo, Doug is more a Bronzy type of person. Though I hear he is going up in the world, so he might be getting especially shiny, towards Silvery…

        This is “Mr AWR” we are talking about :-)

        Comment by mwidlake — September 4, 2009 @ 10:08 pm BST Sep 4,2009 | Reply

  2. “Territory”

    Looks like a war is coming! LOL!

    Comment by lascoltodelvenerdi — September 4, 2009 @ 8:46 am BST Sep 4,2009 | Reply

  3. @Jonathan

    With your example you can also explain why parallel query sometimes performs badly.

    With parallel query more processes spawn, so more “work” for the OS scheduler, so more wait time on the OS side.

    Bye,
    Antonio

    Comment by lascoltodelvenerdi — September 4, 2009 @ 9:07 am BST Sep 4,2009 | Reply

  4. so more “work” for the OS scheduler, so more wait time on the OS side.

    … is something I talk about a bit here (PDF)

    Comment by Doug Burns — September 4, 2009 @ 9:09 am BST Sep 4,2009 | Reply

  5. Doug, Jonathan

    Your comment made me laugh, Doug; for a moment I just had a vision of Jonathan with his ice cream van charging round an estate somewhere, with Cary following in his ice cream van, ready to fight for territory! Or maybe just queue up their vans!

    Cheers, thanks for the light break in my work!

    Pete

    Comment by Pete Finnigan — September 4, 2009 @ 9:18 am BST Sep 4,2009 | Reply

  6. Turf war. Nice.

    A more extreme example of this is the response of disk drives where the time-slicing between two simultaneous processes, for example full table scans of different tables, has an even higher performance penalty. Two scans that take a total of 20 seconds (10 seconds each) when the drive performs only one at a time can take much longer than 20 seconds total when performed simultaneously because of the penalty imposed by head movement. 30+ seconds would not be at all surprising.

    It would be interesting if a hypothetical RDBMS recognised this situation and suspended Query B until Query A finished, giving an average response time of 15 seconds (10 & 20 seconds) instead of an average of 30 seconds (both queries taking 30).

    Comment by David Aldridge — September 4, 2009 @ 9:24 am BST Sep 4,2009 | Reply

    I’ve recently talked about the same concept to some clients when discussing performance tuning and analysis – you can increase total throughput and increase individual response time at the same time.

    I tried to use the example where a system has quite a lot of spare capacity (idle CPU cycles). It is possible that an increase of 20% in the workload – transactions submitted per minute – could cause greater contention (queuing) and so result in a 10% increase in the response time of each transaction. Although each individual transaction now takes 10% longer to complete, because we are submitting 20% more transactions per minute the overall effect is an increase in throughput of 10% (20 minus 10).

    And my point to the client was “Which is more important to you? Response time or Total Throughput? Is it the time per transaction, or the elapsed time to process all transactions?”. Generally response time is important to interactive applications, such as those on the web, while throughput is more important to batch processing, such as overnight jobs.

    John

    Comment by John Brady — September 4, 2009 @ 9:28 am BST Sep 4,2009 | Reply

    You can get a similar effect where the users play a major part.

    A report takes 10 minutes to run, doing a lot of CPU (50% of its execution time) on an 8-CPU machine. No one runs it much as it takes 10 minutes. The odd execution hogs 50% of a CPU for 10 minutes, leaving 7 and a bit to cope. It might use up 1% of your daily CPU.

    You go and tune the report, speeding it up so it takes 30 seconds. But it now burns a CPU at 100%.
    That uses fewer resources overall.
    BUT
    People realise the report comes back so much quicker and so they keep using it.

    Now, instead of the odd mild impact, the damn thing is executed once or twice a minute. Now it is taking up between half a CPU and one CPU all the time, because you went and made it run more efficiently. As Cary Millsap would point out, sometimes through random chance several executions would be running at any given point (let’s say 4), taking up 50% of your total CPU.

    Don’t make things work too efficiently, people will only go and use them :-)

    Comment by mwidlake — September 4, 2009 @ 10:22 pm BST Sep 4,2009 | Reply

    • I’ve often thought that:

      Bad systems tend to fail because no-one wants to use them

      Good systems tend to fail because everyone wants to use them – so they get overloaded.

      The systems that tend to succeed are the mediocre ones: not so bad that they have to be replaced, but not so good that people want to use them constantly.

      Comment by Jonathan Lewis — September 5, 2009 @ 11:15 am BST Sep 5,2009 | Reply

      • Sounds strange but logical. Anyway, I can think of some exceptions – good systems that get more and more loaded and still manage to cope with it – like Google, Amazon, Facebook etc. So it is possible. :)

        Comment by Todor Botev — September 7, 2009 @ 7:46 am BST Sep 7,2009 | Reply

        • … forgot to mention WordPress – which kindly makes this discussion possible. :)

          Comment by Todor Botev — September 7, 2009 @ 7:49 am BST Sep 7,2009

    This is also Craig Shallahamer’s (OraPub) speciality, with probably the best book ever written on the subject and dedicated to Oracle:

    “Forecasting Oracle Performance”

    http://resources.orapub.com/Forecasting_Oracle_Performance_Book_p/fop_book.htm

    Comment by Olivier — September 7, 2009 @ 7:52 am BST Sep 7,2009 | Reply

