Oracle Scratchpad

June 25, 2013

System Stats

Filed under: CBO,Oracle,Parallel Execution,Statistics,System Stats — Jonathan Lewis @ 5:27 pm BST Jun 25,2013

Several years ago I wrote the following in “Cost Based Oracle – Fundamentals” (p.47):

The maxthr and slavethr figures relate to throughput for parallel execution slaves. I believe that the figures somehow control the maximum degree of parallelism that any given query may operate at by recording the maximum rate at which slaves have historically been able to operate—but I have not been able to verify this.

Browsing the internet recently, I discovered that no-one else seems to have published anything to verify my comment, so I decided it was about time I did so myself. I’m going to work up to it in two blog notes, so if you do happen to know of any document that describes the impact of maxthr and slavethr on the optimizer’s costing algorithms please give me a reference in the comments – that way I might not have to write the second note.

Part one, then: how does Oracle cost a simple, serial tablescan ? Let’s set up some system statistics, and create a table to work on. As usual I’m working with a locally managed tablespace, 1MB uniform extents, 8KB block size, and freelist management. Any results come from 11.2.0.3, but are also valid for 10.2.0.5:

begin
	dbms_stats.delete_system_stats;
	dbms_stats.set_system_stats('MBRC',64);
	dbms_stats.set_system_stats('MREADTIM',10);
	dbms_stats.set_system_stats('SREADTIM',5);
	dbms_stats.set_system_stats('CPUSPEED',2000);

end;
/
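
To check what those calls have recorded, the values are visible in the underlying stats table – this is just a convenience check (querying sys.aux_stats$ needs suitable privileges):

select
	pname, pval1
from
	sys.aux_stats$
where
	sname = 'SYSSTATS_MAIN'
;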

create table t1
pctfree 99 pctused 1
as
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
	rownum			n1,
	rpad('x',100,'x')	v1
from
	generator	v1,
	generator	v2
where
	rownum <= 4e4
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'T1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

I’ve defined the table in a way that ensures I get one row per block, and with 40,000 rows (4e4) the table stats report 40,000 blocks. You’ll notice that I haven’t set the maxthr and slavethr values after deleting system stats. The figures I have used tell Oracle that a single block read takes 5 ms, a multiblock read takes 10ms and a typical multiblock read will be 64 blocks (512KB).
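(As a quick check after gathering stats, the figures are visible in the usual dictionary view:)

select
	blocks, num_rows
from
	user_tables
where
	table_name = 'T1'
;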

To check costing, I’m going to use explain plan for a simple query, using dbms_xplan to report the plan, then following up by selecting a couple of columns from one of the rows in the plan table. The following text shows the SQL I used, followed by the two sets of results:


explain plan for
select max(n1) from t1;

select * from table(dbms_xplan.display);

select
	io_cost, cpu_cost, cost
from	plan_table
where	options = 'FULL'
;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  1280   (3)| 00:00:07 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 40000 |   195K|  1280   (3)| 00:00:07 |
---------------------------------------------------------------------------

   IO_COST   CPU_COST       COST
---------- ---------- ----------
      1251  290857600       1280

The plan shows us that the cost of the query was 1280, with a predicted run-time of 7 seconds.
The secondary query shows us that the total cost was made up of an I/O cost of 1251 and a component derived from a CPU cost of 290,857,600.
I want to answer the following questions:

Where does the 1251 come from ?
How does a CPU cost of 290M turn into an incremental cost of 29 ?
What’s the relationship between a cost of 1280 and 7 seconds ?

The I/O cost:

The system statistics mreadtim and mbrc tell us that we can read 64 blocks in 10 ms, and the table statistics tell us that we have 40,000 blocks to read. So we have to do 625 (40,000 / 64) multiblock reads to read the table, for a total read time of 6,250 ms.

The optimizer presents this result in terms of the number of single block reads that could be performed in the time – which is 1,250 (6,250/5); then it applies the rule dictated by the parameter setting “_table_scan_cost_plus_one” and adds one to get 1,251 (Q.E.D)
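
As a sanity check, here’s the same arithmetic written out as a query (the literals are simply the statistics quoted above; this is not the optimizer’s internal code path):

select
	40000 / 64			multiblock_reads,	-- 625
	(40000 / 64) * 10		read_time_ms,		-- 6,250
	(40000 / 64) * 10 / 5		sread_equivalent,	-- 1,250
	(40000 / 64) * 10 / 5 + 1	io_cost			-- 1,251
from
	dual
;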

The CPU cost

I’m not going to go into details about how Oracle gets a CPU cost of about 290M. There’s a CPU cost for acquiring a block, a cost for finding a row in a block, a cost for finding a column in a row, a cost for doing something with the column, and so on. Suffice to say that for handling my 40,000 rows/blocks the optimizer has estimated 290M “operations”.

The system statistic cpuspeed says that the machine can do 2000 million CPU operations per second; and I need to do 290M, which means I need 290/2000 seconds, or 145 ms of CPU time. Again, though, we convert time to its “single block read” equivalent at 5 ms per read; so my CPU time is equivalent to 29 (145/5) single block reads. And that’s the difference between the I/O cost and the final cost. (You might also note that 29/1280 is 0.023 – which is close enough to the 3% CPU recorded in the Cost column of the plan.)
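
Again as a query, remembering that cpuspeed is recorded in millions of operations per second (the literals are hard-coded from the figures above):

select
	round(290857600 / (2000 * 1e6) * 1000, 1)	cpu_time_ms,	-- 145.4
	round(290857600 / (2000 * 1e6) * 1000 / 5)	cpu_cost	-- 29
from
	dual
;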

Time

How does a cost of 1280 turn into a time of seven seconds ? Actually, we were doing that trick backwards in the previous two sections – the conversion factor is the single block read time. 1280 single block reads at 5 ms per read = 6,400 ms, which rounds up to 7 seconds.
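
Or, as a one-liner:

select
	1280 * 5		time_ms,	-- 6,400
	ceil(1280 * 5 / 1000)	time_sec	-- 7
from
	dual
;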

Parallelism

We can take our first steps towards understanding parallel costs and the impact of maxthr and slavethr even when the parameters haven’t been set. Here are the two sets of results from running my test with the hint /*+ parallel(t1 5) */ and then the hint /*+ parallel(t1 42) */. I picked the value 42 for the second parallel run because the system I was working on had parallel_max_servers set to 40.

First parallel 5:

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   278   (0)| 00:00:02 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 40000 |   195K|   278   (0)| 00:00:02 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 40000 |   195K|   278   (0)| 00:00:02 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

   IO_COST   CPU_COST       COST
---------- ---------- ----------
       278    1333333        278

The IO_COST is 278 compared to the original 1251. There are two factors in the arithmetic: one is the degree of parallelism, the other is an “arbitrary” factor of 90% that the optimizer introduces, presumably in anticipation of some interference between parallel execution slaves. There are several routes through the numbers that the optimizer might actually take, but working backwards we can express the result as ((1251 – 1) / 5 ) / 0.9 + 1. (I’m not entirely sure about the validity of the +1, but the -1/+1 are catering for the “tablescan cost plus one”, and I can’t get complete consistency on whether to round, truncate, or take ceilings in the calculations – but the approximation is good enough to demonstrate the principle)
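
For the record, here’s the working-backwards expression as a query; floor() happens to reproduce the reported cost, but (as I said) I can’t pin down the exact rounding:

select
	floor(((1251 - 1) / 5) / 0.9 + 1)	io_cost_dop5	-- 278
from
	dual
;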

If you look at the CPU cost you’ll see that it’s much less than you might expect – the serial CPU cost was roughly 290M, the parallel 5 cost is just over 1M. We should probably assume that this is because the CPU cost of the direct path reads for parallel execution is considered to be far less than the CPU cost of handling blocks in the buffer cache.

Now look at the plan for parallel 42


----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |    33   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 40000 |   195K|    33   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 40000 |   195K|    33   (0)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

   IO_COST   CPU_COST       COST
---------- ---------- ----------
        33     158730         33

As before we can derive the I/O cost of 33 from the serial I/O cost: ((1251 – 1)/42)/.9 = 33 (ignoring the plus one and rounding, in this case).
If you look at the CPU cost and compare it to the CPU cost of parallel 5 you can see that 158,730 = 5 * 1,333,333 / 42: the CPU costs (once we’ve ignored the buffer caching CPU costs) are scaling with the degree of parallelism.
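
The same back-of-envelope checks as a query (again, the exact rounding is my guess):

select
	floor(((1251 - 1) / 42) / 0.9)	io_cost_dop42,	-- 33
	round(5 * 1333333 / 42)		cpu_cost_dop42	-- 158,730
from
	dual
;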

This parallel 42 plan does highlight a problem, though – my system can’t run parallel 42, it doesn’t have enough slaves, but the optimizer simply applies the arithmetic indicated by the hint. So the question arises – can we supply some statistical information that stops this silly arithmetic appearing ? The answer will appear in part 2 of this topic.

7 Comments

  1. Hi Jonathan,

    Working in a huge DWH environment, I was always intimidated by the costing of parallel execution. Thank you for this simple and clear demonstration of how we can relate the costing of parallel vs. serial queries with respect to system stats.

    Regards,
    Saifee

    Comment by Saifee — June 26, 2013 @ 2:42 am BST Jun 26,2013

  2. […] there hasn’t been a “maxthr – 1”, I called the first part of this series “System Stats”. If you look back at it you’ll see that I set up some system statistics, excluding the maxthr […]

    Pingback by maxthr – 2 | Oracle Scratchpad — June 27, 2013 @ 5:08 pm BST Jun 27,2013

  3. And it’s interesting that CPU costing becomes inaccurate in case either serial direct read or in-memory PQ kicks in, since these are still costed assuming buffered reads (the former) or direct reads (the latter).

    Comment by Alex Fatkulin — June 27, 2013 @ 11:30 pm BST Jun 27,2013

    • Alex,

      Good comment – and one of the reasons I like the pooling of ideas that can happen with blogging.

      I’d be inclined to think it’s not important for serial direct reads – the optimizer chose the table (or index fast full) scan path because, in theory, it was the most cost-effective. If the run-time engine then decides to go direct path the CPU cost is even less than the assumed CPU cost – so Oracle is doing better than predicted. (On the other hand, mechanically an object checkpoint has to complete first, and the optimizer hasn’t allowed for that in its calculations !)

      By comparison, the in-memory PQ is one that really needs a lot of thinking through.

      Comment by Jonathan Lewis — June 28, 2013 @ 11:21 am BST Jun 28,2013

      • I have a case in my prod system, where the normal FTS is faster than direct path (without parallel execution) due to blocks being cached in the buffer cache. I’m not sure if the engine takes cached blocks into consideration when choosing the direct path or the normal path, but I get, with the same plan, different performance depending on which path Oracle chooses. Is there any way to influence Oracle to choose one or the other?
        Cheers

        Pablo

        Comment by Pablo — July 4, 2013 @ 2:51 pm BST Jul 4,2013

        • Pablo,

          The decision to use serial direct path is a run-time decision, not an optimizer one. It is affected by the number of blocks of the segment that are currently in the cache compared to the total size of the segment – Tanel Poder has some comments about this that sum things up nicely.

          It is possible to disable serial direct path reads through the hidden parameter (usual warning applies) “_serial_direct_read” which can take the values true, false, always, never, auto (the default). If you want to do a little more research around the topic a google search on the parameter will turn up a number of interesting articles, for example Alexander Anokhin has an interesting note about the KEEP cache and direct reads.

          Comment by Jonathan Lewis — July 4, 2013 @ 7:18 pm BST Jul 4,2013

  4. […] part 1 of this mini-series we looked at the effects of costing a tablescan serially and then parallel when […]

    Pingback by maxthr – 3 | Oracle Scratchpad — July 3, 2013 @ 6:29 pm BST Jul 3,2013

