Oracle Scratchpad

November 25, 2007

gather_plan_statistics

Filed under: Performance,Statspack,Troubleshooting — Jonathan Lewis @ 7:27 pm GMT Nov 25,2007

I wrote an article some time ago about how 10g gave you a very convenient way to capture run-time information about the work done by a query if you added the /*+ gather_plan_statistics */ hint to your SQL. At the time I pointed out that using this hint could make some queries use a lot more CPU, as they might spend more time collecting the information than they spent doing the actual work of the query.
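
As a quick reminder, the basic usage looks something like this (t1 is just an illustrative table name):

select  /*+ gather_plan_statistics */
        count(*)
from    t1
;

rem  null, null => report on the last statement executed by this session
select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));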

I was reading through a statspack report recently when it crossed my mind that in 10g statspack added O/S statistics to the report – so it ought to be possible to get a rough idea of where all the extra CPU went.

So I ran a little test, using the killer “old connect by” query that you can find in one of my “hit ratio” papers, with session-level statspack snapshots wrapped around it.
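
The harness was nothing sophisticated – something along the following lines, assuming the perfstat schema is installed (old_connect_by.sql is just a placeholder name for a script holding the test query):

rem  Capture my own SID so that the snapshots include session-level detail
column sid new_value m_sid
select sid from v$mystat where rownum = 1;

execute statspack.snap(i_session_id => &m_sid)

@old_connect_by

execute statspack.snap(i_session_id => &m_sid)

rem  Then generate the report for the pair of snapshots just taken
rem  (spreport prompts for the begin and end snapshot ids)
@?/rdbms/admin/spreport

The results were interesting: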

Baseline – query run time 22.28 seconds:

OS Statistics  DB/Inst: D10G/d10g  Snaps: 86-87
-> ordered by statistic type (CPU use, Virtual Memory, Hardware Config), Name         

Statistic                                  Total
------------------------- ----------------------
BUSY_TIME                                  2,406
IDLE_TIME                                  9,288
SYS_TIME                                     112
USER_TIME                                  2,294

With /*+ gather_plan_statistics */ in place – query run time 88.32 seconds:

OS Statistics  DB/Inst: D10G/d10g  Snaps: 87-88
-> ordered by statistic type (CPU use, Virtual Memory, Hardware Config), Name         

Statistic                                  Total
------------------------- ----------------------
BUSY_TIME                                  8,932
IDLE_TIME                                 14,619
SYS_TIME                                   6,639
USER_TIME                                  2,293

As Greg Rahn pointed out to me in a private email when I posted the original article, the performance of the operating system “timer” function can make a big difference to the overhead.

There is a little room for error, but most of the 66-second increase is down to operating system (SYS) time, which rather confirms Greg’s point: the impact you suffer from this feature may be highly dependent on your operating system.

26 Comments

  1. It will be interesting to repeat the same test with 11g.

    11g is supposed to speed up the timer functions significantly with the introduction of the VKTM (virtual keeper of time) process.

    Comment by Christo Kutrovsky — November 26, 2007 @ 3:57 pm GMT Nov 26,2007 | Reply

  2. Christo,
    Good question. I ran the same test on the same box on 11g. Time without collection 16 seconds, time with sampled collection 67 seconds – the AWR report showed that the extra time was almost entirely SYS time.

    I was on a plane when I ran the test so I was not on mains power, which means the CPU was stepped down. The query run time was still quicker because (even with “_old_connect_by_enabled” set to true) 11g used a different mechanism on my query and did only half the consistent gets that 10g did.

    Comment by Jonathan Lewis — November 27, 2007 @ 10:19 pm GMT Nov 27,2007 | Reply

  3. The VKTM time is not used for such timings as its granularity is too low (the VKTM process wakes up every 20ms by default). truss -c will easily confirm that lots of timing syscalls are still being made.

    It looks like VKTM just keeps the “SGA time” variable up to date, which used to be LGWR’s task before 11g.

    By “SGA time” I mean the kswsgtim_ variable, which holds the number of seconds since midnight 1 Jan 1970. Multiple things depend on this variable, such as the taking of timed OS statistics snapshots and the updating of the session longops array. As the kswsgtim_ variable has 1-second granularity, 11g has introduced a new variable, ksudbrmseccnt_, which can have up to microsecond granularity IIRC. Nevertheless, as VKTM wakes up only every 20ms, it can’t have better than 20ms granularity unless you start tweaking the _timer_precision parameter.
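
    You can check the current value with the usual hidden-parameter query against the x$ tables (connected as SYS) – something like:

    select  ksppinm  name, ksppstvl  value
    from    x$ksppi  x, x$ksppcv  y
    where   x.indx = y.indx
    and     ksppinm = '_timer_precision'
    ;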

    Comment by tanelp — December 5, 2007 @ 1:22 am GMT Dec 5,2007 | Reply

  4. Note that if you run truss -c on Solaris you may not see the timing syscalls in the truss output. Recent Solaris versions allow the use of fast-trap syscalls (which are invoked differently from usual syscalls and do not require saving all the registers), so truss doesn’t capture them.

    Comment by tanelp — December 5, 2007 @ 1:41 am GMT Dec 5,2007 | Reply

  5. I’ve noticed that the row source statistics I’m usually most interested in are, above all, the number of actual rows (to perform “Tuning by Cardinality Feedback”) and the actual buffer gets and blocks read from / written to disk.

    These statistics are simply maintained in user space (basic counters) and do not require any system call at all (syscalls are needed only for elapsed time and CPU time), and the system calls are (AFAIK) the expensive part. “AFAIK” because perhaps the latching required to store the stats in the library cache makes a more than marginal contribution to the resource consumption – though I would be surprised if it were not negligible compared to the syscalls.

    I wonder whether it is possible, perhaps by some magic hidden parameter, to enable only the ones I’m interested in and disable the others.

    Comment by Alberto Dell'Era — December 9, 2007 @ 7:23 pm GMT Dec 9,2007 | Reply

  6. I mentioned “_rowsource_statistics_sampfreq” in the previous note. On 10.2.0.1, setting this parameter to zero seems to eliminate a very large fraction of the overhead – although there is still something going on that adds about 20% to my current worst-case example.
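
    If you want to repeat the test, a minimal sketch follows – with the usual caveat that this is a hidden parameter, so experiment on a test system only:

    alter session set "_rowsource_statistics_sampfreq" = 0;

    select  /*+ gather_plan_statistics */
            count(*)
    from    t1
    ;

    rem  check the effect on A-Rows, Buffers and the timings in the plan output
    select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'));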

    Comment by Jonathan Lewis — December 10, 2007 @ 10:33 am GMT Dec 10,2007 | Reply

  7. Yes, but “_rowsource_statistics_sampfreq” controls the sampling (as far as I have understood, it collects statistics for only one in every _rowsource_statistics_sampfreq starts); I was looking for something to selectively enable only the stats I’m interested in (mainly the number of actual rows), which do not require any syscall.

    E.g. I would prefer to have all starts accounted for, but only for the “A-Rows” statistic.

    Comment by Alberto Dell'Era — December 10, 2007 @ 12:49 pm GMT Dec 10,2007 | Reply

  8. Alberto, I haven’t looked at this very closely – i.e. tried to make it break – but my memory is that when I was experimenting with the sample frequency to see how much impact it had, it was only the TIME that changed – figures like Starts, A-Rows and Buffers didn’t seem to be affected by changing the frequency.

    It’s possible that the sample frequency is described badly in the view definition and the sampling only applies to the timing.

    Comment by Jonathan Lewis — December 12, 2007 @ 11:06 am GMT Dec 12,2007 | Reply

  9. I ran a few tests against Oracle10 and Oracle11 on a Linux machine, and indeed, Oracle11 makes fewer calls to gettimeofday, times and getrusage with the same statistics level settings.

    I ran a test with a simple full table scan (with asynch IO off) and, looking at the strace (truss) output, I noticed something else. Oracle 10 and earlier uses the traditional readv (reading consecutive data blocks into scattered memory buffers) for multiblock reads, but Oracle11 uses pread for multiblock reads in the exact same test case. Moreover, I could not get Oracle11 to use readv for any FTS size. So it seems that the ‘scattered read’ (readv) is replaced by multiblock reads into adjacent memory buffers. Indeed, pread typically has a lower OS overhead than readv for the same number of bytes read, so the switch makes sense.

    Oracle is apparently using the LRU end a little more cleverly: full scans go into a part of the LRU that maps to a single memory region instead of scattered memory blocks. Is anyone else able to confirm this for full scans on Oracle11 on other platforms?

    Comment by Jan-Marten Spit — December 12, 2007 @ 4:29 pm GMT Dec 12,2007 | Reply

  10. Jan-Marten,
    It’s possible that your tablescan performed serial “direct path read” calls (i.e. reading into the process memory) rather than “db file scattered read” calls (i.e. reading into the buffer cache).
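
    One quick check is to compare the relevant session statistics before and after the tablescan – for example:

    select  sn.name, ms.value
    from    v$mystat ms, v$statname sn
    where   sn.statistic# = ms.statistic#
    and     sn.name in (
                'physical reads direct',
                'table scans (direct read)',
                'table scans (long tables)'
            )
    ;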

    Comment by Jonathan Lewis — December 12, 2007 @ 6:09 pm GMT Dec 12,2007 | Reply

  11. jonathan,

    Repeating the test and looking at the session statistics, the FTS reads in 11g are indeed direct, and the statistics also show PGA memory being allocated. It simply did not occur to me that 11g could use direct reads in this case. There is no PX involved, so what can be causing the direct reads?
    Is 11g finding that the IO subsystem ‘is slow’ and switching to direct path (whereas 10g is not)? The two RAID1 disks used are 5 years old and reporting ~3.4 ms for the multiblock reads involved…

    Comment by Jan-Marten Spit — December 12, 2007 @ 9:44 pm GMT Dec 12,2007 | Reply

  12. I’ve been going through your performance data collection exercise and have realized that, in some cases, I am not sure how to account for the time difference between the query total elapsed time and the query CPU time. I used some pretty traditional tools (in order to relate their performance counters to ‘gather_plan_statistics’). Here are my results for an insert statement:

    =================
    From tkprof:
    
    insert /*+ append*/ into part select * from p_et
    
    
    call     count       cpu    elapsed       disk      query    current        rows
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    Parse        1      0.05       0.14          0        861          0           0
    Execute      1     73.14     117.68         21       5011     393311    20000000
    Fetch        0      0.00       0.00          0          0          0           0
    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
    total        2     73.19     117.82         21       5872     393311    20000000
    
    
    Elapsed times include waiting on following events:  
    
      Event waited on                             Times   Max. Wait  Total Waited
      ----------------------------------------   Waited  ----------  ------------
      control file sequential read                    4        0.00          0.00
      db file sequential read                        21        0.01          0.03
      direct path write                               4        0.00          0.00
      SQL*Net message to client                       1        0.00          0.00
      SQL*Net message from client                     1        0.00          0.00
                                   
    ===========
    From V$SQL,  the same query:
                                   
    ELAPSED_TIME     118028570
    CPU_TIME          73275861
    USER_IO_WAIT_TIME    34755
    ===============================
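
    (The v$sql figures above came from a query along these lines, using the sql_id of the insert statement:)

    select  elapsed_time, cpu_time, user_io_wait_time
    from    v$sql
    where   sql_id = '&m_sql_id'    -- sql_id of the insert statement
    ;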
    

    As you can see, the times extracted from the ‘10046’ query trace file and the v$sql statistics are pretty close. What I am wondering about is whether there is a way to find out what the session was doing during the 44 unaccounted-for seconds. I realize that Oracle might have spent the time on, perhaps, flushing db buffers to disk and such, but is there a way to know exactly what the time was spent on? The session surely is waiting for something, but for what? The trace does not contain any wait events that would account for it. If you have any insight, it’ll surely be appreciated.

    Thanks !

    Comment by Val Carey — December 13, 2007 @ 2:16 am GMT Dec 13,2007 | Reply

  13. Val,

    It’s not possible to make a solid statement about where the lost time went – there may simply be some uninstrumented waits going on. Have you read through the trace file itself to check for any clues?

    One possibility is your run-state. You have 20,000,000 rows inserted with only 4 direct path writes and 21 single-block reads!

    Possibly you did a lot more direct I/O (reads for a tablescan, writes for the append) which turned into asynch I/O, which consumed a lot of CPU. Your lost time might be your process being runnable but not running because it had used up its time-slice and the operating system had descheduled it and was allowing the asynch drivers to do their thing.

    You could check what’s going on in v$osstat for the OS_CPU_WAIT_TIME while this insert was running. I’m not sure it’s 100% meaningful (see https://jonathanlewis.wordpress.com/2007/11/01/awr-oddity/#comments ) but it may give you a clue.
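
    For example (take the figures just before and just after the insert and look at the deltas):

    select  stat_name, value
    from    v$osstat
    where   stat_name in (
                'BUSY_TIME', 'IDLE_TIME', 'SYS_TIME', 'USER_TIME',
                'OS_CPU_WAIT_TIME'      -- not available on every platform
            )
    ;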

    You could also use something like ‘top’ at the operating system level to see where the CPU time is going.

    Comment by Jonathan Lewis — December 13, 2007 @ 8:52 am GMT Dec 13,2007 | Reply

  14. Jonathan,

    I did look through the raw trace file, but apart from the small number of waits that tkprof had already reported there was nothing else. The active session history snapshots always showed ‘ON CPU’ with no other waits.

    The 20 million rows were coming from an external table ‘p_et’ (a flat file 2 GB in size), which explains why there were so few reads. The file itself was most likely heavily cached in memory due to previous runs, so there was almost no IO judging by iostat. Essentially, as I understand it, the insert amounted to moving memory blocks between the OS file cache and the SGA (db buffer cache).

    The timeslice preemption is a good idea, but at the time there was no other process running on the four-cpu 8GB box, so it’s unlikely that the 44 seconds can be explained by that.

    I have encountered quite a few situations like that where I could not explain a similar difference, but this one is easily reproducible and ‘clean’ in the sense that almost no IO was happening and no concurrent activity was going on, and yet substantial time was spent!

    Thanks.

    Comment by Val Carey — December 13, 2007 @ 3:29 pm GMT Dec 13,2007 | Reply

  15. Val,

    I just ran a quick test.
    No waits of any kind were recorded for reading from the filesystem – and there really should have been some, even if they were virtually zero. Just 2 waits for direct path writes, although 10,000 blocks were written by direct path.

    The former could be indicative of an uninstrumented wait, the latter of a possible cause of highly concurrent CPU usage as the async processes take over the writes.

    I was getting close to suggesting that 2GB at 50MB/s is roughly the 44 seconds – but that doesn’t match your comment about iostat :(

    Comment by Jonathan Lewis — December 13, 2007 @ 5:20 pm GMT Dec 13,2007 | Reply

  16. Jonathan

    I’ve rerun the test, more carefully this time, and the results indicate that you are probably right about the missing time being caused by reading from the filesystem, even though those waits are not traced.

    When I make sure the file is indeed cached in memory, by measuring only the runs after the first one, the CPU and elapsed times are almost the same:

    From V$SQL:

    ELAPSED_TIME 59704747
    CPU_TIME 59704231
    USER_IO_WAIT_TIME 356

    When I said that iostat was not showing any read activity, I was apparently watching the subsequent runs, not the first one, due to an error in my script. Still, even on the first run, the filesystem read waits are not accounted for at all in the trace file and only partially in the V$SQL counters, as can be seen in my first message.

    Thank you.

    Val

    Comment by Val Carey — December 13, 2007 @ 9:00 pm GMT Dec 13,2007 | Reply

  17. Jan-Marten,
    Oracle has been able to do serial direct reads for a long time (the parameter _serial_direct_read exists even in 8i) – but 11g may be the first time that the feature is enabled by default (even though the parameter is set to false in 11g).

    I haven’t investigated what triggers the choice – but there must be some type of cost-based decision since any direct reads have to start with an object checkpoint to write any dirty blocks for that object back to disc before the read starts.
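
    If you want to experiment on a test system, the parameter can be switched at the session level – a minimal sketch (it’s a hidden parameter, so the usual warnings apply):

    alter session set "_serial_direct_read" = true;

    select count(*) from t1;    -- a tablescan that may now use serial direct path reads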

    Comment by Jonathan Lewis — December 14, 2007 @ 7:54 am GMT Dec 14,2007 | Reply

  18. Hi,

    I compared the read performance of 10g and 11g some weeks ago and, more or less by accident, found the behaviour discussed here.
    An FTS on a big table answered after 3 minutes in 10g but after 1 minute in 11g due to the direct path reads. Executing the statement a second time, it answered after 1 second in 10g because the blocks were cached, but again after 1 minute in 11g. At first glance it seemed to be strange behaviour, if not a bug, but probably it is a feature to avoid flushing the buffer cache with large FTSs.

    I performed some tests to try to influence the behaviour. I tried switching off parallel execution, setting _serial_direct_read=false and changing db_file_multiblock_read_count, without any result. What I haven’t tested yet is changing parameters that influence the file IO, like filesystemio_options and so on.

    Then I changed the size of the table and observed that the turning point is at approximately 10% of the buffer cache. If the table is bigger than that, an FTS does direct path reads, otherwise db file scattered reads.
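
    A quick way to see which side of that threshold a given table falls is to compare its segment size with the size of the buffer cache – e.g. (the owner and table name are just examples):

    select  blocks, bytes
    from    dba_segments
    where   owner        = 'TEST_USER'
    and     segment_name = 'BIG_TABLE'
    ;

    select  component, current_size
    from    v$sga_dynamic_components
    where   component = 'DEFAULT buffer cache'
    ;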

    Frank Haney

    Comment by Frank Haney — March 19, 2008 @ 8:36 am GMT Mar 19,2008 | Reply

  19. Frank,

    Thanks, it’s always useful to have observations like this about new behaviour.

    Comment by Jonathan Lewis — March 19, 2008 @ 10:46 am GMT Mar 19,2008 | Reply

  20. […] Heisenberg Filed under: Execution plans, Troubleshooting — Jonathan Lewis @ 9:14 pm UTC Apr 26,2007 [See also: gather plan statistics] […]

    Pingback by Heisenberg « Oracle Scratchpad — July 1, 2008 @ 7:13 am BST Jul 1,2008 | Reply

  21. […] this point, we could consider running the query again with (say) the /*+ gather_plan_statistics */ hint in place, and in another 22 hours or so we would be able to compare estimates with actuals to see […]

    Pingback by Cardinality feedback « Oracle Scratchpad — May 11, 2009 @ 7:32 pm BST May 11,2009 | Reply

  22. […] time ago – actually a few years ago – I wrote a note about the hint /*+ gather_plan_statistics */ making some informal comments about the implementation and relevant hidden parameters. I’ve […]

    Pingback by gather_plan_statistics – 2 | Oracle Scratchpad — May 22, 2013 @ 12:40 pm BST May 22,2013 | Reply

  23. […] Gathering the execution statistics does incur some overhead. The amount of overhead is dependent on the operating system. Jonathan Lewis discusses this in more detail on his blog. […]

    Pingback by Diferencia entre cardinalidad estimada y real: hint GATHER_PLAN_STATISTICS | sersanso — July 30, 2015 @ 10:15 am BST Jul 30,2015 | Reply

  24. […] of small operations the times reported can be subject to fairly large rounding errors and that enabling the measurement can increase the execution time by a factor of three or four. It’s perfectly feasible that this half of the query is actually […]

    Pingback by Case Study | Oracle Scratchpad — September 30, 2018 @ 7:59 pm BST Sep 30,2018 | Reply

  25. […] There seems to be a certain performance overhead. See for instance Jonathan’s blog. […]

    Pingback by GATHER_PLAN_STATISTICS does does not generate basic plan statistics — November 2, 2022 @ 8:38 am GMT Nov 2,2022 | Reply

  26. […] and execution plan in detail by adding a predicate “rownum <= 12” re-executing with rowsource execution stats enabled, and pulling the plan including the Query Block Names and Object Aliases, Outline Data, […]

    Pingback by Case Study | Oracle Scratchpad — September 5, 2023 @ 11:10 pm BST Sep 5,2023 | Reply

