Oracle Scratchpad

August 27, 2019

AW-argh

Filed under: AWR,Oracle,Troubleshooting,Upgrades — Jonathan Lewis @ 3:59 pm BST Aug 27,2019

This is another of the blog notes that have been sitting around for several years – in this case since May 2014, based on a script I wrote a year earlier. It makes an important point about “inconsistency” of timing in the way that Oracle records statistics of work done. As a consequence of being first drafted in May 2014 the original examples showed AWR results from 10.2.0.5 and 11.2.0.4 – I’ve just run the same test on 19.3.0.0 to see if anything has changed.

 

[Originally drafted May 2014]: I had to post this as a reminder of how easy it is to forget things – especially when there are small but significant changes between versions. It’s based loosely on a conversation from Oracle-L, but I’m going to work the issue in the opposite order by running some code and showing you the ongoing performance statistics rather than the usual AWR approach of reading the performance stats and trying to guess what happened.

The demonstration needs two sessions to run; it’s based on one session running some CPU-intensive SQL inside an anonymous PL/SQL block with a second another session launching AWR snapshots at carefully timed moments. Here’s the code for the working session:

rem
rem     Script:         awr_timing.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2013
rem

alter session set "_old_connect_by_enabled"=true';

create table kill_cpu(n, primary key(n))
organization index
as
select  rownum n
from    all_objects
where   rownum <= 26 -- > comment to avoid wordpress format issue
;

execute dbms_stats.gather_table_stats(user,'kill_cpu')

pause Take an AWR snapshot from another session and when it has completed  press return

declare
        m_ct    number;
begin

        select  count(*) X
        into    m_ct
        from    kill_cpu
        connect by
                n > prior n
        start with
                n = 1
        ;

        dbms_lock.sleep(30);

end;
/

You may recognise an old piece of SQL that I’ve often used as a way of stressing a CPU and seeing how fast Oracle can run. The “alter session” at the top of the code is necessary to use the pre-10g method of running a “connect by” query so that the SQL does a huge number of buffer gets (and “buffer is pinned count” visits). On my current laptop the query takes about 45 seconds (all CPU) to complete. I’ve wrapped this query inside a pl/sql block that then sleeps for 30 seconds.

From the second session you need to launch an AWR snapshot 4 times – once in the pause shown above, then (approximately) every 30 seconds thereafter. The second one should execute while the SQL statement is still running, the third one should execute while the sleep(30) is taking place, and the fourth one should execute after the pl/sql block has ended and the SQL*Plus prompt is visible.

Once you’ve got 4 snapshots you can generate 3 AWR reports. The question to ask then, is “what do the reports say about CPU usage?” Here are a few (paraphrased) numbers – starting with 10.2.0.5 comparing the “Top 5 timed events”, “Time Model”, and “Instance Activity” There are three sets of figures, the first reported while the SQL was still running, the second reported after the SQL statement had completed and the dbms_lock.sleep() call is executing, the last reported after the PL/SQL block has completed. There are some little oddities in the numbers due to backgorund “noise” – but the key points are still clearly visible:

While the SQL was executing

Top 5
-----
CPU Time                       26 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     26.9        100.0
DB CPU                                       26.2         97.6

Instance Activity
-----------------
CPU used by this session         0.65 seconds
recursive cpu usage              0.67 seconds

SQL ordered by CPU
------------------
31 seconds reported for both the SQL and PLSQL

During the sleep()

Top 5
-----
CPU Time                        19 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     19.0        100.0
DB CPU                                       18.6         98.1

Instance Activity
-----------------
CPU used by this session         0.66 seconds
recursive cpu usage             44.82 seconds

SQL ordered by CPU
------------------
14 seconds reported for both the SQL and PLSQL

After the PL/SQL block ended

Top 5
-----
CPU Time                         1 second

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                      1.4         99.9
DB CPU                                        1.4         99.7

Instance Activity
-----------------
CPU used by this session        44.68 seconds
recursive cpu usage              0.50 seconds

SQL ordered by CPU
------------------
1 second reported for the PLSQL, but the SQL was not reported

Points to notice:

While the SQL was excecuting (and had been executing for about 26 seconds, the Time Model mechanism was recording the work done by the SQL, and the Top N information echoed the Time model CPU figure. At the same time the “CPU used …” Instance Activity Statistics have not recorded any CPU time for the session – and they won’t until the SQL statement completes. Despite this, the “SQL ordered by …” reports double-count in real-time, show the SQL and the PL/SQL cursors as consuming (with rounding errors, presumably) 31 seconds each.

After the SQL execution was over and the session was sleeping the Time model (hence the Top 5) had recorded a further 19 seconds of work. The instance activity, however, has now accumulated 44 seconds of CPU, but only as “recursive CPU usage” (recursive because our SQL was called from with a PL/SQL block), with no “CPU used by this session”. The “SQL ordered by …” figures have recorded the amount of CPU used by both the SQL and PL/SQL in the interval (i.e. 14 seconds – which is a little off) recorded against both.)

After the PL/SQL block has completed the Time Model and the Top 5 report both say that nothing much happened in the interval, but the Instance Activity suddenly reports 44.68 seconds of CPU used by this session – which (roughly speaking) is truish as the PL/SQL block ended and assigned the accumulated recursive CPU usage to the session CPU figure. Finally, when we get down to the “SQL ordered by CPU” the SQL was not reported  – it did no work in the interval – but the PL/SQL block was still reported but only with a generous 1 second of CPU since all it did in the interval was finish the sleep call and tidy up the stack before ending.

Now the same sets of figures for 11.2.0.4 – there’s a lot of similarity, but one significant difference:

While the SQL was executing


Top 5
-----
CPU Time                        26.6 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     27.0        100.0
DB CPU                                       26.6         98.5

Instance Activity
-----------------
CPU used by this session         1.09 seconds
recursive cpu usage              1.07 seconds

SQL ordered by CPU
------------------
25.6 seconds reported for both the SQL and PLSQL

During the sleep()

Top 5
-----
CPU Time                        15.1 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                     15.3         99.8
DB CPU                                       15.1         98.2

Instance Activity
-----------------
CPU used by this session        41.09 seconds
recursive cpu usage             41.03 seconds

SQL ordered by CPU
------------------
14.3 seconds reported for the SQL
13.9 seconds reported for the PLSQL

After the PL/SQL block ended

Top 5
-----
CPU Time                         1.4 seconds

Time Model                               Time (s) % of DB Time
------------------------------------------------- ------------
sql execute elapsed time                      1.5         99.6
DB CPU                                        1.4         95.4

Instance Activity
-----------------
CPU used by this session         1.02 seconds
recursive cpu usage              0.95 seconds

SQL ordered by CPU
------------------
0.5 seconds reported for the PLSQL, and no sign of the SQL

Spot the one difference in the pattern – during the sleep() the Instance Activity Statistic “CPU used by this session” is recording the full CPU time for the complete query, whereas the time for the query appeared only in the “recursive cpu” in the 10.2.0.5 report.

I frequently point out that for proper understanding of the content of an AWR report you need to cross-check different ways in which Oracle reports “the same” information. This is often to warn you about checking underlying figures before jumping to conclusions about “hit ratios”, sometimes it’s to remind you that while the Top 5 might say some average looks okay the event histogram may say that what you’re looking at is mostly excellent with an occasional disaster thrown in. In this blog note I just want to remind you that if you only ever look at one set of figures about CPU usage there are a few special effects (particularly relating to long running PL/SQL / Java / SQL) where you may have to work out a pattern of behaviour to explain unexpectedly large (or small) figures and contradictory figures, The key to the problem is recognising that different statistics may be updated at different stages in a complex process.

Footnote

I doubt if many people still run 11.2.0.4, so I also re-ran the test on 19.3.0.0 before publishing. The behaviour hasn’t changed since 11.2.0.4 although the query ran a little faster, perhaps due to changes in the mechanisms for this type of “connect by pump”.

11.2.0.4 stats


Name                                            Value
----                                            -----
session logical reads                      33,554,435
consistent gets                            33,554,435
consistent gets from cache                 33,554,435
consistent gets from cache (fastpath)      33,554,431
no work - consistent read gets             33,554,431
index scans kdiixs1                        33,554,433
buffer is not pinned count                 16,777,219


19.3.0.0 stats

Name                                            Value
----                                            -----
session logical reads                      16,843,299
consistent gets                            16,843,299
consistent gets from cache                 16,843,299
consistent gets pin                        16,843,298
consistent gets pin (fastpath)             16,843,298
no work - consistent read gets             16,790,166
index range scans                          33,554,433
buffer is not pinned count                 16,790,169

Some changes are trivial (like the change of name for “index scans kdiixs1”) some are interesting (like some gets not being labelled as “pin” and “pin (fastpath)”), some are baffling (like how you can manage 33M index range scans while doing only 16M buffer gets!)

June 9, 2019

CPU percent

Filed under: 12c,AWR,Oracle — Jonathan Lewis @ 2:31 pm BST Jun 9,2019

A recent post on the ODC General Database forum asked for an explanation of the AWR report values “%Total CPU” and “%Busy CPU” under the “Instance CPU” label, and how the “%Busy CPU “ could be greater than 100%.  Here’s a text reproduction of the relevant sample supplied:

Host CPU

CPUs Cores Sockets Load Average Begin Load Average End %User %System %WIO %Idle
2 2 1 0.30 1.23 10.7 5.6 5.3 77.7

Instance CPU

%Total CPU %Busy CPU %DB Time waiting for CPU (Resource Manager)
29.8 133.8 0.0

The answer is probably “It’s 12.1 and it’s a programmer error”.

  • Note that the Host CPU %Idle is not consistent with the three usage figures:  10.7 + 5.6 + 5.3 = 21.6 whereas 100 – 77.7 = 22.3.
  • So let’s run with 22.3% and see what else we can notice: 29.8 / 22.3 = 1.3363 – that’s pretty close (when expressed as a percentage) to 133.8%

Hypothesis:

Someone did the division the wrong way round when trying to work out the percentage of the host’s non-idle CPU that could be attributed to the instance. In this example the “%Busy CPU” should actually report 100 * 22.3 / 29.8 = 74.8%

Note – the difference between 133.8 and 133.63 can be attributed to the fact that the various figures reported in this bit of the AWR are rounded to the nearest 1 decimal place.

Note 2 – I don’t think this error is present in 11.2.0.4 or 12.2.0.1

 

 

 

 

 

September 24, 2014

Lock Time

Filed under: AWR,Locks,Oracle,Statspack — Jonathan Lewis @ 12:42 pm BST Sep 24,2014

Here’s a little detail I was forced to re-learn yesterday; it’s one of those things where it’s easy to say “yes, obviously” AFTER you’ve had it explained so I’m going to start by posing it as a question. Here are two samples of PL/SQL that using locking to handle a simple synchronisation mechanism; one uses a table as an object that can be locked, the other uses Oracle’s dbms_lock package. I’ve posted the code for each fragment, and a sample of what you see in v$lock if two sessions execute the code one after the other:

Table locking – the second session to run this code will wait for the first session to commit or rollback:


begin
        lock table t1 in exclusive mode;
end;
/

ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
00007FF409E57BF8 00007FF409E57C58         15 TM     157778          0          0          6         65          0
00007FF409E57BF8 00007FF409E57C58        125 TM     157778          0          6          0         91          1

Using dbms_lock.

variable m_handle       varchar2(255);

declare
        n1              number;
begin

        dbms_lock.allocate_unique(
                lockname        => 'Synchronize',
                lockhandle      => :m_handle
        );

        dbms_output.put_line(:m_handle);

        n1 := dbms_lock.request(
                lockhandle              => :m_handle,
                lockmode                => dbms_lock.x_mode,
                timeout                 => dbms_lock.maxwait,
                release_on_commit       => true         -- the default is false !!
        );

        dbms_output.put_line(n1);

end;
/

ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
000000008ED8F738 000000008ED8F790         15 UL 1073742427          0          0          6         42          0
000000008ED902B0 000000008ED90308        125 UL 1073742427          0          6          0        103          1

The big question is this – although the two code fragments produce the same effects in terms of lock waits and the reports from v$lock, what’s the big difference in the way that they are reported in the AWR report.

The high-level difference appears in the Time Model stats. Here are two extracts showing the difference:

Using dbms_lock.


Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time                                 65.4         99.9
PL/SQL execution elapsed time                            63.8         97.4

-> Captured SQL account for    2.8% of Total DB Time (s):              65
-> Captured PL/SQL account for   99.4% of Total DB Time (s):              65

        Elapsed                  Elapsed Time
        Time (s)    Executions  per Exec (s)  %Total   %CPU    %IO    SQL Id
---------------- -------------- ------------- ------ ------ ------ -------------
            63.7              1         63.72   97.3     .0     .0 10u1qbw4a27sp
Module: SQL*Plus
declare n1 number; begin dbms_lock.allocate_unique
( lockname => 'Synchronize', lockhandle
 => :m_handle ); dbms_output.put_line(:m_handle); n
1 := dbms_lock.request( lockhandle => :m_handle,

Table locking method:

Time Model Statistics

Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time                                 95.5         99.9
DB CPU                                                    0.9           .9
parse time elapsed                                        0.1           .1
hard parse elapsed time                                   0.1           .1
PL/SQL execution elapsed time                             0.1           .1

SQL ordered by Elapsed Time

-> Captured SQL account for   99.6% of Total DB Time (s):              96
-> Captured PL/SQL account for   98.7% of Total DB Time (s):              96

        Elapsed                  Elapsed Time
        Time (s)    Executions  per Exec (s)  %Total   %CPU    %IO    SQL Id
---------------- -------------- ------------- ------ ------ ------ -------------
            93.9              1         93.88   98.3     .0     .0 8apkdghttmndx
Module: SQL*Plus
begin lock table t1 in exclusive mode; end;

            93.9              1         93.88   98.3     .0     .0 29fwr53agvbc0
Module: SQL*Plus
LOCK TABLE T1 IN EXCLUSIVE MODE

The time spent waiting for the table lock is reported purely as SQL execution time in the Time Model stats; but the time spent waiting for the user-defined lock is reported as SQL execution time AND as PL/SQL execution time. I had forgotten this yesterday so, as I skipped through the various headline figures of an hourly snapshot, I was amazed to see the Time Model stats reporting 33,000 seconds of PL/SQL and 66,000 seconds of SQL – how on earth do you manage to do that much PL/SQL on any Oracle system. (To add to the embarrassment, it had only been a few moments earlier that I’d gone through the Top 5 Timed Events and said something like: “you can ignore all time spent on ‘enq: UL – contention’, it’s probably just synchronisation code”.

In this case the SQL ordered by Elapsed Time gives you a strong visual clue about what’s going on – but it won’t always be that obvious.

Bottom Line:

PL/SQL execution time includes the time spent waitng for UL locks, don’t forget that you may need to subtract wait time for ‘enq: UL – contention’ from the PL/SQL time before you start to worry about how much actual work you’re doing in PL/SQL.

 

June 25, 2014

AWR thoughts

Filed under: AWR,Oracle,Statspack — Jonathan Lewis @ 3:35 pm BST Jun 25,2014

It’s been a week since my last posting – so I thought I’d better contribute something to the community before my name gets lost in the mists of time.

I don’t have an article ready for publication, but some extracts from an AWR report appeared on the OTN database forum a few days ago, and I’ve made a few comments on what we’ve been given so far (with a warning that I might not have time to follow up on any further feedback). I tried to write my comments in a way that modelled the way I scanned (or would have scanned) through the reporting – noting things that caught my attention, listing some of the guesses and assumptions I made as I went along.  I hope it gives some indication of a pattern of thinking when dealing with a previously unseen AWR report.

 

 

November 30, 2012

v$sqlstats

Filed under: AWR,Bugs,Oracle,Troubleshooting — Jonathan Lewis @ 5:45 pm BST Nov 30,2012

Sometimes you find bugs on MOS (Metalink, OCIS, whatever) that make you feel positively ill. I’ve just been on a customer site where (in passing) they mentioned that one of their historic queries against v$sqlstats now took just over one second (CPU) in 11.2.0.3 when it had previously taken about 200ms on 10.2.0.4[1]. After a little checking it seemed likely that the change was possibly related to the fact that they had increased the size of the SGA significantly, allowing for a much larger shared pool and library cache; however there have been numerous code changes in the shared pool area on the route from 10g to 11g so I decided to check MOS to see if anyone else had seen a similar problem. I found this:

(more…)

February 10, 2012

11.2 New views

Filed under: AWR,Oracle,Troubleshooting — Jonathan Lewis @ 7:55 am BST Feb 10,2012

It looks like Laimutis Nedzinskas has spent some time producing  a list of new views, and changes to existing views, for trouble-shooting specialists. The lists cover the dynamic performance views and the AWR history views, and the differences between 10.2 and 11.2. The lists include, where relevant, convenient hyperlinks to the view definitions in the online copy of  the 11g Server Reference manual.

I haven’t checked the lists in detail, but I can see myself  taking advantage of his efforts from time to time:

February 23, 2011

AWR Reports

Filed under: AWR,Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 6:51 pm BST Feb 23,2011

A little thought for those who use OEM to generate AWR reports – if you generate the standard HTML form and then try posting it into (say) the OTN database forum when you want some help interpreting it, it won’t look very readable (even if you do remember to put the right tags before and after to reproduce it in fixed font).
(more…)

April 14, 2010

Analysing Statspack 12

Filed under: AWR,Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 6:54 pm BST Apr 14,2010

[Further Reading on Statspack]

Part 12 is about a 2-node RAC system.  Someone recently posted links to a matched pair of AWR reports in this thread on the OTN Database General forum and, after I had made a couple of quick comments on them, gave me permission to use them on my blog.
(more…)

July 10, 2009

Statspack Skills

Filed under: AWR,Oracle,Performance,Statspack,Troubleshooting — Jonathan Lewis @ 7:28 pm BST Jul 10,2009

I had a great time a couple of weeks ago at the  UKOUG meeting of the DBMS SIG (reported here by Coskan Gundogar). The range of presentations was good and I had a number of interesting conversations.

Of course, the exciting part for me was sitting down with a batch of Statspack and AWR reports that I had been supplied with in the previous couple of days and doing a “real-time” analysis of them.

(more…)

November 19, 2006

AWR Dilemma

Filed under: AWR,Infrastructure,Oracle — Jonathan Lewis @ 6:25 pm BST Nov 19,2006

Here’s an odd little observation that came out of the UKOUG conference last week:

If you’re running 10g, and don’t have the relevant performance pack and diagnostic licences there are various views you are not allowed to query, packages you are not allowed to run, and reports you are not allowed to read.

To support the performance diagnostics, though, your database is probably churning away constantly dumping all sorts of useful data into the memory structures and tables that you are not allowed to access.

But that’s no longer a problem – it looks as if you can disable the AWR collection very easily in 10.2 with the following procedure call:

execute dbms_workload_repository.modify_snapshot_settings(interval => 0)

There’s just one problem with this – you can’t run this package unless you’re licensed to run the AWR. So if you want to disable the AWR, it looks like you have to pay to run it first.

Powered by WordPress.com.