Oracle Scratchpad

January 29, 2018

Case Study – 1

Filed under: Oracle,Statspack,Troubleshooting — Jonathan Lewis @ 10:45 am BST Jan 29,2018

It has been some time since I wrote an article walking through the analysis of information on an AWR report, but a nice example appeared a few weeks ago on Twitter that broke a big AWR picture into a sequence of bite-sized chunks that made a little story, so here it is, replayed in sync with my ongoing thoughts. The problem started with the (highly paraphrased) question – “How could I get these headline figures when all the ‘SQL ordered by’ sections of the report show that the captured SQL accounts for 0.0% of Total?”. The report was a 30 minute snapshot from 11.2.0.3, and here’s the Load Profile:

As you can see, the database was doing quite a lot of work as measured by the physical reads and writes, the number of block changes and the size of redo, and the fact that the average CPU usage by the foreground sessions in this instance accounts for 9.5 CPUs. Generally speaking the 49 seconds per second of DB time is also a fairly good hint, when combined with the other numbers, that the instance is busy but, in the absence of any other information, that could be one session holding a blocking lock with 48 other sessions waiting for the lock to be released.
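(A footnote for anyone who wants to check the arithmetic rather than take the report’s word for it: the “DB time per second” figure in the Load Profile is simply the growth of the ‘DB time’ statistic between the two snapshots divided by the length of the interval, so you can recompute it from the AWR history views. What follows is only a minimal sketch – it assumes a single instance, ignores dbid and instance_number for brevity, requires the Diagnostics Pack licence that the DBA_HIST views demand, and the bind variable names :begin_snap and :end_snap are just placeholders for the report’s snapshot ids.)

rem     Sketch: recompute "DB time (s) per second" for a snapshot pair.
rem     Time model values are in microseconds, cumulative since startup,
rem     so this only makes sense if the instance didn't restart mid-interval.

select
        round(
                (e.value - b.value) / 1e6 /
                (
                        ( cast(se.end_interval_time as date)
                        - cast(sb.end_interval_time as date)
                        ) * 86400
                ),
                1
        )       db_time_per_sec
from
        dba_hist_sys_time_model b,
        dba_hist_sys_time_model e,
        dba_hist_snapshot       sb,
        dba_hist_snapshot       se
where
        b.snap_id   = :begin_snap
and     e.snap_id   = :end_snap
and     b.stat_name = 'DB time'
and     e.stat_name = 'DB time'
and     sb.snap_id  = b.snap_id
and     se.snap_id  = e.snap_id
;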

There are a couple of unusual (though not impossible) features to this profile. Given the amount of work, the figures for user calls and executes look very small – again not technically impossible, just unlikely in “normal” database processing given the apparent workload – and bear in mind that the “SQL ordered by ..” figures seem to account for none of the workload. Similarly the figures for logical reads and block changes are a little unusual (for “normal” processing) – on average this instance seems to have modified every block it visited (without any of that work showing up in the captured SQL).

Next in the set of images was the Time Model:

As you can see, the foreground time (DB time) is 85,944 seconds, of which foreground CPU usage (DB CPU) is 16,735 seconds – leaving about 69,000 seconds unaccounted for! There’s virtually no time spent on PL/SQL or SQL, and rman doesn’t even make an appearance (I mention rman specifically because there was at least one version of Oracle where the rman time was accidentally lost from this summary).
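(Again, just a sketch rather than part of the original report: the same breakdown is visible on a live instance through v$sys_time_model – values are microseconds, cumulative since instance startup – so a quick look at the headline figures without waiting for an AWR report might be something like this.)

select  stat_name, round(value/1e6,1) seconds
from    v$sys_time_model
where   stat_name in (
                'DB time', 'DB CPU',
                'sql execute elapsed time',
                'PL/SQL execution elapsed time',
                'RMAN cpu time (backup/restore)'
        )
order by
        value desc
;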

So what do the Top Timed Events look like:

It’s nice to see that this is consistent with the Load Profile: the DB CPU matches, and there are a lot of physical reads (and a quick check says that 6,560,642/1,800 = 3,644 … which is pretty close to the 3,746 physical reads per second in the Load Profile). There’s one major anomaly here, though: the huge number of (and time spent on) latch: row cache objects waits. Even though it’s not the single largest component of time it’s the most obvious one to pursue, so the next bit of the report to check is the Dictionary Cache Statistics, with the Tablespace IO Stats and Segments by Physical Reads to follow. I don’t have an image for the dictionary cache stats, but the answer to the question “What were all the rowcache object gets for?” was: “dc_tablespaces (214,796,434)” – which (probably) told me everything I needed to know.
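(For anyone following along on a live system rather than from a report: the Dictionary Cache Statistics are exposed through v$rowcache, so a minimal sketch of the equivalent check – cumulative figures since startup – is simply a query like the one below; in this case dc_tablespaces would have floated straight to the top.)

select  parameter, gets, getmisses
from    v$rowcache
order by
        gets desc
;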

I could show you the specific Instance Activity statistic that I wanted to see next, but I’ll postpone that for a moment and jump to the Tablespace IO Stats – which we were planning to look at anyway, and might have done even if we hadn’t known the rocket science behind the massive number of gets on dc_tablespaces.

That’s a horrendous number of (single block) reads of the undo tablespace – and why would they be happening? The instance is engaged in some massive rollback activity (and the transactions being rolled back are on objects in the GIRO tablespace – which is why it is also suffering a lot of single block reads), and this is the point where we jump to the relevant Instance Activity statistic to confirm the claim:
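(As a sketch for anyone checking a live instance rather than reading the report: the relevant figures appear in v$sysstat under statistic names containing the text “undo records applied”, so a quick query like the following – cumulative values since startup – picks up all of them at once.)

select  name, value
from    v$sysstat
where   name like '%undo records applied%'
order by
        name
;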

There are two other “undo records applied” statistics, but we don’t need to think about them – the match between the count of records applied and the number of gets on dc_tablespaces is nearly perfect. Almost everything that this instance is doing is rolling back – there must have been some enormous data changes (possibly just one, possibly parallel-enabled) that failed in the previous half hour, and now the entire mess is being cleaned up.

One little detail to note – the “undo records applied” per second is 122,355, but the Load Profile reported 247,885 “Block changes” per second. The two figures are consistent with each other: each application of an undo record is two block changes – the first when you modify the original data block, the second when you update the undo record itself to flag it as “applied” – and 122,355 * 2 = 244,710, which is a good match for 247,885.

Final Thoughts

There is a second reason why you could see lots of gets on dc_tablespaces – but I think that was a bug in 9i relating to temporary tablespace groups. The phenomenon appearing as a side effect of rolling back was something I discovered back in the 8i timeline, and I’ve not actually checked what an AWR report would really look like if I forced a massive rollback to take place as the only workload across the interval – so here’s a quick test I constructed and ran to finish the thread off:


set timing on

rem     Create a 1M row table (1,000 x 1,000 cross join), roughly 150 bytes per row

create table t1 as
with g as (select rownum id from dual connect by level <= 1000)
select rownum id, rownum n1, rpad('x',150) padding from g,g
;

create index t1_i1 on t1(id);

rem     A single (uncommitted) transaction of 1M single-row updates

begin
        for i in 1..1000000 loop
                update t1 set n1 = n1 where id = i;
        end loop;
end;
/

alter system flush buffer_cache;

pause take AWR snapshot from another session

rollback;

prompt take AWR snapshot from another session

On the small VM I used to run the test it took a little over 60 seconds to run the update, and the same again to complete the rollback. The “DB time” shown in the Time Model section of the AWR report was 62.0 seconds, while the “sql execute elapsed time” was 3.9 seconds (which was the SQL executed while capturing the AWR data).
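(One extra detail from the test, offered as a sketch rather than as part of the original write-up: while the rollback is running you can watch its progress from another session, because v$transaction reports the undo blocks and undo records still held by the transaction – used_urec counts down towards zero as the undo is applied.)

rem     Run from another session while the rollback is in progress

select  xidusn, xidslot, xidsqn, status, used_ublk, used_urec
from    v$transaction
;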

Conclusion

This was a problem where the database seemed to have done a lot of work that couldn’t be attributed to any SQL. While I had a little rocket science up my sleeve that may have allowed me to identify the source more quickly and with more confidence than the average DBA, all I’ve done in this note is highlight a couple of oddities and big numbers that anyone could have spotted, and follow a couple of simple guesses:

a) DB time was large, but sql (and pl/sql) time was non-existent

b) Top timed events were I/O and latches – so identify the latch

c) The specific latch pointed towards tablespaces – so check the Tablespace I/O and note the UNDO tablespace

d) Look at any Instance Activity statistics about “undo”.

e) Job done – but a quick test case is the icing on the cake.

 

 
