Oracle Scratchpad

January 26, 2012

Ouch!

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 8:19 pm BST Jan 26,2012

Here’s a set of Instance Activity stats I’ve never seen before, and I’d rather never see again. From an active standby running 11.1.0.7 on AIX:

select
        name, value
from    v$sysstat
where
        name in (
                'consistent gets - examination',
                'consistent gets',
                'session logical reads'
        )
or      name like 'transaction tables%'
;

NAME                                                                            VALUE
---------------------------------------------------------------- --------------------
session logical reads                                                 102,731,023,313
consistent gets                                                       102,716,499,376
consistent gets - examination                                          98,170,595,252
transaction tables consistent reads - undo records applied             96,590,314,116
transaction tables consistent read rollbacks                                2,621,019

5 rows selected.

The instance has been up for about 60 hours – and 95% of the work it has done has been trying to find the commit times for transactions affecting blocks that are in need of cleanout. If you look at the two statistics about the transaction tables (those are the things in the undo segment header blocks) you can see that the average work done to find a commit time was a massive 48,000 visits to undo blocks.

The solution was fairly simple – kill all the reports which had been running for the last six hours, because they were the ones that were causing a problem, while simultaneously suffering most from the problem – at the time I killed the worst offender it was managing to read about 50 blocks per minute from the database, and doing about 100,000 buffer visits to undo blocks per second.

You probably won’t see this every again, but if you do, a quick check is:

select * from v$sess_io order by consistent_changes; 

Repeat a couple of times and check if any sessions are doing a very large number (viz: tens of thousands) of consistent changes per second.

9 Comments »

  1. Jonathan,

    If I am understanding this correctly, this is not a bug but rather just a normal consequence of the active standby continuing to recover as it is begin queried. Is my understanding correct?

    Thanks,

    John

    Comment by John Darrah — January 26, 2012 @ 9:22 pm BST Jan 26,2012 | Reply

    • John,

      I think I’d have to agree with your assessment. There are a couple of anomalies that I want to look into in case they are contributing (or exacerbating) factors, but I think the effect appears because

      a) the standby can’t cleanout blocks when it visits them – it’s read-only
      b) the primary isn’t going to clean every block out very promptly because the nature of the application means that some blocks that get changed will only be revisited (at least in the near future) by the reports.
      c) lots of blocks will be changed, committed, but not cleaned out in the primary as some of the biggest reports are running in the standby

      Comment by Jonathan Lewis — January 26, 2012 @ 11:32 pm BST Jan 26,2012 | Reply

      • Wasn’t there a “minscn” optimization that’s supposed to kick in? Isn’t this the reason why restarting the report queries helped in your case?

        Comment by Christo Kutrovsky — January 27, 2012 @ 5:01 pm BST Jan 27,2012 | Reply

        • Christo,

          I think the minscn is supposed to help when it’s possible to determine logically a minimum SCN by which the commit must have happened – for example the commit on a block in a read-only tablespace must have happened before the tablespace went read-only.

          In a busy system, though, if you are constantly changing (and committing) data and I start to run a long query, there have to be cases where I can know that you’ve committed a transaction, but need to know exactly when you committed that transaction.

          Comment by Jonathan Lewis — January 28, 2012 @ 10:42 am BST Jan 28,2012

  2. Even I could see this kind of activity on our database. It is 2 node RAC cluster running ETL on one node and reports on the other node.

    NAME                                                                                VALUE
    ---------------------------------------------------------------- ------------------------
    session logical reads                                                     347,206,851,483
    consistent gets                                                           314,535,839,356
    consistent gets - examination                                              46,844,294,185
    transaction tables consistent reads - undo records applied                      9,809,799
    transaction tables consistent read rollbacks                                       15,213

    Comment by raova — January 27, 2012 @ 10:40 am BST Jan 27,2012 | Reply

    • Raova,

      Thanks, wseful follow-up (even though the scale of the issue isn’t in the same order of magnitutude).

      I’ve always assumed that this could happen on a RAC system that tried to do OLTP on one node and reporting on the other, but I’ve not actually seen it on a production system. If you have access to metalink there are a couple of bugs/notes addressing exactly this type of symptom. In part it relates to the effects of undo autotuning and the way that you can end up with very large undo segments.

      Comment by Jonathan Lewis — January 28, 2012 @ 10:38 am BST Jan 28,2012 | Reply

  3. Jonathan,

    I guess there really is no such thing as a free lunch. The scenario you describe would be pretty common for people trying to leverage their standby as reporting environment. I guess one way to mitigate this type of behavior would be to periodic full scans of the most used reporting tables but even that may not help if the tables are read direct: http://antognini.ch/2009/07/impact-of-direct-reads-on-delayed-block-cleanouts/, This is a really good caveat for people to be aware of when considering Active Standby Database. I’m actually presenting on how to leverage some Oracle features to clone databases at this year’s RMOUG event. Would you mind if I brought this up (giving you your due credit of course)?

    John

    Comment by John Darrah — January 27, 2012 @ 6:03 pm BST Jan 27,2012 | Reply

    • John,

      Feel free to bring it up. In fact I’ll probably be in the audience anyway.

      I’m not sure that doing periodic scans would help because the problem relates to changes made after the report starts running.

      In the old days (pre-automatic undo management, undo autotuning etc.) the problem would probably have shown up as an ORA-01555 error. The fact that there’s lots of undo available to protect you from the crash is the source of the extra work you do trying to complete the report.

      Comment by Jonathan Lewis — January 28, 2012 @ 10:49 am BST Jan 28,2012 | Reply

  4. Jonathan,

    we hit this unpleasant situation several times in 9iR2. We had few huge tables (several 100s of GB), which were filled by direct path reads every day. There were some weekly reports which used only part of those tables, but still selected large volume of data. The only sensible execution plan was using indexes and nested loops joins. There was a lot of work done by reports to do the delayed block cleanout work. But from time to time, it was really expensive because of finding commit time (when few commit intensive jobs started in parallel with our reports)
    The nasty thing is oracle does not provide simple solution to this issue. Oracle introduced (I think it was 9iR2) optimization for FTS of data imported by direct path read operation. Oracle is able to find chunks of data filled by one direct path job and will perform delayed block cleanout only for the very first data block ( this is one of the reasons why blocks formatted by direct path have initially 3 ITL slots). Mentioned optimization is not possible for INDEX RANGE SCAN, causing several headaches to me and my colleagues, because we were unable to clean all “dirty” blocks in efficient way (just like with single session direct path reads in 11g).

    The issue could be even worse for data stored in read only tablespaces.

    Regards
    Pavol Babel

    Comment by Pavol Babel — February 7, 2012 @ 2:41 am BST Feb 7,2012 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 3,508 other followers