Oracle Scratchpad

February 6, 2008

Trouble-shooting

Filed under: Troubleshooting — Jonathan Lewis @ 7:32 pm BST Feb 6,2008

How do you trouble-shoot ? Sometimes the best approach depends where you are starting from and how much time you have to act. Here’s a little  sequences of events that cropped up some time ago while I was on a customer visit the day after an upgrade from 9i to 10g.

I had just run off a couple of Statspack reports and was looking at an SQL statement that was at the top of the ‘SQL ordered by Gets’ section of the report, when it had never previously been on the report at all when the application owner called me to complain that the users were apparently seeing lots of locking on the system, and an online task that used to take a couple of seconds was frequently taking three or four minutes, and anything up to 15 minutes.

So here’s a quick and dirty script I ran to check if there were any blocking lockers – restricted only to transaction locks (type TX).

break on blocker skip 1 on block_time         

select
        l1.sid blocker, l1.ctime block_time,
        l2.sid waiter,  l2.ctime wait_time
from
	(
        select
                /*+ no_merge */
                sid, type, id1, id2, lmode, request, block, ctime
        from
                v$lock
        where
                type in ('TX')
	)       l1,
	(
        select
                /*+ no_merge */
                sid, type, id1, id2, lmode, request, block, ctime
        from
                v$lock
        where
                type in ('TX')
	)       l2
where
        l1.block	 = 1
and     l2.id1		 = l1.id1
and     l2.id2		 = l1.id2
and     l2.sid		!= l1.sid
order by
        l1.sid, l2.sid
;

This produced the following type of output – which persisted for many seconds before changing to show another session as the blocker or a few more sessions becoming waiters.

   BLOCKER BLOCK_TIME     WAITER  WAIT_TIME
---------- ---------- ---------- ----------
       379         17        307         13
                             353         17
                             474         13

Given the amount of time spent by the waiters (that’s the nice thing about locking problems, they’re often very slow moving), I was able to run the following queries to see what was going on:

SQL> select sql_id from v$session where sid = 379;        

SQL_ID
-------------
dr6uztacr84z8        

SQL> select sql_id from v$session where sid in (307,353,474);         

SQL_ID
-------------
0m9jzuyw7q594
0m9jzuyw7q594
0m9jzuyw7q594

A quick query against v$sql reported the following query for the waiters. Note the ‘for update’ in particular, they were all waiting for the same row in a summary table:

select	columnX
from	tableY
where	ID = 'A0001'
for update

Of course, the real oddity was that prior to the upgrade no-one had complained about locking. The application code had not been changed, so why could they possibly be running into a locking issue after the upgrade.

I ran the following query to check what the blocker was up to.

select
        executions, disk_reads, buffer_gets, sorts, rows_processed,
        round(cpu_time/1000000,2) cpu,  round(elapsed_time/1000000,2) ela,
        round(cpu_time/(executions * 1000000),2) cpu_per,
        round(elapsed_time/(executions * 1000000),2) ela_per,
        sql_text
from
        v$sql
where
        sql_id = 'dr6uztacr84z8'
;         

EXECUTIONS DISK_READS BUFFER_GETS      SORTS ROWS_PROCESSED        CPU        ELA    CPU_PER    ELA_PER
---------- ---------- ----------- ---------- -------------- ---------- ---------- ---------- ----------
SQL_TEXT
--------------------------------------------------------------------------------
       356    1973608  2528529790          0             88    28753.1   28759.14      80.77      80.78
SELECT /*+ {hints} */ FROM {list of tables}
WHERE
	{complex mix of conditions}
AND EXISTS (
	{single table subquery}
)
and rownum = 1
;

I was just slightly amused to see that this query was exactly the query that I had been looking at in the moments before the call came.

It turned out that prior to the upgrade this query took a few hundredths of a second to run; after the upgrade, largely due to a change in the optimizer’s capability, a change in execution plan pushed it into taking 80 seconds of CPU.

Unfortunately, the session would take a lock on a critical row before running this query, then commit shortly afterwards. The application had always had a locking problem, but since it lasted at most a few moments no-one had every noticed it.

This type of locking issue is self-reinforcing, of course. If you hold a lock for a few hundredths of a second you may get lucky and find that no-one goes for the same lock in the same time window. Hold it for 80 seconds, and you increase the chances that someone else will want it at about the same time. It would only take four waiters in the queue to get to the five minute wait times that some of the users had been complaining about – and there always seemed to be three or four whenever I looked.

The problem with this query was a statistical oddity that persuaded the (new, improved) optimizer to do something that in theory was a good thing, but actually turned out to be a very bad thing. The solution was simply to gather statistics at the right moment in time.

7 Comments »

  1. How do you trouble-shoot ?

    Normally, I’m not lucky as you are. At work, for security problems, we, me and other DB consultants, can not run stats-pack.

    So I normally use a combination of: select sid from v$session and identify the developer’s session that claim problems, select sql_text from v$sql using the sid, and then take a look at session stats and locks.

    I must say that I work with sql developers, and generally the solution to the problem is an alternative sql syntax or a particular hint.

    Comment by lascoltodelvenerdi — February 7, 2008 @ 9:06 am BST Feb 7,2008 | Reply

  2. Jonathan-

    What do you mean by “gather statistics at the right moment in time”? Are you simply referring to tables which change size and shape drastically and would thus cause the optimizer to behave poorly when the table isn’t the shape it expected? Or is there something else that would cause the time at which you gather statistics to change the results?

    Comment by Ben in Boston — February 7, 2008 @ 1:14 pm BST Feb 7,2008 | Reply

  3. Ben, you assumption about dramatic changes in shape and size is correct. It’s a problem that can appear quite easily in international systems – especially in the financial arena -there are often a few tables that may need to have their statistics refreshed as each market runs its end of day: i.e. three or four times a day at carefully judged moments.

    Comment by Jonathan Lewis — February 7, 2008 @ 8:55 pm BST Feb 7,2008 | Reply

  4. Jonathan,

    is there a particular reason why you do not use a self join and employ “no_merge”? My guess would be that it’s because you access a V$ view but I do not understand the reasoning here.

    Thanks!

    Comment by Robert — February 8, 2008 @ 12:50 pm BST Feb 8,2008 | Reply

  5. Robert,

    No good reason whatsoever. I was on a client site, I wrote it in about 3 minutes, I knew it would perform efficiently in this form (it’s generally a good strategy to do something like this with v$ if you want to use them for anything other than a totally trivial query) and didn’t want to waste time checking that a simple join with a couple of no_merge() hints would behave properly.

    You’ll notice that there’s an “in” operator which isn’t necessary either – it’s just that the first cut included ‘TM’ locks as well.

    Since you asked the question, I just rewrote it without the inline views on a 10.2.0.1 instance with fixed stats gathered, and the plans are identical.


    select
            /*+
                    no_merge(l1)
                    no_merge(l2)
            */
            l1.sid blocker, l1.ctime block_time,
            l2.sid waiter,  l2.ctime wait_time
    from
            v$lock  l1,
            v$lock  l2
    where
            l1.type  = 'TX'
    and     l1.block = 1
    and     l2.type  = 'TX'
    and     l2.id1   = l1.id1
    and     l2.id2   = l1.id2
    and     l2.sid  != l1.sid
    order by
            l1.sid, l2.sid
    /

    Comment by Jonathan Lewis — February 8, 2008 @ 1:14 pm BST Feb 8,2008 | Reply

  6. Really excellent article.

    Thank you verymuch Jonathan Lewis

    Regards
    Srinivas

    Comment by srinivas — December 8, 2008 @ 8:28 am BST Dec 8,2008 | Reply

  7. […] couple of years ago?  Oh yes, I did remember that Jonathan Lewis has blogged on a similar case (https://jonathanlewis.wordpress.com/2008/02/06/trouble-shooting-2/#more-320) .  Then I did understand that the TX – row lock contention wait event could be just a side […]

    Pingback by Trouble shooting read by other sessions wait event « — June 7, 2011 @ 6:03 pm BST Jun 7,2011 | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.