Oracle Scratchpad

January 14, 2011

Statspack on RAC

Filed under: Statspack — Jonathan Lewis @ 6:46 pm GMT Jan 14, 2011

Some time ago I was on a client site that was busy setting up a RAC cluster using Oracle 10g. Although I wasn’t involved with that particular system there were a couple of coffee-breaks where I ended up chatting with the consultant the client had hired to install the cluster. During one of those breaks he surprised me with the casual comment: “Of course, you can’t run Statspack on RAC because it locks up the system.”

Now there’s no reason why Statspack should lock up a RAC cluster – in principle. But this was an 8-node cluster, and if you set up the automatic job to take snapshots by running the default spauto.sql script then all eight nodes could start a snapshot every hour, on the hour – so they would all be fighting constantly for every block of every table and index in the Statspack schema. (I’m exaggerating for effect, of course, but not by much.) You might not notice the global cache contention in a 2-node cluster, but eight nodes could, indeed, “lock up the system” at this point.

Ideally, of course, you would like to run all eight snapshots simultaneously so that you get information from the same moment across all nodes. In practice you have to stagger the snapshots.

The code for taking snapshots for the AWR (automatic workload repository) gets around this problem by using the ‘WF’ lock for cross-instance synchronisation. One instance gets the lock exclusively, and that’s the instance that sets the snapshot time and coordinates the flushing to disc for all the other instances. (The other instances collect their stats at the same moment – but are cued to flush them to disc one after the other.)

For Statspack you have to do something manually – and the simplest approach that people sometimes take is to run a slightly different version of spauto.sql on each node so that each node is scheduled to start its snapshot a couple of minutes after the previous one. But there is an alternative that is a little smarter and eliminates the risk of two instances overlapping because of time slippage or a slow snapshot.
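To make the “staggered” approach concrete, here’s a sketch of how the dbms_job.submit call made by spauto.sql might be modified – the two-minute-per-instance offset and the variable names are my own choices for illustration, not part of the original script:

```sql
rem     Sketch only: a staggered variant of the dbms_job.submit call
rem     made by spauto.sql. The 2-minute-per-instance offset is an
rem     arbitrary choice for illustration.

variable jobno  number
variable instno number

begin
        select  instance_number
        into    :instno
        from    v$instance;

        dbms_job.submit(
                job             => :jobno,
                what            => 'statspack.snap;',
                -- instance N starts (N - 1) * 2 minutes past the hour
                next_date       => trunc(sysdate + 1/24,'HH') + (:instno - 1) * 2 / 1440,
                interval        => 'trunc(SYSDATE + 1/24,''HH'') + ' || ((:instno - 1) * 2) || '/1440',
                no_parse        => TRUE,
                instance        => :instno
        );
        commit;
end;
/
```

The weakness, as noted above, is that the offsets are fixed: if one snapshot runs long, or the clocks drift, two instances can still overlap.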

Taking the idea from AWR we can make use of dbms_lock to ensure that each snapshot starts as soon as possible after the previous one. We simply create a wrapper procedure for Statspack that takes a specified user-defined lock in exclusive mode before calling the snapshot procedure, then releases the lock after the snapshot is complete. We can schedule this procedure to run on every node at the start of each hour – which means every node in the system will try to take the lock simultaneously – but only one node will get it and the rest will queue. As each node completes its snapshot and releases the lock, the next node in the queue immediately acquires the lock and starts its own snapshot.

Here’s some sample code for the wrapper. I’ve included a couple of debug messages, and a piece of code that makes the procedure time out without taking a snapshot if it has to wait for more than 10 minutes. Note that you have to grant execute on dbms_lock to perfstat for this to work.

create or replace procedure rac_statspack (i_snap_level in number) as
        m_status        number(38);
        m_handle        varchar2(60);
begin

        sys.dbms_lock.allocate_unique(
                lockname        => 'Synchronize Statspack',
                lockhandle      => m_handle
        );

        m_status := sys.dbms_lock.request(
                lockhandle              => m_handle,
                lockmode                => dbms_lock.x_mode,
                timeout                 => 600,         -- default is dbms_lock.maxwait
                release_on_commit       => false        -- which is the default
        );

        if (m_status = 0) then

                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Acquired lock, running statspack'
                );

                statspack.snap(i_snap_level => i_snap_level);

                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Snapshot completed'
                );

                m_status := sys.dbms_lock.release(
                        lockhandle      => m_handle
                );

        else
                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        case m_status
                                when 1 then ': Lock wait timed out'
                                when 2 then ': deadlock detected'
                                when 3 then ': parameter error'
                                when 4 then ': already holding lock'
                                when 5 then ': illegal lock handle'
                                       else ': unknown error'
                        end
                );
        end if;

end;
/


And here’s a little bit of pl/sql you can run on each node in turn to install the procedure under dbms_job.

declare
	m_job	number;
	m_inst	number;
	m_date	date;
	m_jqs	number;
begin

	select	instance_number
	into	m_inst
	from	v$instance;

	dbms_job.submit(
		job		=> m_job,
		what		=> 'rac_statspack(7);',
		next_date	=> trunc(sysdate + 1 / 24,'HH'),
		interval	=> 'trunc(SYSDATE + 1 / 24,''HH'')',
		no_parse	=> TRUE,
		instance	=> m_inst,
		force		=> true
	);

	commit;

	select	next_date
	into	m_date
	from	dba_jobs
	where	job = m_job;

	select	to_number(value)
	into	m_jqs
	from	v$parameter
	where	name = 'job_queue_processes';

	dbms_output.put_line('Job number: ' || m_job);
	dbms_output.put_line('Next run time: ' || to_char(m_date,'dd-Mon-yyyy hh24:mi:ss'));
	dbms_output.put_line('Current Job Queues: ' || m_jqs || ' (must be greater than zero)');

end;
/


Warning: Judging by the date stamps on my files it’s at least 18 months since I last took this approach with a system – so (a) you might want to test it carefully before you use it and (b) you might want to modify the code to use dbms_scheduler to run the job rather than dbms_job.
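For anyone who does want to switch to dbms_scheduler, here’s a minimal, untested sketch of what the equivalent job creation might look like. The job name, the calendar string, and the comments text are my own inventions; you would still run this once on each node, and by default a scheduler job has “instance stickiness”, tending to keep running on the instance where it first ran:

```sql
-- Sketch only (untested): a dbms_scheduler equivalent of the dbms_job call.
-- The job name and calendar string are assumptions, not from the original post.
begin
        dbms_scheduler.create_job(
                job_name        => 'RAC_STATSPACK_JOB',
                job_type        => 'PLSQL_BLOCK',
                job_action      => 'begin rac_statspack(7); end;',
                start_date      => trunc(systimestamp + 1/24,'HH'),
                repeat_interval => 'FREQ=HOURLY; BYMINUTE=0; BYSECOND=0',
                enabled         => true,
                comments        => 'Serialised Statspack snapshot via rac_statspack'
        );
end;
/
```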

[Further Reading on Statspack]


  1. Hi.

    If you are using 10.2 or above you could also do this using a scheduler job chain. Having said that, your sample code is pretty concise and the scheduler job chain setup probably looks a little more complicated if you are not used to using it. :)



    Comment by Tim Hall — January 14, 2011 @ 8:26 pm GMT | Reply

    • Tim,

      Thanks for that thought. It looks like I fell into the old trap of “repeat whatever worked last time” :(

      If you’ve got an example of scheduler job chain setup on your website feel free to add a link to it here.

      Comment by Jonathan Lewis — January 16, 2011 @ 5:45 pm GMT | Reply

      • From a programming point of view the lock approach is much clearer and does exactly what it is supposed to do. Chains, on the other hand, introduce an ordering which seems to be unnecessary in this situation. The code to set up a chain isn’t complicated though, IMO.

        I’m exaggerating for effect, of course, but not by much

        Well, indeed this code (like any other non-RAC-friendly piece of software) will put some pressure on the private interconnect and the LMS processes. So, in theory, it could compete with applications running in the cluster. But I think there’s a chance such situations could pass without severe effect on other applications. So “lock up the system” is indeed an extreme scenario – and I wouldn’t mind knowing the other details behind that sentence (with numbers).
        Also, thank you for the WF enqueue description. I’ve seen it multiple times (with quite large wait times reported) but didn’t realize what Oracle was trying to sync.

        Comment by Timur Akhmadeev — January 17, 2011 @ 8:14 am GMT | Reply


Comments and related questions are welcome.
