Some time ago I was on a client site which was busy setting up a RAC cluster using Oracle 10g. Although I wasn’t involved with that particular system there were a couple of coffee-breaks where I ended up chatting with the consultant that the client had hired to install the cluster. In one of our breaks he surprised me by making the casual comment: “Of course, you can’t run Statspack on RAC because it locks up the system.”
Now there’s no reason why Statspack should lock up a RAC cluster – in principle. But this was an 8-node cluster, and if you set up the automatic snapshot job by running the default spauto.sql script then all eight nodes would start their snapshots every hour on the hour – so they would all be fighting constantly for every block of every table and index in the Statspack schema. (I’m exaggerating for effect, of course, but not by much.) You might not notice the global cache contention in a 2-node cluster, but eight nodes could, indeed, “lock up the system” at this point.
Ideally, of course, you would like to run all eight snapshots simultaneously so that you capture information from the same moment across all nodes. In practice you have to stagger the snapshots.
The code for taking snapshots for the AWR (automatic workload repository) gets around this problem by using the ‘WF’ lock for cross-instance synchronisation. One instance gets the lock exclusively, and that’s the instance that sets the snapshot time and coordinates the flushing to disc for all the other instances. (The other instances collect their stats at the same moment – but are cued to flush them to disc one after the other.)
For Statspack you have to do something manually – and the simplest approach people sometimes take is to run a slightly different version of spauto.sql on each node so that each node is scheduled to start its snapshot a couple of minutes after the previous one. But there is an alternative that is a little smarter and eliminates the risk of two instances overlapping because of time slippage or a slow snapshot.
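For the record, the staggered approach is just a per-node variant of the dbms_job call that spauto.sql makes. A sketch for instance 3, assuming a two-minute gap per instance (the offset arithmetic is mine, not Oracle’s):

```sql
rem  Staggered variant of the spauto.sql job for instance 3:
rem  offset the hourly start time by (instance - 1) * 2 minutes,
rem  i.e. 4 minutes past the hour for the third node.

variable jobno number;

begin
        dbms_job.submit(
                job       => :jobno,
                what      => 'statspack.snap;',
                next_date => trunc(sysdate + 1/24,'HH') + (3 - 1) * 2 / (24 * 60),
                interval  => 'trunc(SYSDATE + 1/24,''HH'') + (3 - 1) * 2 / (24 * 60)'
        );
        commit;
end;
/
```

You would edit the literal 3 on each node – which is exactly the sort of error-prone manual fiddling that the lock-based approach below avoids.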
Taking the idea from AWR we can make use of dbms_lock to ensure that each snapshot starts as soon as possible after the previous one. We simply create a wrapper procedure for Statspack that tries to take a specified user-defined lock in exclusive mode before calling the snapshot procedure, and releases the lock after the snapshot is complete. We can schedule this procedure to run on every node at the start of each hour – which means every node on the system will try to take the lock simultaneously – but only one node will get it and the rest will queue. After a node completes its snapshot and releases the lock, the next node in the queue immediately acquires the lock and starts its snapshot.
Here’s some sample code for the wrapper. I’ve included a couple of debug messages, and a piece of code that makes the procedure time out without taking a snapshot if it has to wait for more than 10 minutes. Note that you have to grant execute on dbms_lock to perfstat for this to work.
create or replace procedure rac_statspack(i_snap_level in number) as
        m_status        number(38);
        m_handle        varchar2(60);
begin

        sys.dbms_lock.allocate_unique(
                lockname   => 'Synchronize Statspack',
                lockhandle => m_handle
        );

        m_status := sys.dbms_lock.request(
                lockhandle        => m_handle,
                lockmode          => dbms_lock.x_mode,
                timeout           => 600,     -- default is dbms_lock.maxwait
                release_on_commit => false    -- which is the default
        );

        if (m_status = 0) then

                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Acquired lock, running statspack'
                );

                statspack.snap(i_snap_level);

                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        ': Snapshot completed'
                );

                m_status := sys.dbms_lock.release(
                        lockhandle => m_handle
                );

        else
                dbms_output.put_line(
                        to_char(sysdate,'dd hh24:mi:ss') ||
                        case m_status
                                when 1 then ': Lock wait timed out'
                                when 2 then ': deadlock detected'
                                when 3 then ': parameter error'
                                when 4 then ': already holding lock'
                                when 5 then ': illegal lock handle'
                                       else ': unknown error'
                        end
                );
        end if;

end;
/
And here’s a little bit of PL/SQL you can run on each node in turn to install the procedure under dbms_job.
declare
        m_job   number;
        m_inst  number;
        m_date  date;
        m_jqs   number;
begin

        select  instance_number
        into    m_inst
        from    v$instance;

        dbms_job.submit(
                job       => m_job,
                what      => 'rac_statspack(7);',
                next_date => trunc(sysdate + 1/24,'HH'),
                interval  => 'trunc(SYSDATE + 1/24,''HH'')',
                no_parse  => true,
                instance  => m_inst,
                force     => true
        );

        commit;

        select  next_date
        into    m_date
        from    dba_jobs
        where   job = m_job;

        select  value
        into    m_jqs
        from    v$parameter
        where   name = 'job_queue_processes';

        dbms_output.put_line('Job number: ' || m_job);
        dbms_output.put_line('Next run time: ' || to_char(m_date,'dd-Mon-yyyy hh24:mi:ss'));
        dbms_output.put_line('Current Job Queues: ' || m_jqs || ' (must be greater than zero)');

end;
/
Warning: Judging by the date stamps on my files it’s at least 18 months since I last took this approach with a system – so (a) you might want to test it carefully before you use it and (b) you might want to modify the code to use dbms_scheduler to run the job rather than dbms_job.
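If you do go the dbms_scheduler route, a minimal sketch might look like the following. Bear in mind my assumptions: the job name is my invention, and dbms_scheduler.create_job has no simple equivalent of the instance parameter of dbms_job.submit – to pin a copy of the job to each node you would have to use a job class tied to an instance-specific service (or, from 11g, the instance_id job attribute).

```sql
begin
        dbms_scheduler.create_job(
                job_name        => 'RAC_STATSPACK_JOB',               -- hypothetical name
                job_type        => 'PLSQL_BLOCK',
                job_action      => 'begin rac_statspack(7); end;',
                start_date      => trunc(systimestamp + 1/24,'HH'),   -- top of the next hour
                repeat_interval => 'FREQ=HOURLY; BYMINUTE=0; BYSECOND=0',
                enabled         => true
        );
end;
/
```

One advantage of the calendar syntax is that the job stays pinned to the top of the hour even if a run starts late, whereas a dbms_job interval expression is evaluated relative to the moment the job runs.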