Oracle Scratchpad

March 3, 2014

Flashback Fail ?

Filed under: Bugs,Oracle,Troubleshooting — Jonathan Lewis @ 4:19 pm GMT Mar 3,2014

Sitting in an airport, waiting for a plane, I decided to read a note (pdf) about Flashback data archive written by Beat Ramseier from Trivadis.  I’d got about three quarters of the way through it when I paused for thought and figured out that on the typical database implementation something nasty is going to happen after approximately 3 years and 9 months.  Can you guess why ?

It’s all about smon_scn_time – which normally records one row every five minutes (created by smon), recycling the oldest rows so that the table stays at roughly 1,440 rows, covering about the last five days of SCN/time mappings. The table is in a cluster, and the cluster key is the instance (thread) number. Clearly this was originally a clever idea from someone who realised that a cluster key of thread number would be beneficial if you had a RAC system with multiple instances – each instance gets its own blocks and the data for any one instance is as well clustered as possible.
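
If you want to check what your own system holds, a query like the following (just a sketch – smon_scn_time is a SYS-owned table, so run it from a suitably privileged account) shows the row count and the time range covered per thread:

select thread,
       count(*)      row_count,
       min(time_dp)  oldest_mapping,
       max(time_dp)  newest_mapping
from   sys.smon_scn_time
group by thread
order by thread
;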

The trouble is, when you enable flashback data archive smon no longer recycles the old rows, it just keeps adding new ones. Now on my 8KB block tablespace I see 6 rows per block in the table/cluster – which means I get through 48 blocks per day, 17,520 blocks per year, and in 3 years and 9 months I’ll get to roughly 65,700 blocks – and that’s the problem.  An index entry in a cluster index points to a chain of cluster blocks, and the last two bytes of the “rowid” in the index entry identify which block within the chain the cluster key scan should start at – and two bytes means you can only have 65,536 blocks for a single cluster key.
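
If you want to see how far along your own system already is, a rough check of the blocks used (and hence rows per block) can be done with dbms_rowid – again just a sketch against the SYS-owned table:

select count(distinct dbms_rowid.rowid_relative_fno(rowid) || '.' ||
                      dbms_rowid.rowid_block_number(rowid))   used_blocks,
       count(*)                                               total_rows
from   sys.smon_scn_time
;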

I don’t know what’s going to happen when smon tries to insert a row into the 65,535th (-ish) block for the current thread – but it ought to raise an Oracle error, and then you’ll probably have to take emergency action to make sure that the flashback mechanisms carry on running.

Although oraus.msg indicates that it’s an error message about hash clusters, it’s possible that the first sign of trouble will be Oracle error “ORA-02475 maximum cluster chain block count of %s has been exceeded”. If you’re using a 16KB block size then you’ve got roughly 7.5 years, and 32KB block sizes give you about 15 years (not that that’s a good argument for selecting larger block sizes, of course).
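
Purely as a back-of-the-envelope check of those figures – 288 mapping rows per day, and roughly 6, 12 and 24 rows per block for 8KB, 16KB and 32KB blocks respectively:

select block_size_kb,
       rows_per_block,
       round(65536 / (288 / rows_per_block) / 365.25, 2)   years_to_limit
from   (
         select  8 block_size_kb,  6 rows_per_block from dual
         union all
         select 16, 12 from dual
         union all
         select 32, 24 from dual
       )
;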

Footnote:

Searching MoS for related topics (smon_scn_time flashback) I found Doc ID 1100993.1, from which we could possibly infer that 1,440 rows was a fixed limit in 10g but that the number of rows allowed in smon_scn_time could increase in 11g if you enable automatic undo management. I also found a couple of bugs relating to index or cluster corruption – fixed by 11.2.0.4, though.

14 Comments

  1. Jonathan, this sounds like a serious issue in an option we nearly went for at one point. Have Oracle responded?

    Comment by John Thomas — March 4, 2014 @ 10:39 am GMT Mar 4,2014

    • It’s not mine – I don’t raise SRs – but here’s a very recent bug which looks like an exact match: “Bug 18294320 : ORA-1555 ON FDA TABLES”. (The title doesn’t give an immediate clue – but I was searching the bug database on flashback and smon_scn_time.) Reading the bug, there’s a suggested temporary workaround of recreating the smon_scn_time table as a non-clustered table.

      I also infer from the content of the bug that you might not notice that you have a problem until you finally want to use the archive – unless you happen to have previously noticed that the smon trace file had grown very large dumping ORA-02475 error messages.

      Comment by Jonathan Lewis — March 6, 2014 @ 10:57 am GMT Mar 6,2014

  2. Bug 22456983 : LIMIT ON SMON_SCN_TIME AFFECTING FDA
    Unfortunately, the publicly visible bug text doesn’t provide enough information. Only that a 12c customer is unable to insert data !

    Comment by Hemant K Chitale — January 21, 2016 @ 9:32 am GMT Jan 21,2016

  3. Oracle Support suggests renaming the mapping table, recreating it as a non-clustered table and finally reloading the data.

    Comment by Nenad Noveljic — August 22, 2016 @ 9:02 pm GMT Aug 22,2016

    • Nenad,

      Do you have a MoS document or bug number where that suggestion is officially made ? It would be nice to have the reference for future readers.

      Comment by Jonathan Lewis — August 23, 2016 @ 1:18 pm GMT Aug 23,2016

      • Jonathan,
        First and foremost, thank you for, as always, an excellent analysis.
        Unfortunately, there is no official document for fixing the problem. However, I got the following procedure from Oracle Support after opening a service request:

        -- preserve the existing data under a new name
        rename smon_scn_time to smon_scn_time_org
        /
        -- recreate smon_scn_time as a stand-alone (non-clustered) table,
        -- reloading the old data as part of the create
        create table smon_scn_time
        tablespace sysaux
        as select * from smon_scn_time_org
        /
        -- drop and recreate the two unique indexes against the new table
        drop index smon_scn_time_tim_idx
        /
        create unique index smon_scn_time_tim_idx on smon_scn_time(time_mp)
        tablespace SYSAUX
        /
        drop index smon_scn_time_scn_idx
        /
        create unique index smon_scn_time_scn_idx on smon_scn_time(scn)
        tablespace SYSAUX
        /

        Note that there would be data loss between the first occurrence of ORA-02475 and applying the fix.
        Prior to obtaining the final fix, I was shown the content of the MOS note How To Recreate SMON_SCN_TIME Table and Related Cluster (Doc ID 837812.1), which turned out to be useless because it just recreates the table without taking care of the old data.
        Oracle Development is currently working on a sustainable solution which will be implemented in one of the future releases. The bug is tracked under Bug 22567457 : PERMANENT FIX FOR SMON_SCN_TIME LIMITS.
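
        A quick sanity check after the rebuild, to confirm that the new table really is outside the cluster (just a sketch against the standard dictionary views):

        select table_name, cluster_name, tablespace_name
        from   dba_tables
        where  table_name in ('SMON_SCN_TIME', 'SMON_SCN_TIME_ORG')
        -- the rebuilt SMON_SCN_TIME should show a null CLUSTER_NAME
        ;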

        Comment by Nenad Noveljic — August 23, 2016 @ 5:26 pm GMT Aug 23,2016

        • Nenad,

          Thanks for the reply – there are a couple more notes on MoS which suggest the same strategy; I think I’ve used a search for smon_scn_time to find them in the past.
          One thought crosses my mind on reading the SQL you’ve posted, though. The indexes are both declared unique, but the table holds data from multiple instances if you’re running RAC, and I can’t think of any good reason why two instances couldn’t record the same SCN or same time_mp at the same time.

          I’ll have to check it on a RAC system, but it looks as if uniqueness ought to be based on (thread, SCN) and (thread, time_mp); maybe Oracle has a lock that is taken by each instance in turn so that they can’t collide. Or maybe (since there’s little point in having multiple instances record the values) only one smon ever populates the table.
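
          Purely as an illustration of what I mean – not a recommendation, and not what the support script above creates – thread-qualified uniqueness would look like:

          create unique index smon_scn_time_scn_idx on smon_scn_time(thread, scn)
          tablespace sysaux
          /
          create unique index smon_scn_time_tim_idx on smon_scn_time(thread, time_mp)
          tablespace sysaux
          /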

          Comment by Jonathan Lewis — August 23, 2016 @ 5:52 pm GMT Aug 23,2016

  4. Jonathan,
    I couldn’t find any publicly available MoS note. Curiously, I got the procedure from support only after an advanced support engineer got involved. The default answer from the support engineer was the already mentioned MOS note How To Recreate SMON_SCN_TIME Table and Related Cluster (Doc ID 837812.1), which would have implied data loss.
    As the SR is still open, I asked the questions about RAC. Could you imagine that smon just tries to insert the data, gets a constraint violation if the SCN already exists, and ignores the error?
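
    Something along those lines, perhaps – a sketch of the pattern only, using a hypothetical stand-in table called my_scn_map, not a claim about what the smon code actually does:

    begin
      -- my_scn_map is a hypothetical stand-in table with the same three columns
      insert into my_scn_map (thread, scn, time_mp)
      values (0,
              dbms_flashback.get_system_change_number,
              round((sysdate - date '1970-01-01') * 86400));
    exception
      when dup_val_on_index then
        null;   -- a clash on a unique index is simply ignored
    end;
    /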

    Comment by Nenad Noveljic — August 23, 2016 @ 7:34 pm GMT Aug 23,2016

    • Nenad,
      I won’t make any guesses until I’ve looked at a live RAC system, but the following is probably a good clue as to how Oracle avoids collisions:

      
      SQL> select * from V$lock_type where lower( description) like '%smon%';
      
      TYPE NAME                             ID1_TAG                          ID2_TAG                          IS_
      ---- -------------------------------- -------------------------------- -------------------------------- ---
      DESCRIPTION
      ------------------------------------------------------------------------------------------------------------------------------------
      TD   KTF map table enqueue            0                                0                                NO
      Serializes updates and inserts to the SMON_SCN_TIME mapping table
      
      TE   KTF broadcast                    0                                0                                NO
      Serializes broadcasts for flushes to SMON_SCN_TIME
      
      SM   SMON Serialization               0                                0                                NO
      Lock to check SMON global work in RAC
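
      If you wanted to see these enqueues being taken while smon is flushing the mappings you could poll v$lock for them (a sketch only):

      select sid, type, id1, id2, lmode, request, block
      from   v$lock
      where  type in ('TD', 'TE', 'SM')
      ;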
      
      

      Comment by Jonathan Lewis — August 23, 2016 @ 7:45 pm GMT Aug 23,2016

      • Jonathan,
        Here is what Oracle Support has said about avoiding collisions in RAC:
        “The SCN changes whenever the database issues a COMMIT statement and it’s a system-wide value stored in the control files for each node. In a RAC cluster, each of the nodes generate its own SCN and then synchronizes it to the rest of the RAC nodes. The Lamport algorithm works with system change numbers (SCN) and the distributed lock manager (DLM) to control and lock management and conflict resolution.
        From Oracle 10gR2 onwards, the LGWR background process will broadcast the current SCN to all the instances (nodes) in the cluster, before committing any transaction. “

        Comment by Nenad Noveljic — August 25, 2016 @ 2:34 pm GMT Aug 25,2016

        • I’m never too sure when talking to Oracle support whether I’ve got someone who actually knows what they’re talking about, or whether it’s someone who has read a note that isn’t visible to customers – but that comment about the SCN being stored in the control files doesn’t sound likely – you don’t want EVERY commit to have to lock and then wait to write to a control file, and you don’t want every commit to have to re-read every node’s “commit record” before deciding what its SCN should be. On the other hand you do need to keep the instances synchronized so that they are all running at the same SCN as much as possible – hence the broadcast on commit.

          That doesn’t stop two (or more) instances attempting to commit at the same instant and then causing a (possibly very small) delay if they have to try to resolve the conflict. As far as the SMON_SCN_TIME update is concerned, though, that collision can be pre-empted by something like the TD lock above – if it’s a global lock (hence the reference to the DLM) each instance can guarantee that it’s the only one holding it in exclusive mode when it updates the table, and the lock gives the instance a mechanism to guarantee that it gets itself to a new (global) SCN before inserting a new row.

          On the other hand, the time_mp column (used for the other unique index) is the number of seconds since 1st Jan 1970. So if there’s a delay mechanism trying to guarantee uniqueness then it’s a pretty long delay (one second per node). I’ll have to try it, but I think the solution may simply be that each smon doesn’t even try to insert a row if there’s a collision (the TD lock, if that’s the one used, could carry the time_mp value from the previous insert), or maybe it allows the “duplicate key” error and ignores it.

          Comment by Jonathan Lewis — August 28, 2016 @ 9:56 am GMT Aug 28,2016

        • Jonathan,
          Thank you for sharing the insights.

          Comment by Nenad Noveljic (@NenadNoveljic) — August 31, 2016 @ 11:21 pm GMT Aug 31,2016

        • Nenad,

          Just a couple of details – though nothing like a proper description – for RAC:

          Despite the THREAD column suggesting that each instance keeps its own set of rows, the thread is always set to zero whichever instance does the insert or update.

          In general there is only one insert every 5 minutes, but the smons can get a little out of sync, and there is some code that seems to check whether the smon that inserted a row has recorded “enough” SCN mapping values; if not, a second smon can insert another row for the same time period (even though the time stamp and SCN will be slightly different). For example, I have an smon_scn_time table with two rows for 10:14:20 and 10:14:27; the first has num_mappings = 0, the second has num_mappings = 89. Whenever I see two rows for a time period the first one seems to have zero num_mappings – but that may be a coincidence.

          I’ve also noted that the time_dp can slip from the date/time supplied by sysdate by a significant margin (perhaps smon is calling vktm instead of calling the system time for its map).
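
          A quick way to spot the doubled-up periods (grouping to the minute as a rough proxy for the 5-minute period – a sketch only):

          select to_char(trunc(time_dp, 'MI'), 'yyyy-mm-dd hh24:mi')  period,
                 count(*)                                             rows_in_period,
                 min(num_mappings)                                    min_mappings,
                 max(num_mappings)                                    max_mappings
          from   sys.smon_scn_time
          group by trunc(time_dp, 'MI')
          having count(*) > 1
          order by 1
          ;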

          Comment by Jonathan Lewis — September 1, 2016 @ 12:41 pm GMT Sep 1,2016

