Oracle Scratchpad

March 3, 2014

Flashback Fail ?

Filed under: Bugs,Flashback,Oracle,Troubleshooting — Jonathan Lewis @ 4:19 pm GMT Mar 3,2014

Sitting in an airport, waiting for a plane, I decided to read a note (pdf) about Flashback data archive written by Beat Ramseier from Trivadis.  I'd got about three-quarters of the way through it when I paused for thought and figured out that on the typical database implementation something nasty is going to happen after you've been running with flashback data archive for approximately 3 years and 9 months.  Can you guess why ?

It’s all about the table smon_scn_time – which normally records one row (created by smon) every five minutes with a continuous cycle of 24 hours – typically giving you about 1,440 rows in the table. The table is in an index cluster and the cluster key is the instance (thread) number. Clearly this was originally a clever idea from someone who realised that a cluster key of thread number would be beneficial if you had a RAC system with multiple instances – each instance gets its own blocks and the data for any one instance is as well clustered as possible.

The trouble is, when you enable flashback data archive smon no longer sticks to a 24 hour cycle, it just keeps adding rows. Now on my 8KB block tablespace I see 6 rows per block in the table/cluster – which means I get through 48 blocks per day,  17,520 blocks per year, and in 3 years and 9 months I’ll get to roughly 65,700 blocks – and that’s the problem.  An index entry in a cluster index points to a chain of cluster blocks, and the last two bytes of the “rowid” in the index entry identify which block within the chain the cluster key scan should start at – and two bytes means you can only have 65,536 blocks for a single cluster key.

I don’t know what’s going to happen when smon tries to insert a row into the 65,535th (-ish) block for the current thread – but it ought to raise an Oracle error, and then you’ll probably have to take emergency action to make sure that you can carry on using the flashback mechanisms safely.

Although oraus.msg indicates that it's an error message about hash clusters, it's possible that the first clue you get will be the Oracle error: "ORA-02475 maximum cluster chain block count of %s has been exceeded". If you're using a 16KB block size then you've got roughly 7.5 years, and a 32KB block size gives you about 15 years (not that that's a good argument for selecting larger block sizes, of course).
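As a sanity check on the arithmetic, here's a minimal sketch of the exhaustion estimate. The rows-per-block figures for 16KB and 32KB blocks are my assumption, scaled linearly from the 6 rows per 8KB block observed above:

```python
# Estimate how long smon takes to fill the 65,536-block chain limit
# for one cluster key: one row every five minutes, six rows per 8KB
# block (larger block sizes scaled linearly - an assumption).

CHAIN_LIMIT = 2 ** 16            # two-byte block counter in the index entry
ROWS_PER_DAY = 24 * 60 // 5      # 288 rows per day

def years_to_limit(rows_per_block):
    blocks_per_day = ROWS_PER_DAY / rows_per_block
    return CHAIN_LIMIT / blocks_per_day / 365

for block_kb, rows in [(8, 6), (16, 12), (32, 24)]:
    print(f"{block_kb:2d}KB block size: ~{years_to_limit(rows):4.1f} years")
```

The 8KB case comes out at about 3.7 years, which matches the "3 years and 9 months" prediction.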

Footnote:

Searching MoS for related topics (smon_scn_time flashback) I found doc ID: 1100993.1 from which we could possibly infer that the 1,440 rows was a fixed limit in 10g but that the number of rows allowed in smon_scn_time could increase in 11g if you enable automatic undo management. I also found a couple of bugs relating to index or cluster corruption – fixed by 11.2.0.4, though.

Update March 2019

As ever, I re-discovered this note while searching my blog for something completely different. Since it had turned up, and given that it made a prediction for (roughly) March 2017, I thought I ought to search MoS for the "ORA-02475 maximum cluster chain block count of %s has been exceeded" error that I'd highlighted above, to see if there had been any reports of problems with smon_scn_time around the date I had predicted.

I found the following: How to recreate the smon_scn_time table so it's not a clustered table to avoid ORA-2475 (Doc ID 2389804.1). As usual you can't tell when it was written because Oracle only labels the documents with dates for "Last Update" and "Last Major Update" – in this case 14th March 2019. However the document links to: "BUG:22567457 – PERMANENT FIX FOR SMON_SCN_TIME LIMITS" (mentioned in the conversation at comment #4 below), which is a request for a permanent and official fix – not an actual fix.

Bug 22567457 reports a base bug of 16849652, but this is not a visible bug, although it looks like that’s the place where the workaround to rebuild the table first appears.

The bug/patch matrix for bug 22567457 shows no patch at present, and a quick check of an 18.3 database shows that the default configuration for the smon_scn_time table is still as a table clustered by thread.

It’s worth noting the following comment from Doc ID 2389804.1:

How to recreate the smon_scn_time table so it’s not a clustered table to avoid ORA-2475.  This should be done proactively if using Flashback Versions Query or Flashback Data Archive.

That's my emphasis on "proactively". The table and its cluster are (currently) created in script $ORACLE_HOME/rdbms/admin/dtxnspc.bsq, so if you create databases by script you could edit this script to make the change semi-permanent. If you're creating PDBs from a seed, though, you may have to find a different strategy to follow.

Update Jan 2020

A quick check on 19.3 shows that smon_scn_time is still an index-clustered table, clustered on the thread.

 

 

19 Comments »

  1. Jonathan, this sounds like a serious issue in an option we nearly went for at one point. Have Oracle responded?

    Comment by John Thomas — March 4, 2014 @ 10:39 am GMT Mar 4,2014 | Reply

    • It’s not mine – I don’t raise SRs – but here’s a very recent bug which looks like an exact match: “Bug 18294320 : ORA-1555 ON FDA TABLES” (The title doesn’t give an immediate clue – but I was searching the bug database on flashback and smon_scn_time). Reading the bug there’s a suggested temporary workaround of recreating the smon_scn_time table as a non-clustered table.

      I also infer from the content of the bug that you might not notice that you have a problem until you finally want to use the archive – unless you happen to have previously noticed that the smon trace file had grown very large from dumping ORA-02475 error messages.

      Comment by Jonathan Lewis — March 6, 2014 @ 10:57 am GMT Mar 6,2014 | Reply

  2. Bug 22456983 : LIMIT ON SMON_SCN_TIME AFFECTING FDA
    Unfortunately, the publicly visible bug text doesn’t provide enough information. Only that a 12c customer is unable to insert data !

    Comment by Hemant K Chitale — January 21, 2016 @ 9:32 am GMT Jan 21,2016 | Reply

  3. Oracle Support suggests to rename the mapping table, recreate it as a non-clustered and finally reload the data.

    Comment by Nenad Noveljic — August 22, 2016 @ 9:02 pm BST Aug 22,2016 | Reply

    • Nenad,

      Do you have a MoS document or bug number where that suggestion is officially made ? It would be nice to have the reference for future readers.

      Comment by Jonathan Lewis — August 23, 2016 @ 1:18 pm BST Aug 23,2016 | Reply

      • Jonathan,
        First and foremost, thank you for, as always, an excellent analysis.
        Unfortunately, there is no official document for fixing the problem. However, I got the following procedure from Oracle Support after opening a service request:

        rename smon_scn_time to smon_scn_time_org
        /
        create table smon_scn_time
        tablespace sysaux
        as select * from smon_scn_time_org
        /
        drop index smon_scn_time_tim_idx
        /
        create unique index smon_scn_time_tim_idx on smon_scn_time(time_mp)
        tablespace SYSAUX
        /
        drop index smon_scn_time_scn_idx
        /
        create unique index smon_scn_time_scn_idx on smon_scn_time(scn)
        tablespace SYSAUX
        /

        Note, there would be a data loss between the first occurrence of ORA-02475 and applying the fix.
        Prior to obtaining the final fix, I was shown the content of the MOS How To Recreate SMON_SCN_TIME Table and Related Cluster ( Doc ID 837812.1 ) which turned out to be useless, because it just recreates the table without taking care of the old data.
        Oracle Development is currently working on a sustainable solution which will be implemented in one of the future releases. The bug is tracked under Bug 22567457 : PERMANENT FIX FOR SMON_SCN_TIME LIMITS.

        Comment by Nenad Noveljic — August 23, 2016 @ 5:26 pm BST Aug 23,2016 | Reply

        • Nenad,

          Thanks for the reply – there are a couple more notes on MoS which suggest the same strategy, I think I’ve used a search for smon_scn_time to find them in the past.
          One thought crosses my mind on reading the SQL you’ve posted, though. The indexes are both declared unique, but the table holds data from multiple instances if you’re running RAC and I can’t think of any good reason why two instances couldn’t record the same SCN or same time_mp at the same time.

          I'll have to check it on a RAC system, but it looks as if uniqueness ought to be based on (thread, SCN) and (thread, time_mp); maybe Oracle has a lock that is taken by each instance in turn so that they can't collide. Or maybe (rather than having multiple instances record the values) only one smon ever populates the table.

          Comment by Jonathan Lewis — August 23, 2016 @ 5:52 pm BST Aug 23,2016

  4. Jonathan,
    I couldn't find any publicly available MoS note. Curiously, I got the procedure from support only after an advanced support engineer got involved. The default answer from the support engineer was the already mentioned MOS note How To Recreate SMON_SCN_TIME Table and Related Cluster ( Doc ID 837812.1 ), which would have implied a data loss.
    As the SR is still open, I asked the questions about RAC. Could you imagine that the smon just tries to insert the data, gets a constraint violation in the case that the SCN already exists, and ignores the error?

    Comment by Nenad Noveljic — August 23, 2016 @ 7:34 pm BST Aug 23,2016 | Reply

    • Nenad,
      I won’t make any guesses until I’ve looked at a live RAC system, but the following is probably a good clue of how Oracle avoids collisions:

      
      SQL> select * from V$lock_type where lower( description) like '%smon%';
      
      TYPE NAME                             ID1_TAG                          ID2_TAG                          IS_
      ---- -------------------------------- -------------------------------- -------------------------------- ---
      DESCRIPTION
      ------------------------------------------------------------------------------------------------------------------------------------
      TD   KTF map table enqueue            0                                0                                NO
      Serializes updates and inserts to the SMON_SCN_TIME mapping table
      
      TE   KTF broadcast                    0                                0                                NO
      Serializes broadcasts for flushes to SMON_SCN_TIME
      
      SM   SMON Serialization               0                                0                                NO
      Lock to check SMON global work in RAC
      
      

      Comment by Jonathan Lewis — August 23, 2016 @ 7:45 pm BST Aug 23,2016 | Reply

      • Jonathan,
        Here is what Oracle Support has said about avoiding collisions in RAC:
        “The SCN changes whenever the database issues a COMMIT statement and it’s a system-wide value stored in the control files for each node. In a RAC cluster, each of the nodes generate its own SCN and then synchronizes it to the rest of the RAC nodes. The Lamport algorithm works with system change numbers (SCN) and the distributed lock manager (DLM) to control and lock management and conflict resolution.
        From Oracle 10gR2 onwards, the LGWR background process will broadcast the current SCN to all the instances (nodes) in the cluster, before committing any transaction. “

        Comment by Nenad Noveljic — August 25, 2016 @ 2:34 pm BST Aug 25,2016 | Reply

        • I'm never too sure when talking to Oracle support whether I've got someone who actually knows what they're talking about, or whether it's someone who has read a note that isn't visible to customers – but that comment about the SCN being stored in the control files doesn't sound likely – you don't want EVERY commit to have to lock and then wait to write to a control file, and you don't want every commit to have to re-read every node's "commit record" before deciding what its SCN should be. On the other hand you do need to keep the instances synchronized so that they are all running at the same SCN as much as possible – hence the broadcast on commit.

          That doesn’t stop two (or more) instances attempting to commit at the same instant and then causing a (possibly very small) delay if they have to try to resolve the conflict. As far as the SMON_SCN_TIME update is concerned, though, that collision can be pre-empted by something like the TD lock above – if it’s a global lock (hence the reference to the DLM) each instance can guarantee that it’s the only one holding it exclusive when they update the table, and it gives the instance a mechanism to guarantee that it gets itself to a new (global) SCN before inserting a new row.

          On the other hand, the time_mp column (used for the other unique index) is the number of seconds since 1st Jan 1970. So if there’s a delay mechanism trying to guarantee uniqueness then it’s a pretty long delay (one second per node). I’ll have to try it, but I think the solution may simply be that each smon doesn’t even try to insert a row if there’s a collision (the TD lock, if that’s the one used, could carry the time_mp value from the previous insert), or maybe it allows the “duplicate key” error and ignores it.

          Comment by Jonathan Lewis — August 28, 2016 @ 9:56 am BST Aug 28,2016

        • Jonathan,
          Thank you for sharing the insights.

          Comment by Nenad Noveljic (@NenadNoveljic) — August 31, 2016 @ 11:21 pm BST Aug 31,2016

        • Nenad,

          Just a couple of details – though nothing like a proper description – for RAC:

          Despite the THREAD column suggesting that each instance keeps its own set of rows, the thread is always set to zero whichever instance inserts or updates.

          In general there is only one insert every 5 minutes, but the smons can get a little out of sync, and there is some code that seems to check whether the smon that inserted a row has inserted "enough" SCN mapping values; if not, a second smon can insert another row for the same time period (even though the time stamp and SCN will be slightly different). For example, I have an smon_scn_time table with two rows for 10:14:20 and 10:14:27: the first has num_mappings = 0, the second has num_mappings = 89. Whenever I see two rows for a time period the first one seems to have zero num_mappings – but that may be a coincidence.

          I’ve also noted that the time_dp can slip from the date/time supplied by sysdate by a significant margin (perhaps smon is calling vktm instead of calling the system time for its map).

          Comment by Jonathan Lewis — September 1, 2016 @ 12:41 pm BST Sep 1,2016

  5. Hello Jonathan,

    It looks to me that there is a contradiction between Oracle's internal suggestion to rebuild the SMON_SCN_TIME as a non-clustered table and the documentation for ORA-02475, which indicates "Increase SIZE parameter in CREATE CLUSTER statement" as one of the possible ways to work around this error.

    The documentation for clustered tables does not say too much about any limitation of the total number of rows that can be stored for each cluster key value. It, instead, explicitly specifies the opposite: “SIZE does not limit the space that can be used by a given cluster key.”.

    If the ROWID in a cluster index entry contains the starting block of the chain of blocks that store the rows for a given cluster key then, if I understand it correctly, this "starting block" should be the same for all the index entries for the same cluster key value. But I don't exactly understand why this means that the total number of blocks in the chain should be limited.

    Thanks a lot in advance if you can shed a little more light on this topic.

    Best Regards,
    Iudith Mentzel

    Comment by Iudith Mentzel — January 28, 2020 @ 2:32 pm GMT Jan 28,2020 | Reply

  6. Iudith,

    Thanks for commenting.
    A common problem with documents and notes on MOS is that they leave space for ambiguity, and even when you check the reference manuals the ambiguities are not necessarily resolved.

    The first point, of course, is that this error number is "supposed" to be for HASH clusters, but will appear with INDEX clusters (which is what smon_scn_time's cluster is). So there's scope for confusion if the parameters for the two different types of cluster aren't completely consistent.

    The SIZE parameter is supposed to specify the space requirement for all the rows for a single cluster key; it has a default (and maximum) value of the block size of its holding tablespace. You may want to specify a smaller value because you know that you should be able to fit (say) 6 cluster keys with all their rows in a single block and it might be silly to reserve a whole block for each cluster key.

    The context of "SIZE does not limit the space" is "within a block". If you've set a SIZE based on the typical cluster key having 6 rows and 360 bytes associated with it (say you've set SIZE to 400) then you may have some cluster keys that have 20 rows and need 1,200 bytes – and Oracle will happily store all those rows for the key, and if the space happens to be available in the first block for that key they will all go into that block. (This may mean one of the other cluster keys that "belongs" in that block has to chain to a second block and store some rows there because the first block becomes prematurely full.)

    If you've miscalculated badly and decided every cluster key needs only 1,000 bytes and set the SIZE accordingly (so you expect each block to hold 8 key values), but every cluster key needs 70MB and the data is arriving in a very even pattern, then each cluster key will keep chaining to more and more blocks and (hypothetically) the 8 cluster keys that started in block "X" of the table will end up using 1,000 bytes each from each of 70,000 blocks – except that before Oracle gets to that position some inserts will fail with error "ORA-02475: maximum cluster chain block count of 65535 has been exceeded." In that (highly unlikely) set of circumstances you could rebuild the cluster and set the SIZE to its maximum (i.e. let it default to the block size), use about 8,000 bytes from each block, and get the 70MB of data per key chaining to "only" about 8,750 blocks. (In this case you've probably made a bad choice choosing a simple cluster for the table.)
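    A back-of-envelope sketch of that hypothetical miscalculation (all the numbers are the illustrative ones from the paragraph above, not measurements):

```python
# Blocks in one cluster key's chain for ~70MB of data per key, under
# the two hypothetical SIZE settings discussed in the text.

CHAIN_LIMIT = 65535          # the ORA-02475 threshold
KEY_DATA = 70_000_000        # ~70MB of rows per cluster key (illustrative)

def chain_blocks(usable_bytes_per_block):
    # ceiling division: how many blocks the chain grows to for one key
    return -(-KEY_DATA // usable_bytes_per_block)

# SIZE = 1,000 with 8 keys sharing each block: each key uses only
# ~1,000 bytes from every block it chains through - limit exceeded.
print(chain_blocks(1000), chain_blocks(1000) > CHAIN_LIMIT)   # 70000 True

# SIZE defaulted to the block: ~8,000 usable bytes per block.
print(chain_blocks(8000))    # 8750
```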

    The rowid question: a little bit of ambiguity of my own: "An index entry in a cluster index points to a chain of cluster blocks, and the last two bytes of the "rowid" in the index entry identify which block within the chain the cluster key scan should start at". The scan starts at the END (newest insert) not the beginning, so it's effectively a counter of the number of blocks in the chain, and 2 bytes means 65,535. I've never investigated closely what happens if you're constantly deleting and inserting rows – whether Oracle is clever about trying to reuse blocks in the right chain that have space, or whether the chain just keeps growing, leaving you in a position where you couldn't insert a row for a cluster key because it had 65,535 blocks in the chain which were essentially empty.

    Regards
    Jonathan Lewis

    Comment by Jonathan Lewis — January 28, 2020 @ 3:52 pm GMT Jan 28,2020 | Reply

  7. Hello Jonathan,

    Thanks a lot for the so detailed explanation.

    What I think we are missing is a detailed analysis of the content of a ROWID stored in a cluster index, when compared to the ROWID-s stored in a normal B*TREE index on a non-clustered table.

    For a “normal” ROWID, there is a lot of technical explanation regarding the ROWID structure and its components, but there is no similar explanation for the ROWID-s stored in a cluster index, if it is different from a “normal” ROWID.

    The Concepts guide only shortly says the following:
    “The B-tree cluster index associates the cluster key value with the database block address (DBA) of the block containing the data.”.

    A natural conclusion from here would be that the cluster index will contain as many entries for each cluster key value as there are blocks occupied by rows belonging to that cluster key, each index entry pointing to one such block.

    But, from this short definition, I still don’t see any reason for having any limitation on the number of rows that can be stored for a cluster key value.

    At most, for making the cluster perform more efficiently, if specifying a high SIZE value, Oracle might consider reserving several physical contiguous blocks for each cluster key, to attempt to keep the data physically clustered by the key.

    In case of a SIZE miscalculation, the only result I could expect would be either a waste of space (if SIZE was too big) or a less optimal storage (if SIZE was too small).

    Many thanks & Best Regards,
    Iudith

    Comment by Iudith Mentzel — January 28, 2020 @ 4:51 pm GMT Jan 28,2020 | Reply

    • Iudith,

      I think the only place I’ve seen any detailed description of how the index of an index cluster operates is in Practical Oracle 8i.

      A couple of key points that impact the limit:
      a) there’s only one index entry for each cluster key value
      b) the maximum SIZE declaration is the size of the Oracle block.

      As you say, it sounds like a good idea to allow the SIZE to cater for multiple Oracle blocks (at the risk of a lot of wasted space). Amongst other things it would improve the clustering of related data when you had a reasonable number of rows per key, and I would have taken advantage of index clusters a couple of times in the past if that had been the case, but Oracle simply doesn't do that. Index clusters really haven't changed very much since v6, and hash clusters follow many of the same principles when the number of rows per key gets too large.

      Just FYI, I’ve created a clustered table with a rowsize of 2100 bytes – which allows 3 rows per block – then inserted 100 rows. There’s one index entry in the cluster index at this point:

      row#0[8019] flag: -------, lock: 2, len=13, data:(8):  05 80 00 ab 00 00 22 00
      col 0; len 2; (2):  c1 02
      

      The cluster key value is the number 1 (internal representation c1 02).
      The rowid is 8 bytes – the first six bytes give the file, block, and row number within block of the cluster key directory entry. (If the cluster holds N tables then any block in the cluster will have up to N+1 row directories: one for the rows that are the cluster keys, and one for each of the tables in the cluster that has rows stored in that block.)

      The last two bytes tell us that the index is pointing to the 34th (0x22) block in the chain of blocks that hold that cluster key.

      After I insert 4 more rows with the same cluster key into the table the one index entry looks like this:

      
      row#0[8019] flag: -------, lock: 2, len=13, data:(8):  05 80 00 ac 00 00 23 00
      col 0; len 2; (2):  c1 02
      
      

      The rowid is now pointing to a different block in the cluster (in fact it’s the next block in the segment), but the cluster key value is (again) the first (00 00) entry in the row directory for cluster keys in that block. But the last two bytes now show that the index is pointing at the 35th (0x23) block in the chain of blocks for that cluster key.
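      The byte layout described above can be sketched as a quick decode of the two dumped entries. The helper name is mine, and the little-endian order of the final two-byte counter is my inference from the dumps (22 00 reading as 0x0022 = 34):

```python
# Split the dumped 8-byte cluster index "rowid" into the pieces
# described in the text: 4-byte DBA, 2-byte slot in the cluster key
# row directory, and the 2-byte position in the block chain.

def decode_cluster_index_rowid(dump):
    b = bytes.fromhex(dump.replace(" ", ""))
    dba = int.from_bytes(b[0:4], "big")            # data block address
    key_slot = int.from_bytes(b[4:6], "big")       # slot in key directory
    chain_position = int.from_bytes(b[6:8], "little")  # 0x22 -> 34
    return dba, key_slot, chain_position

for dump in ("05 80 00 ab 00 00 22 00", "05 80 00 ac 00 00 23 00"):
    dba, slot, pos = decode_cluster_index_rowid(dump)
    print(f"dba=0x{dba:08x} key_slot={slot} chain_position={pos}")
```

Run against the two dumps this reports chain positions 34 and 35, matching the 34th/35th blocks noted above.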

      And here’s the cluster key row from one of the blocks in the cluster to show what they look like:

      
      tab 0, row 0, @0x1f82
      tl: 22 fb: K-H-FL-- lb: 0x0  cc: 1
      curc: 3 comc: 3 pk: 0x0580009c.0 nk: 0x058000a3.0
      col  0: [ 2]  c1 02
      
      

      The cluster keys are always “table 0” in the blocks of a cluster, and since all my data uses only a single key value I’ve only got a “row 0” which has the value 1 (c1 02).

      The curc: 3 comc: 3 tell you there are three rows in the block for that key (I haven't investigated what the two different numbers mean – possibly (probably) current rows and committed rows). But then we have the pk/nk – which to me is "previous key"/"next key" – the blocks for a given cluster key value are arranged in a doubly linked list, so when we've examined this block we know how to go forward or backward to the next block that has rows for this key value. (The doubly linked list may be a mechanism that allows Oracle to be more efficient when it wants to unlink blocks where all the rows for that key have been deleted.)

      Regards
      Jonathan Lewis

      P.S. Don’t ask too many question about fine details for the above – it’s been about 20 years since I last looked at it.

      Comment by Jonathan Lewis — January 28, 2020 @ 6:49 pm GMT Jan 28,2020 | Reply

  8. Hello Jonathan,

    Thank you so much for all these details :)

    From this structure it looks like Oracle is in fact able to find all the blocks for a given key value by using only the first 6 bytes of the ROWID, and from there on using the pk/nk information of the doubly linked list – that is, without considering at all which block in the chain the last two bytes of the ROWID are pointing to (the last one, the first one, or even a "middle" one).

    It looks to me that, for some reason, index clusters are very rarely used, so maybe this explains why there is so little information about how exactly they work.

    The fact that Oracle only creates one index entry for each cluster key value does suggest that originally the index cluster was probably designed based on the idea of having at most one single block of data for each cluster key.

    Thanks a lot one again & Best Regards,
    Iudith Mentzel

    Comment by Iudith Mentzel — January 28, 2020 @ 8:45 pm GMT Jan 28,2020 | Reply

