It doesn’t matter which bit of Oracle technology you want to use, eventually someone, somewhere, runs into the special case where something nasty happens. Here’s an edge case for people using (index) clusters – Oracle Bug 17866999 ora-1499 for cluster following rman convert
It comes from a conversation on Oracle-L where Jack van Zanen reported a problem of inconsistent results after migrating data between platforms using rman to converts some tablespaces containing index clusters. This is the starting post where he shows a query that is clearly getting the wrong answer (select where channel_number = 503 obviously shouldn’t return data with channel_number 501).
Unfortunately the “audit-trail” on this problem is incomplete because I accidentally started sending email to Jack instead of the list (reply, rather than reply all) – but given that the problem column was the cluster key of an index cluster the obvious guess was that something had gone wrong as the data for this key value had grown and chained into a second (or third) block. To confirm this I asked Jack to select the rowids reported for the cluster and use the dbms_rowid package to convert them into distinct (file_number, block_number) values:
select dbms_rowid.rowid_relative_fno(rowid) file_no, dbms_rowid.rowid_block_number(rowid) block_no, count(*) rows_in_block from &m_table t1 group by dbms_rowid.rowid_relative_fno(rowid), dbms_rowid.rowid_block_number(rowid) order by dbms_rowid.rowid_relative_fno(rowid), dbms_rowid.rowid_block_number(rowid) ;
With a few file and block numbers in hand, we then needed to see the results from dumping the blocks. Jack sent me the trace files, and I found that my guess was correct – and also discovered that I had been on a private discussion, so I posted the results back to the list. As you can see from that posting, the cluster key (indicated by the initial K in the flag byte, and by its presence in “tab 0” in the block) is 4 columns long, and I’ve got entries from two blocks where the cluster keys are pointing to each other – but the cluster keys don’t hold the same value.
It’s an extraordinary corruption – the data structures appear to be totally self-consistent and perfectly clean, and yet the data content is wrong. (Usually I find that a corruption means the data becomes unusable).
Oracle eventually tracked this down and reproduced it; it’s a problem that can arise if you are using index clusters and use rman to do a cross-platform conversion – and it’s not going to be easy to fix it. I don’t suppose many people are likely to see this bug (it’s been around for years, apparently, but there aren’t many people who use clusters, and of those perhaps very few have migrated between platforms using rman) – but at present the only workaround is to export and reimport the data; which is not nice when you’re talking about terabytes of clusters.
Update (Feb 2020)
I question came up on the Oracle Developer Community forum today that prompted me to check something I had written about clusters, which is how I came to rediscover this note. Curiosity prompted me to check on the status of the bug mentioned above (17866999) and I discovered that Oracle had supplied a patch on 25th Feb 2014, just over 3 weeks after I wrote this note – so really quite a prompt response.
However, I then read the “Rediscovery Information” for the bug and found the following:
NOTE: This bug fix provides a method of correcting the data so that the error is no longer seen. It does not affect the RMAN CONVERT.
It looks like the patch doesn’t eliminate the problem, it just lets you clear up the mess afterwards :(
Maybe there’s a later patch/upgrade that makes the problem go away completely, but the bug note itself flags the problem as fixed in 12.2
Jonathan,
We’re not usually so close to the bleeding edge as this – RMAN cross-platform conversion on index clusters – but “It doesn’t matter which bit of Oracle technology you want to use, eventually someone, somewhere, runs into the special case where something nasty happens” seems to be a frequent experience. We ran into a show-stopper in RMAN that meant a test run produced 100,000 spurious errors due to a defect in replay where CURSOR_SHARING=FORCE (don’t dive in, I know we shouldn’t use it, but we have some antique apps in technologies that will not be updated to use binds properly).
Probably held us up for a number of months in total. But I don’t think it’s the only case of this I’ve run into in recent years. I wonder whether Oracle will/can do anything to reduce the number of defects in “edge cases” that are actually core paid-for technology.
Comment by John Thomas — February 2, 2014 @ 10:43 pm GMT Feb 2,2014 |