Oracle Scratchpad

August 24, 2020

Flashback Bug

Filed under: 18c,Bugs,Oracle,redo,Troubleshooting — Jonathan Lewis @ 9:34 am BST Aug 24,2020

Here’s a problem with the “flashback versions” technology that showed up at the end of last week. There’s a thread about it on the Oracle Developer community forum, and a chain of tweets that was my initial response to a twitter alert about it that Daniel Stein posted.

The problem appears somewhere in the 18c timeline – it doesn’t seem to be present in 12.2.0.1 – so if you’re running any versions 18+ here’s a modified version of the test case supplied by Daniel Stein to demonstrate the issue.

rem
rem     Script:         flashback_bug.sql
rem     Author:         Jonathan Lewis / Daniel Stein
rem     Dated:          Aug 2020
rem
rem     Last tested 
rem             19.3.0.0        Wrong result
rem             12.2.0.1        Correct result
rem

create table t1 (n1 number(4)) 
segment creation immediate
;

prompt  =============================================================
prompt  Sleeping 10 seconds after create table to avoid subsequent
prompt  ORA-01466: unable to read data - table definition has changed
prompt  =============================================================


execute dbms_lock.sleep(10)

prompt  =========
prompt  Start SCN
prompt  =========

column current_scn new_value scn_vorher2 
select to_char(current_scn,'9999999999999999') current_scn from V$DATABASE;

insert into t1(n1) values (1);
insert into t1(n1) values (2);
insert into t1(n1) values (3);
insert into t1(n1) values (4);
insert into t1(n1) values (5);
insert into t1(n1) values (6);
insert into t1(n1) values (7);
insert into t1(n1) values (8);
insert into t1(n1) values (9);
insert into t1(n1) values (10);
insert into t1(n1) values (11);
insert into t1(n1) values (12);

delete from t1 where n1 in (2, 5, 8, 11);

commit;

prompt  =======
prompt  End SCN
prompt  =======

column current_scn new_value scn_nachher
select to_char(current_scn,'9999999999999999') current_scn from V$DATABASE;

prompt  =============
Propmt  Version Query 
prompt  =============

column VERSIONS_STARTSCN        format 9999999999999999 heading "Start SCN"
column VERSIONS_ENDSCN          format 9999999999999999 heading "End SCN"
column VERSIONS_OPERATION       format A9               heading "Operation"
column SCN_RANGE                format A35              heading "SCN Range"

select 
        n1, rowid, 
        versions_operation, versions_startscn, versions_endscn,
        &scn_vorher2||' and '||&scn_nachher scn_range
from
        t1 versions between scn &scn_vorher2 and &scn_nachher
order by 
        rowid, versions_startscn, versions_operation, versions_endscn
;

prompt  ==================
prompt  Current Table data
prompt  ==================

select n1,rowid from t1;


alter system dump redo scn min &scn_vorher2 scn max &scn_nachher;

I’ve created a table then inserted a few rows and deleted some of them in the same transaction. I’ve captured the database SCN at the start and end of the transaction and then tried to run a “versions between” query across that range of SCNs. Here are the results from the “versions between” query and the final query of the current contents of the table on a test on a database running 19.3:

        N1 ROWID              Operation         Start SCN           End SCN SCN Range
---------- ------------------ --------- ----------------- ----------------- -----------------------------------
         1 AAAXFOAATAAAACDAAA I            12670408139684                   12670408139679 and 12670408139687
         3 AAAXFOAATAAAACDAAC I            12670408139684                   12670408139679 and 12670408139687
         4 AAAXFOAATAAAACDAAD I            12670408139684                   12670408139679 and 12670408139687
         5 AAAXFOAATAAAACDAAE I            12670408139684                   12670408139679 and 12670408139687
         6 AAAXFOAATAAAACDAAF I            12670408139684                   12670408139679 and 12670408139687
         7 AAAXFOAATAAAACDAAG I            12670408139684                   12670408139679 and 12670408139687
         8 AAAXFOAATAAAACDAAH I            12670408139684                   12670408139679 and 12670408139687
         9 AAAXFOAATAAAACDAAI I            12670408139684                   12670408139679 and 12670408139687
        10 AAAXFOAATAAAACDAAJ I            12670408139684                   12670408139679 and 12670408139687
        11 AAAXFOAATAAAACDAAK I            12670408139684                   12670408139679 and 12670408139687

10 rows selected.

        N1 ROWID
---------- ------------------
         1 AAAXFOAATAAAACDAAA
         3 AAAXFOAATAAAACDAAC
         4 AAAXFOAATAAAACDAAD
         6 AAAXFOAATAAAACDAAF
         7 AAAXFOAATAAAACDAAG
         9 AAAXFOAATAAAACDAAI
        10 AAAXFOAATAAAACDAAJ
        12 AAAXFOAATAAAACDAAL

8 rows selected.

There’s a trap in this demonstration. If you follow the link to the forum thread you’ll find that Daniel Stein says there’s a problem with Oracle 19.6 and upwards (but not 12.2) – then various other people test on their versions and find that 19.7, 18.5, and 19.3 are broken. That last one is what made me run the test on the 19.3 I had at hand – and I got the right result, unlike the results above.

If you look at the output from this test run you’ll see that the versioning query says that in the SCN range I’ve inserted 10 rows, and that rows 2 and 12 have never been inserted. Given the SQL I’d run I was expecting to see 16 rows in the output, 12 of them saying I’d inserted values 1 through 12, and then 4 saying I’d deleted 2, 5, 8 and 11 – but (when it works correctly) Oracle manages to combine the insert and delete in the same transaction to cancel the entries. So the versioning query ought to be showing exactly the 8 rows that appear in the current data query.

So I had two puzzles:

  • What’s gone wrong for everyone else ?
  • Why hasn’t it gone wrong for me ?

To answer the second question first: I took a look at my instance parameters and found that I had set statistics_level to all. Clearly this ought to have nothing to do with an error in flashback version queries, but I changed it to typical anyway and re-ran the test: and got the (now expected) wrong results. If you read the twitter thread you’ll find that Oracle support had had an SR opened on this problem for a few days but had been unable to reproduce the problem – perhaps they, too, had a system with statistics_level set to all.

Getting back to the first question: what’s gone wrong. I couldn’t think of a good reason why setting the statistics_level should make a difference, so I took the view that the versioning query is running through the redo log to find all the changes that took place in the range, so maybe the answer will appear in the redo. That’s why the last line in the demo above is a command to dump the redo generated between the two SCNs.

Short-circuiting the details of what I looked at I’m just going to show you the layer 11 (table) redo change vectors for my DML, which I extracted from the trace file using the command grep “OP:11” {filename}.

First the set when statistics_level = all:


CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1b SEQ:2 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1b SEQ:3 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:2 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:3 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:4 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:5 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:6 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:7 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:8 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:9 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:10 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:2 CLS:1 AFN:10 DBA:0x01009c57 OBJ:720 SCN:0x00000b860f2d3cb8 SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1c SEQ:11 OP:11.3 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1d SEQ:1 OP:11.3 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1d SEQ:2 OP:11.3 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1d SEQ:3 OP:11.3 ENC:0 RBL:0 FLG:0x0000

What you can see are 13 op codes “11.2” – which is “insert row piece”. Most of these are for OBJ 94554, which is the object_id for my table, and one of them if for OBJ 720, which is a table owned by sys called exp_head$ (which is something to do with “expression tracking”). After the inserts there are 4 op Ccodes “11.3” which is “delete row piece”. So there’s no obvious error or complication in the redo.

And now when we set statistics_level to typical and re-run the script (which means we get a new OBJ number):

CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8a SEQ:2 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8a SEQ:3 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:2 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:3 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:4 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:5 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:6 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:7 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:8 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:9 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:10 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:2 CLS:1 AFN:10 DBA:0x01009c57 OBJ:720 SCN:0x00000b860f2d3d1d SEQ:1 OP:11.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:11 OP:11.12 ENC:0 RBL:0 FLG:0x0000

In this case we get 13 “insert row piece” Op Codes, and then a single “11.12” which is “delete multiple rows”. So there IS a difference in the redo, and something about that difference is probably behind the error coming and going.

Taking a closer look at the 11.12 and one of the 11.2 redo change vectors, this is what we see:


CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94545 SCN:0x00000b860f2d3d8b SEQ:11 OP:11.12 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x04400a24.0672.3c
KDO Op code: QMD row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x04c00083  hdba: 0x04c00082
itli: 1  ispac: 0  maxfr: 4858
tabn: 0 lock: 1 nrow: 4
slot[0]: 1
slot[1]: 4
slot[2]: 7
slot[3]: 10


CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:19 DBA:0x04c00083 OBJ:94544 SCN:0x00000b860f2d3d1d SEQ:3 OP:11.3 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x02  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: C  uba: 0x04400467.0595.2c
KDO Op code: DRP row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x04c00083  hdba: 0x04c00082
itli: 1  ispac: 0  maxfr: 4858
tabn: 0 slot: 10(0xa)

The 11.12 Op Code shows us a list of 4 entries in the block’s “row directory” for “table 0”. THe 11.3 Op Code shows us just one – it’s the last of the four deletes so it’s reporting entry 10 in the “row directory”, the previous three Op Code 11.3 vectors were reporting slots 1, 4, and 7 respectively.

Obviously we can’t draw any firm conclusions about what’s going wrong, we simply have a few observations that might give us some ideas of the type of error; and after a few more experiments I decided that the code handling flashback versioning mis-interpreting the “delete multiple row” record. It consistently seemed to delete the first row identified the slot[0] entry and then jumped straight to the last slot but add one to the index entry before attempting to apply the vector.

Footnote

I’ve used dbms_lock.sleep() to pause for 10 seconds after creating the table. The sleep() procedure was copied into the dbms_session package in recent versions of Oracle so that end-user code could access it without having to receive privileges on the rather more dangerous dbms_lock package.

I’ve referenced the statistics_level parameter as the one affecting the generation of the “delete multiple row” redo vector. In fact you can get the same effect by setting the hidden parameter “_rowsource_execution_statistics” to true. Setting statistics_level to all has the effect of enabling rowsource execution statistics so this isn’t surprising.

I pointed out that the flashback versioning code seemed to “add 1” to the final slot[n] identifier before producing the flashback output. Think about that that means if you change the demonstration to delete the LAST row in the list of rows inserted. Here’s what mning query did in a test when I changed 11 to 12 in my delete statement:


ERROR at line 2:
ORA-00600: internal error code, arguments: [ktrvtgr_2], [79691907], [0], [11], [], [], [], [], [], [], [], []

79691907 is the block number of the block where the table rows are stored.
11 is the correct row directory index for the last row (12) in the table – buy my hypothesis is that the flashback code was trying to find (non-existent) directory index 12 because it adds one to the supplied index entry.

Update

In the interval between completing the final draft yesterday and publishing the note this morning, Oracle support replied to the the forum thread with a bug number (28910586 – not yet publicly visible) and a note that a patch could be available on request if an SR were raised giving the Oracle version and platform. It will be interesting to see whether the patch blocks the creation of the 11.12 op codes or whether it corrects the interpretation of 11.12 codes by the flashback versions code.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.