A thread started on the Oracle-L list-server a few days ago asking for help analysing a problem where a simple “insert values()” statement (one that handled millions of rows per day) was running very slowly. There are many reasons why this might happen, ranging from the trivial (someone has locked the table in exclusive mode), through the slightly subtle (we’re trying to insert a row that collides on a uniqueness constraint with an uncommitted insert from another session) and the subtle (Oracle has to read through the undo to check current versions of blocks against read-consistent versions), to the esoteric (the ASSM space management blocks are completely messed up again).
A 10046 trace of a session doing an insert showed only that there was a lot of time spent on single block reads. Since this was on an Exadata system the waits were reported as “cell single block physical read” – and unfortunately the parameters to this wait event are “cellhash#”, “diskhash#”, and “bytes”, so we don’t see the file_id and block_id that would be very helpful in a case like this. The only information we got from the trace file was that the object_id was that of the table where rows were being inserted.
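(For anyone who wants to capture this type of trace from another session: the dbms_monitor package will do it. This is only a sketch – the sid and serial# below are hypothetical placeholders, and you’d look up the real values in v$session first.)

rem  Enable extended tracing (waits, no binds) for a target session
execute dbms_monitor.session_trace_enable(session_id => 1070, serial_num => 12345, waits => true, binds => false)

rem  ... let the slow inserts run for a representative interval, then:
execute dbms_monitor.session_trace_disable(session_id => 1070, serial_num => 12345)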
Before digging into exotic debugging methods, the OP supplied us with a 1-second session report from Tanel Poder’s snapper script:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 SID @INST, USERNAME, TYPE, STATISTIC, DELTA, HDELTA/SEC, %TIME, GRAPH, NUM_WAITS, WAITS/SEC, AVERAGES
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1070 @2, SYSADMIN, STAT, session logical reads, 13865, 7.73k, , , , , 14.1k total buffer visits
 1070 @2, SYSADMIN, STAT, user I/O wait time, 141, 78.65, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, non-idle wait time, 141, 78.65, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, non-idle wait count, 12230, 6.82k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical read total IO requests, 6112, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical read requests optimized, 6111, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical read total bytes optimized, 50069504, 27.93M, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical read total bytes, 50069504, 27.93M, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, cell physical IO interconnect bytes, 50069504, 27.93M, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, gcs messages sent, 3, 1.67, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, db block gets, 13860, 7.73k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, db block gets from cache, 13860, 7.73k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, db block gets from cache (fastpath), 7737, 4.32k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, consistent gets, 1, .56, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, consistent gets from cache, 1, .56, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, consistent gets pin, 1, .56, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, consistent gets pin (fastpath), 1, .56, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, logical read bytes from cache, 113541120, 63.34M, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical reads, 6111, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical reads cache, 6111, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, physical read IO requests, 6112, 3.41k, , , , , 8.19k bytes per request
 1070 @2, SYSADMIN, STAT, physical read bytes, 50069504, 27.93M, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, db block changes, 11, 6.14, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, free buffer requested, 6112, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, hot buffers moved to head of LRU, 958, 534.39, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, free buffer inspected, 6144, 3.43k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, shared hash latch upgrades - no wait, 7, 3.9, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, blocks decrypted, 6110, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, redo entries, 6120, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, redo size, 465504, 259.67k, , , , , ~ bytes per user commit
 1070 @2, SYSADMIN, STAT, redo entries for lost write detection, 6110, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, redo size for lost write detection, 464756, 259.25k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, redo subscn max counts, 7, 3.9, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, file io wait time, 1408659, 785.78k, , , , , 230.47us bad guess of IO wait time per IO request
 1070 @2, SYSADMIN, STAT, gc current blocks received, 3, 1.67, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, gc local grants, 6116, 3.41k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, ASSM cbk:blocks examined, 12366, 6.9k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, ASSM gsp:L1 bitmaps examined, 2478, 1.38k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, ASSM gsp:L2 bitmaps examined, 1, .56, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, ASSM gsp:reject db, 12388, 6.91k, , , , , ~ per execution
 1070 @2, SYSADMIN, STAT, buffer is pinned count, 230, 128.3, , , , , 1.63 % buffer gets avoided thanks to buffer pin caching
 1070 @2, SYSADMIN, STAT, cell flash cache read hits, 6723, 3.75k, , , , , ~ per execution
 1070 @2, SYSADMIN, TIME, background cpu time, 365192, 203.71ms, 20.4%, [##        ], , ,
 1070 @2, SYSADMIN, TIME, background elapsed time, 1273623, 710.45ms, 71.0%, [########  ], , , 28.95 % unaccounted time
 1070 @2, SYSADMIN, WAIT, gc current block busy, 629, 350.87us, .0%, [          ], 3, 1.67, 209.67us average wait
 1070 @2, SYSADMIN, WAIT, cell single block physical read, 1557638, 868.88ms, 86.9%, [WWWWWWWWW ], 6746, 3.76k, 230.9us average wait

--  End of Stats snap 1, end=2019-05-18 12:58:58, seconds=1.8
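(If you haven’t come across snapper before: a report like the one above comes from a call along the lines of the following, run from SQL*Plus while connected to the relevant instance. I’m quoting the snapper v4 syntax from memory – check the comments at the top of the script for the exact options.)

rem  one 1-second snapshot of all stats/times/waits for SID 1070
@snapper all 1 1 1070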
My first step was simply to read down the list (using a very small font to get the entire width on screen without wrapping) to see if anything stood out as unusual. The report showed two things I rarely see in the session stats:
blocks decrypted                          , 6110
redo entries for lost write detection     , 6110
These stats tell me that there are two “uncommon” features enabled: db_lost_write_protect, and block level encryption. (So whatever else is going on, it’s just possible that mixing in two rarely used – and therefore less frequently tested – features may be confusing the issue.)
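(If you wanted to confirm that the two features really are enabled, a couple of simple checks would do it – a sketch, assuming you have the privileges to query the views, and assuming the encryption is TDE tablespace encryption:)

select name, value from v$parameter where name = 'db_lost_write_protect';

select tablespace_name, encrypted from dba_tablespaces where encrypted = 'YES';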
Lost write protection means Oracle writes a “block read record” (BRR) to the redo log every time it reads a block from disc, so I decided to follow up the 6,110 figure to see what other stats reported similar values.
physical read total IO requests           , 6112
physical read requests optimized          , 6111
physical reads                            , 6111
physical reads cache                      , 6111
physical read IO requests                 , 6112
free buffer requested                     , 6112
redo entries                              , 6120
redo entries for lost write detection     , 6110
gc local grants                           , 6116
cell flash cache read hits                , 6723
There’s nothing particularly surprising here – basically we see all the blocks being read as single block reads, into cache. All the necessary global cache (gc) grants are local, so it’s possible the table of interest has been remastered to this node. The value for “cell flash cache read hits” looks a little odd as the cache is hit more frequently than blocks are read – but dynamic performance views are not read-consistent and this session is hammering away like crazy, so this might just be a side effect of the time it takes to gather the data for the report.
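(As an aside – if you wanted to pull a cluster of similar values like this straight from the database rather than by eye from snapper’s output, a query along the following lines would do it. Bear in mind that v$sesstat holds cumulative values while snapper reports deltas, so the numbers wouldn’t match the report above exactly; the sid and value range here are just for illustration.)

select
        sn.name, ss.value
from
        v$sesstat ss, v$statname sn
where
        sn.statistic# = ss.statistic#
and     ss.sid = 1070
and     ss.value between 6100 and 6130
order by
        ss.value
;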
We can chase the redo a little further – the number of redo entries is slightly larger than the number of blocks read, so (even though small inconsistencies are not necessarily meaningful) this might tell us something:
redo entries                              , 6120
redo size                                 , 465504
redo entries for lost write detection     , 6110
redo size for lost write detection        , 464756
db block changes                          , 11
The number of “redo entries” that were NOT for lost write detection is 10, totalling 748 bytes (not a lot – so indicative of “non-user” activity). The number of “db block changes” is 11 (close enough to 10), and generally it’s changes to db blocks that require redo to be generated. The final significant number is the one that isn’t there – there’s no undo generated, so no user-change to data. This system is working like crazy achieving absolutely nothing at this point.
The next point to ponder is what sort of work it is doing – so let’s check how the physical reads turn into buffer gets.
session logical reads                     , 13865
db block gets                             , 13860
db block gets from cache                  , 13860
db block gets from cache (fastpath)       , 7737
consistent gets                           , 1
consistent gets from cache                , 1
consistent gets pin                       , 1
consistent gets pin (fastpath)            , 1
hot buffers moved to head of LRU          , 958
buffer is pinned count                    , 230
The unusual thing you notice with these figures is that virtually every buffer get is a current get. We’ve also got a number of blocks pinned – this might just be the segment header block, or the segment header and level 2 bitmap block that we keep revisiting. Finally we can see a lot of hot buffers being moved to the head of the LRU; since our session has been doing a lot of work for a long time it seems likely that those buffers are ones that our session is keeping hot – and for a big insert that shouldn’t really be happening unless, perhaps, we were managing to do a lot of maintenance of (well-clustered) indexes.
I’ve isolated the (new in 12.2) “ASSM gsp (get space)” statistics from this output – they’re all about handling blocks, but I wanted to look at them without being distracted by other stats.
ASSM cbk:blocks examined                  , 12366
ASSM gsp:L1 bitmaps examined              , 2478
ASSM gsp:L2 bitmaps examined              , 1
ASSM gsp:reject db                        , 12388
We can see that we’ve examined 2,478 “level 1” bitmap blocks. A level 1 block holds the basic “bitmap” that records the state of a number of data blocks (typically 128 blocks once the object gets very large), so our session has worked its way through 2,478 maps trying to find a data block that it could use to insert a row. The “reject db” statistic tells us about data blocks that have been examined and rejected (presumably because the row we want to insert is too large to fit, or maybe because there are no free ITL (interested transaction list) entries available in the block). So we seem to be spending all our time searching for somewhere to insert rows. This shouldn’t really be happening – it’s a type of problem that Oracle has been worrying away at for quite some time: how do you avoid “losing” space by updating bitmap blocks too soon on inserts, without going to the opposite extreme and leaving bitmap blocks that claim the space is free when it’s actually in use by uncommitted transactions?
Note to self: I don’t know how we managed to reject more blocks (12,388) than we’ve examined (12,366), but possibly it’s just one of those timing glitches (the error is less than one fifth of one percent), or possibly it’s something to do with the reject count including some of the L1 bitmap blocks.
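(If you wanted a summary of what the segment’s bitmaps claim about its data blocks you could, in principle, call dbms_space.space_usage. This is only a sketch – the table name T1 is hypothetical, and you’d need “set serveroutput on” to see the results:)

declare
        m_unformatted_blocks    number;         m_unformatted_bytes     number;
        m_fs1_blocks            number;         m_fs1_bytes             number;
        m_fs2_blocks            number;         m_fs2_bytes             number;
        m_fs3_blocks            number;         m_fs3_bytes             number;
        m_fs4_blocks            number;         m_fs4_bytes             number;
        m_full_blocks           number;         m_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',        -- hypothetical table name
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks => m_fs1_blocks,     fs1_bytes => m_fs1_bytes,
                fs2_blocks => m_fs2_blocks,     fs2_bytes => m_fs2_bytes,
                fs3_blocks => m_fs3_blocks,     fs3_bytes => m_fs3_bytes,
                fs4_blocks => m_fs4_blocks,     fs4_bytes => m_fs4_bytes,
                full_blocks => m_full_blocks,   full_bytes => m_full_bytes
        );

        dbms_output.put_line('Unformatted blocks : ' || m_unformatted_blocks);
        dbms_output.put_line('FS1 (0-25% free)   : ' || m_fs1_blocks);
        dbms_output.put_line('FS2 (25-50% free)  : ' || m_fs2_blocks);
        dbms_output.put_line('FS3 (50-75% free)  : ' || m_fs3_blocks);
        dbms_output.put_line('FS4 (75-100% free) : ' || m_fs4_blocks);
        dbms_output.put_line('Full blocks        : ' || m_full_blocks);
end;
/

A large number of blocks reported in the “free” buckets while inserts keep rejecting blocks would be consistent with the “reject db” figures above.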
Clearly there’s something funny going on with space management – and we need to look at a few blocks that are exhibiting the problems. But how do we find a few suitable blocks? And that’s where, finally, we get to the title of the piece.
We are in the lucky position of having “lost write protection” enabled – so the redo log file will hold lots of “block read records”. We can’t get the file and block addresses we need from the “cell physical read” wait events so let’s ask the redo log to supply them. We just have to pick a log file (online or archived) and tell Oracle to dump some of it – and we can probably get away with a fairly small dump since we want just a single type of redo record over a short period of time. Here’s an example showing the format of two slightly different commands you could execute:
alter system dump logfile '/u01/app/oracle/oradata/orcl12c/redo03.log'
        rba min 2781 1
        rba max 2781 1000
        layer 23 opcode 2
;

alter system dump redo
        scn min 19859991 scn max 19964263
        layer 23 opcode 2
;
The first command dumps a log file by name – but you may have to fiddle around a bit to find the name of an archived log file, because if you choose this option you need to know the sequence number (sequence# in v$log_history) of the file if you want to restrict the size of the dump. The second command simply dumps redo for (in this example) an SCN range – it’s easy to query v$log_history to find dates, times, and SCN ranges, and Oracle will work out for itself which file it has to access. In both cases I’ve restricted the dump to just those redo records that contain change vectors of type BRR (block read record), which is what the “layer 23 opcode 2” line is about.
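(For the second option, the SCN ranges and times are easy to find – a trivial sketch:)

select
        sequence#, first_change#, next_change#, first_time
from
        v$log_history
order by
        sequence#
;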
Here’s an example of a redo record that contains nothing but a single BRR. (It’s from a single-block read, a multi-block read would produce a redo record with multiple change vectors, one vector for each block read.)
REDO RECORD - Thread:1 RBA: 0x000add.00000019.01b0 LEN: 0x004c VLD: 0x10 CON_UID: 2846920952
SCN: 0x00000000025a7c13 SUBSCN:  1 05/23/2019 10:42:51
CHANGE #1 CON_ID:3 TYP:2 CLS:6 AFN:9 DBA:0x00407930 OBJ:40 SCN:0x00000000001a1e2a SEQ:2 OP:23.2 ENC:0 RBL:0 FLG:0x0000
 Block Read - afn: 9 rdba: 0x00407930 BFT:(1024,4225328) non-BFT:(1,31024)
              scn: 0x00000000001a1e2a seq: 0x02
              flags: 0x00000006 ( dlog ckval )
              where: qeilwh05: qeilbk
If you’re wondering about the two interpretations of the rdba (relative datablock address), one is for BFTs (big file tablespaces) and one for non-BFTs. The other thing you’ll notice about the interpretations is that neither file number (1024 or 1) matches the afn (absolute file number). In smaller, non-CDB databases you will probably find that the afn matches the file number in the non-BFT interpretation, but I happen to be testing on a PDB and the first file in my SYSTEM tablespace happens to be the 9th file created in the CDB – connecting as SYS in my PDB I can compare the absolute and “relative” file number very easily:
SQL> select file#, rfile#, name from v$datafile;

     FILE#     RFILE# NAME
---------- ---------- ----------------------------------------------------------------
         9          1 /u01/app/oracle/oradata/orcl12c/orcl/system01.dbf
        10          4 /u01/app/oracle/oradata/orcl12c/orcl/sysaux01.dbf
        11          9 /u01/app/oracle/oradata/orcl12c/orcl/undotbs01.dbf
        12         12 /u01/app/oracle/oradata/orcl12c/orcl/users01.dbf
        13         13 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1991375173370654.dbf
        14         14 /u01/app/oracle/oradata/orcl12c/orcl/APEX_1993195660370985.dbf
        22         22 /u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf
        23         23 /u01/app/oracle/oradata/orcl12c/orcl/test_8k.dbf

8 rows selected.
For bigfile tablespaces the “relative” file number is a complete fake and simply reports 1024 – you’re only allowed one file in a bigfile tablespace, so there is no “relativity” involved. (Unless you’re working at CERN and storing data about particle collisions in the LHC.)
The key point to remember when reading BRRs, then, is that you should take the file number from the afn and the block number from the (appropriate) interpretation of the rdba. For the example above I would issue: “alter system dump datafile 9 block 31024;”
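(If you don’t want to do the non-BFT decoding of an rdba in your head, dbms_utility will do the split for you – a quick sketch using the rdba from the example above, which should report relative file 1, block 31024:)

select
        dbms_utility.data_block_address_file( to_number('00407930','xxxxxxxx'))  rel_file#,
        dbms_utility.data_block_address_block(to_number('00407930','xxxxxxxx'))  block#
from
        dual
;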
Finally
The originator of the thread hasn’t yet made any public response to the suggestion of dumping and reviewing blocks – possibly they’ve started a private conversation with Stefan Koehler who had suggested a strategy that examined function calls rather than block contents – so we’re unable to do any further analysis on what’s going on behind the scenes.
What we would be looking for is any indication that Oracle is repeatedly re-reading the same bitmap blocks and the same data blocks (by a simple check of block addresses); and if that is the case we would want to get some clue about why that might be happening by examining the contents of the data blocks that are subject to repeated reads without changing their status in the bitmap from “space available” to “full”. As it is we just have to wait for the OP to tell us if they’ve made any further progress.