A few days ago I was rehearsing a presentation about how to investigate how Oracle works, and came across something surprising. Here’s a simple bit of code:
rem rem Script: expert.sql rem Author: Jonathan Lewis rem Dated: Mar 2010 rem create table t1 as with generator as ( select --+ materialize rownum id from dual connect by rownum <= 1e4 -- > comment to avoid wordpress format issue ) select rownum id, lpad(rownum,10,'0') small_vc, case mod(rownum,1000) when 0 then 'Y' else 'N' end flag from generator v1, generator v2 where rownum <= 1e6 -- > comment to avoid wordpress format issue ; -- gather table stats here execute snap_redo.start_snap execute snap_my_stats.start_snap update /*+ full(t1) */ t1 set flag = 'N' where flag = 'Y' ; execute snap_my_stats.end_snap execute snap_redo.end_snap
The two “snap” packages take a starting snapshot of v$mystat/v$statname and v$sysstat (the latter restricted to redo statistics only) and an ending snapshot, and report the change between snapshots. Here are the results for just the redo statistics from the update when I ran the code on 10.2.0.3:
Name Value ---- ----- redo entries 1,907 redo size 439,260 redo ordering marks 17 redo subscn max counts 18
You may at this point be wondering why the first two figures are roughly twice the size they ought to be; you might, of course, be wondering what you’re supposed to be seeing here (in which case you need to see my presentation “The Beginners’ Guide to becoming an Expert.”)
My code updates one row in every thousand and each row is in a separate block. That means I should see roughly one redo entry per row at roughly 250 bytes per redo entry. Since I’m using Oracle 10g (single instance) the total number of entries ought to be slightly less than 1,000 because the first 40 or 50 should be accumulated in the private redo and in-memory undo areas. So we have about 950 redo entries too many; where did they come from ?
The test takes about 60 seconds to run, so I repeated it taking snapshots of a couple more dynamic performance views – which didn’t actually help very much – and then ran it one more time after switching log files so that I could dump the log file and find the relevant redo entries easily after the update. I won’t bore you with an extract from the log file dump, I’ll just point out that it showed me two redo records per row – op code 11.4 followed by op code 11.5, which translate to: “lock row” and “update row” respectively – when it should simply have been showing nothing but “update row” records.
It took me a little while to figure out what was going on but the title of this piece probably gives it away. A few days earlier I had enabled audit (audit_trail = db_extended) on my laptop to demonstrate a point to a client – the redundant “lock row” records appeared as a side effect (which I hadn’t noticed at the client site because we were focussing on a completely different point – which I might write about next week).
Here’s the really important point – I had NOT enabled any audit options on this table, the only thing I had done was make it possible for Oracle to record audit records. The spare redo entries were NOT redo entries relating to the aud$ table; instead, the simple act of enabling audit introduced a completely spurious “lock row” record on every update I did.
So I changed audit_trail to ‘none’, bounced the database and repeated the test, and got back to the result that I had been expecting:
Name Value ---- ----- redo entries 961 redo size 295,996 redo ordering marks 13 redo subscn max counts 14
As you can see, we’re back to 1,000 redo records (minus the odd 40 or 50) at about 250 bytes per entry (allowing for the size of the first “private” redo record that combined lots of changes).
Note: Having observed this problem on 10.2.0.3 with audit_trail = db_extended I repeated the tests with audit_trail=db, and then tested a couple of older and newer versions of Oracle. The phenomenon appeared only on 10.2.0.3, and it didn’t matter whether I chose db or db_extended.
Footnote: the day after I made this discovery I gave what must have been my worst ever presentation: not only had I lost three hours sleep working out the details of this problem, I’d made a couple of “quick changes” to some of my demo scripts to do some extra checks and I had failed to clean up properly afterwards so I kept getting the “wrong” results and confusing myself in my live demonstrations.
Update 23/8/2011.
I’ve just fallen into the same trap at the NoCOUG conference. Since I’ve highlighted this note I thought I’d add a copy of one pair of redo records to show how I recognised what was going on.
CHANGE #4 TYP:0 CLS: 1 AFN:5 DBA:0x01800010 OBJ:88040 SCN:0x0000.04a7f3bd SEQ: 2 OP:11.4 KTB Redo op: 0x01 ver: 0x01 op: F xid: 0x0006.001.000052b0 uba: 0x00800844.0cad.1f KDO Op code: LKR row dependencies Disabled xtype: XA flags: 0x00000000 bdba: 0x01800010 hdba: 0x01800009 itli: 2 ispac: 0 maxfr: 4863 tabn: 0 slot: 21 to: 2 CHANGE #5 TYP:0 CLS: 1 AFN:5 DBA:0x01800010 OBJ:88040 SCN:0x0000.04a7f476 SEQ: 1 OP:11.5 KTB Redo op: 0x02 ver: 0x01 op: C uba: 0x00800844.0cad.20 KDO Op code: URP row dependencies Disabled xtype: XAxtype KDO_KDOM2 flags: 0x00000080 bdba: 0x01800010 hdba: 0x01800009 itli: 2 ispac: 0 maxfr: 4863 tabn: 0 slot: 21(0x15) flag: 0x2c lock: 2 ckix: 0 ncol: 3 nnew: 1 size: 0 Vector content: col 2: [ 1] 4e
Note particularly OP:11.4 is an LKR record (line 5), and OP:11.5 is a URP record (line 14) … viz: Lock Row, Update Row Piece.
Hi Jonathan,
>> Footnote […] I gave what must have been my worst ever presentation
No, don’t worry, it was a good presentation.
You proved that it is not so easy to match expected results with experimental observations, because there are so many parameters that can change the behaviour of such a complex system as Oracle is. But you have shown as well that an expert will not give up and look at the system as a black box. That he still can try to understand what happens, even if it takes a few hours…
Regards,
Franck.
Comment by Franck Pachot — May 29, 2011 @ 9:08 am BST May 29,2011 |
Franck,
Thanks for the comment.
I hope the rest of the audience viewed the presentation the same way. I have to say, when things “go wrong” I prefer it when I’ve planned for them to go wrong so that I don’t have to backtrack and contradict myself. Still, with hindsight I can say the audience did see the most important bit of the “science of Oracle”, the bit where someone says: “wait a minute – that can’t be right.”
Comment by Jonathan Lewis — June 2, 2011 @ 8:41 pm BST Jun 2,2011 |
Bug 5166745 Excessive redo generation with auditing enabled
Versions confirmed as being affected
* 10.2.0.1
* 10.2.0.2
* 10.2.0.3
This issue is fixed in
* 10.2.0.4 (Server Patch Set)
* 11.1.0.6 (Base Release)
This problem is introduced in 10g R2 (10.2.0.1).
Excessive redo generation occurs in 10.2 with standard auditing enabled
(around 15%-20% increase compared to 10.1/9.2)
Comment by Anonymous — June 3, 2011 @ 11:12 am BST Jun 3,2011 |
Thanks for the information. It’s nice to know that it’s so localised.
I noticed the document had a summary of the redo OPs – do you know if there’s a quick way to generate this summary. At present I just dump the log file and then grep my way through it – it would be convenient if there were a dump option that dumped just the one necessary line. (I really don’t want to have to write a C program to unravel the file, given the infrequency of the occasions when I need to do that type of analysis).
Comment by Jonathan Lewis — June 4, 2011 @ 10:29 am BST Jun 4,2011 |
[…] by session was a light-weight operation, think again. Interestingly, the auditing anomaly that I reported recently re-appears in the “audit by session” option. You will recall that when auditing is […]
Pingback by Audit Excess « Oracle Scratchpad — June 7, 2011 @ 6:20 pm BST Jun 7,2011 |
[…] anomaly may look familiar – but this time it’s nothing to do with the audit_trail, it just seems to be the way […]
Pingback by Correlation oddity « Oracle Scratchpad — December 19, 2011 @ 6:26 pm GMT Dec 19,2011 |