I’ve written a couple of articles in the past about the problems of ASSM spending a lot of time trying to find blocks with usable free space. Without doing a bit of rocket science with some x$ objects, or O/S tracing for the relevant calls, or enabling a couple of nasty events, it’s not easy proving that ASSM might be a significant factor in a performance problem – until you get to 12c Release 2 where a staggering number of related statistics appear in v$sysstat.
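A quick way to see these figures is a simple query against v$sysstat for the instance, or v$mystat joined to v$statname for your own session – a minimal sketch:

select  name, value
from    v$sysstat
where   name like 'ASSM%'
order by name
;

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     sn.name like 'ASSM%'
and     ms.value != 0
order by sn.name
;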
I’ve published the full list of statistics (without explanation) at the end of this note, but here’s just a short extract showing the changes in my session’s ASSM stats due to a little PL/SQL loop inserting 10,000 rows, one row at a time into an empty table with a single index:
Name                                  Value
----                                  -----
ASSM gsp:get free block                 185
ASSM cbk:blocks examined                185
ASSM gsp:L1 bitmaps examined            187
ASSM gsp:L2 bitmaps examined              2
ASSM gsp:Search hint                      2
ASSM gsp:good hint                      185
It looks like we’ve checked a couple of “level 2” bitmap blocks (one for the table, one for the index, presumably) to pick a sequence of “level 1” bitmap blocks that have been very good at taking us to a suitable data (table or index) block that can be used.
You might have expected to see numbers more like 10,000 in the output, but remember that PL/SQL has lots of little optimisations built into it and one of those is that it pins a few blocks while the anonymous block is running so it doesn’t have to keep finding blocks for every single row.
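For anyone who wants to reproduce the pattern, here’s a minimal sketch of the kind of test involved – the table definition, padding and commit placement are my assumptions rather than the exact original script:

create table t1 (id number, v1 varchar2(40));
create index t1_i1 on t1(id);

begin
        for i in 1..10000 loop
                insert into t1(id, v1) values (i, rpad('x',40));
        end loop;
        commit;
end;
/

rem  then re-run the v$mystat query above to see the session's ASSM deltas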
In comparison here’s the effect of the same data load when operated as 10,000 separate insert statements called from SQL*Plus:
Name                                  Value
----                                  -----
ASSM gsp:get free block              10,019
ASSM cbk:blocks examined             10,220
ASSM cbk:blocks marked full             201
ASSM gsp:L1 bitmaps examined         10,029
ASSM gsp:L2 bitmaps examined              6
ASSM gsp:L2 bitmap full                   1
ASSM gsp:Search all                       1
ASSM gsp:Search hint                      2
ASSM gsp:Search steal                     1
ASSM gsp:bump HWM                         1
ASSM gsp:good hint                   10,016
ASSM rsv:fill reserve                     1
It’s interesting to note that in this case we see (I assume) a few cases where we’ve done the check for an L1 bitmap block, gone to a data block that was apparently free, and discovered that our insert would take it over-full – hence the 201 “blocks marked full”.
Critically, of course, this is just another of the many little indications of how “client/server” chatter introduces lots of little bits of extra work when compared to the “Thick DB” approach.
One final set of figures. Going back to an example that first alerted me to the kind of performance catastrophe that ASSM could contribute to, I re-ran my test case on 12.2 and checked the ASSM figures reported. The problem was that a switch from a 4KB or 8KB block size to a 16KB block size produced a performance disaster. A version of my test case and some timing results are available on Greg Rahn’s site.
In my test case I have 830,000 rows and do an update that sets column2 to column1, changing it from null to an 8-digit value. With a 16KB block size and PCTFREE set to a highly inappropriate value (in this case the default of 10) this is what the new ASSM statistics look like:
Name                                  Value
----                                  -----
ASSM gsp:get free block             668,761
ASSM cbk:blocks examined            671,404
ASSM cbk:blocks marked full           2,643
ASSM gsp:L1 bitmaps examined      1,338,185
ASSM gsp:L2 bitmaps examined        672,413
ASSM gsp:Search all                     332
ASSM gsp:Search hint                668,760
ASSM gsp:Search steal                   332
ASSM gsp:bump HWM                       332
ASSM wasted db state change         669,395
I’d love to know what the figures would have looked like if they had been available in the original Oracle 9.2.0.8 case (my guess is that the “blocks examined” statistic would have been in the order of hundreds of millions); they look fairly harmless in this case even though the database (according to some of the other instance activity stats) did roughly 10 times the work you might expect from a perfect configuration.
Even here, though, where the original catastrophic bug has been addressed, the ASSM stats give you an important clue: we’ve been doing a simple update, so why have we been looking for free space at all (get free block)? Even stranger, how come we had to examine 1.3M L1 bitmaps when we’ve only updated 830,000 rows – surely the worst case shouldn’t have been worse than 1 to 1? And then there’s that “wasted db state change” – I don’t understand exactly what that last statistic is telling me, but when I’m worried about performance I tend to worry about anything that’s being wasted.
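For anyone who wants a feel for the shape of the test, here’s a rough sketch – this is not Greg’s script, and the tablespace name, row source and random 8-digit values are purely illustrative:

create table t1 (
        col1            number(8,0),
        col2            number(8,0)
)
pctfree 10                      -- the (inappropriate) default
tablespace test_16k             -- a tablespace with a 16KB block size
;

insert into t1 (col1, col2)
select
        trunc(dbms_random.value(1e7, 1e8)),
        null
from
        dual
connect by
        level <= 830000
;
commit;

execute dbms_stats.gather_table_stats(user,'t1')

update t1 set col2 = col1;
commit;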
In passing – if you want to insert a single row into an unindexed table you can expect Oracle to examine the segment header, then an L2 bitmap block, then an L1 bitmap block to find a data block for the insert. (In rare cases that might be segment header, L3, L2, L1.) There are then optimisation strategies for pinning blocks – the session will pin the L1 bitmap block briefly because it may have to check several of the data blocks it references (they may turn out to be full even though they are flagged as having space); similarly the session will pin the L2 bitmap block because it may need to mark an L1 bitmap block as full and move on to check another L1 block. The latter mechanism probably explains why we have examined more L1 bitmaps than L2 bitmaps.
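If you want to see what the space management bitmaps currently claim about a segment, rather than inferring their state from the session statistics, dbms_space.space_usage reports the number of blocks in each ASSM “freeness” bucket. A minimal sketch, assuming a table called t1 in your own schema:

set serveroutput on

declare
        v_unformatted_blocks    number;
        v_unformatted_bytes     number;
        v_fs1_blocks            number;         -- 0  - 25% free space
        v_fs1_bytes             number;
        v_fs2_blocks            number;         -- 25 - 50% free space
        v_fs2_bytes             number;
        v_fs3_blocks            number;         -- 50 - 75% free space
        v_fs3_bytes             number;
        v_fs4_blocks            number;         -- 75 - 100% free space
        v_fs4_bytes             number;
        v_full_blocks           number;
        v_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => v_unformatted_blocks,
                unformatted_bytes       => v_unformatted_bytes,
                fs1_blocks              => v_fs1_blocks,
                fs1_bytes               => v_fs1_bytes,
                fs2_blocks              => v_fs2_blocks,
                fs2_bytes               => v_fs2_bytes,
                fs3_blocks              => v_fs3_blocks,
                fs3_bytes               => v_fs3_bytes,
                fs4_blocks              => v_fs4_blocks,
                fs4_bytes               => v_fs4_bytes,
                full_blocks             => v_full_blocks,
                full_bytes              => v_full_bytes
        );
        dbms_output.put_line('Full blocks:         ' || v_full_blocks);
        dbms_output.put_line('75-100% free blocks: ' || v_fs4_blocks);
        dbms_output.put_line('50- 75% free blocks: ' || v_fs3_blocks);
        dbms_output.put_line('25- 50% free blocks: ' || v_fs2_blocks);
        dbms_output.put_line(' 0- 25% free blocks: ' || v_fs1_blocks);
end;
/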
Finally, the full monty
Just a list of all the instance statistics that start with “ASSM”:
ASSM bg: segment fix monitor
ASSM bg:create segment fix task
ASSM bg:mark segment for fix
ASSM bg:slave compress block
ASSM bg:slave fix one segment
ASSM bg:slave fix state
ASSM cbk:blocks accepted
ASSM cbk:blocks examined
ASSM cbk:blocks marked full
ASSM cbk:blocks rejected
ASSM fg: submit segment fix task
ASSM gsp:Alignment unavailable in space res
ASSM gsp:L1 bitmaps examined
ASSM gsp:L2 bitmap full
ASSM gsp:L2 bitmaps examined
ASSM gsp:L3 bitmaps examined
ASSM gsp:Optimized data block rejects
ASSM gsp:Optimized index block rejects
ASSM gsp:Optimized reject DB
ASSM gsp:Optimized reject l1
ASSM gsp:Optimized reject l2
ASSM gsp:Search all
ASSM gsp:Search hint
ASSM gsp:Search steal
ASSM gsp:add extent
ASSM gsp:blocks provided by space reservation
ASSM gsp:blocks rejected by access layer callback
ASSM gsp:blocks requested by space reservation
ASSM gsp:bump HWM
ASSM gsp:get free block
ASSM gsp:get free critical block
ASSM gsp:get free data block
ASSM gsp:get free index block
ASSM gsp:get free lob block
ASSM gsp:good hint
ASSM gsp:reject L1
ASSM gsp:reject L2
ASSM gsp:reject L3
ASSM gsp:reject db
ASSM gsp:space reservation success
ASSM gsp:use space reservation
ASSM rsv:alloc from reserve
ASSM rsv:alloc from reserve fail
ASSM rsv:alloc from reserve succ
ASSM rsv:clear reserve
ASSM rsv:fill reserve
ASSM wasted db state change
Hi Jonathan,
we are seeing an intermittent issue with somewhat similar symptoms. Single-row inserts into one of the tables occasionally take up to 2 minutes. During that time they are doing a large number (up to 100k) of single-block reads from a unique index on that table.
The tricky part is that this table lives in an MSSM tablespace.
Do you think the basic mechanism behind it could still be the same? The database version is 11.2.0.4.
Thanks!
Best regards,
Nikolai
Comment by Nikolai Savvinov — January 30, 2017 @ 12:54 pm GMT
Nikolai,
My first thought would be a very old problem with leaf node splits re-using the wrong “free” block from an index, but there is also an ASSM variation of the problem (not that that could be your case, since your tablespace is MSSM).
Option 1: Do you have any code which does very large deletes, leaving lots of index blocks empty; or any code that could insert lots of rows (with consecutive PK values) in a batch and then roll back? In either case you could have a lot of empty blocks which Oracle decides to use as the next block for re-use in an index leaf block split, but then finds (after doing a lot of the work) that for some reason it can’t use that block, and tries many more blocks one after the other. This shouldn’t happen any more, but maybe it can – symptoms: do you see (a very small number of) “recursive aborts on index block reclamation” in the instance statistics; do lots of other sessions wait on TX mode 4 while the delay is going on? (There’s a sketch of both checks after option 2 below.)
Option 2: A special ASSM case – after a very large array insert of consecutive key values you might have a lot of leaf blocks which are full but marked as free by the bitmap blocks. This used to be possible with tables, and I’ve not seen it with indexes, but if it happens then when Oracle needs an empty block it might end up walking the bitmap and finding that the blocks were full. This, I think, is where 12.2 would show you “blocks marked full” climbing unexpectedly. Pre-12c I would check the blocks being read to see if they were leaf blocks or bitmap blocks; a treedump might also be revealing if a couple of those blocks were in an area of the index with a lot of blocks showing nrow = 0, rrow = 0.
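For reference, the pieces needed for those checks are all standard – the index name is obviously a placeholder, and the treedump event is an unofficial diagnostic so it’s best kept for a test system or used with care:

rem  Has the statistic moved at all since instance startup?
select  name, value
from    v$sysstat
where   name = 'recursive aborts on index block reclamation'
;

rem  Are other sessions queueing on TX enqueues while the insert is slow?
select  sid, event, state, seconds_in_wait
from    v$session
where   event like 'enq: TX%'
;

rem  Treedump of the suspect index (written to the session trace file)
column object_id new_value m_object_id
select object_id from user_objects where object_name = '&index_name';

alter session set events 'immediate trace name treedump level &m_object_id';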
UPDATE:
I’ve found a blog note that I wrote about the 1st problem here: https://jonathanlewis.wordpress.com/2008/09/29/index-analysis-2/
And a variant on the problem here: https://jonathanlewis.wordpress.com/2009/11/10/index-freelists/
Comment by Jonathan Lewis — January 30, 2017 @ 1:19 pm GMT
Hi Jonathan,
thanks a lot for your detailed reply. The second link mentions a blog article by Saibabu that provides a reproducible test case which fits our scenario (MSSM, and the waits are on the index, not the table). I’m going to experiment a little with it.
To answer your question about deletes/rollbacks: yes, both are possible, although it’s hard to tell for sure whether either actually took place when the problem was last observed (which is over 2 weeks ago). I plotted “recursive aborts on index block reclamation” to see if we had any spikes around that time, but found none. I also did the same for several different split statistics, with the same result. The only statistics that were appreciably increased within the right time frame and could be relevant were ones having to do with segment chunks allocation and segment preallocation, but I’m not quite sure what to make of that.
We deployed a script which turns on SQL tracing if the given INSERT statement takes more than 7 seconds; hopefully that will give us some further insight.
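(For anyone else reading with the same problem, a generic sketch of how such a script might work – this isn’t necessarily our exact script, and the sql_id and 7-second threshold are placeholders – is to poll v$session from a scheduled job and enable extended tracing with dbms_monitor when the statement has been in its current call for too long:)

begin
        for r in (
                select  sid, serial#
                from    v$session
                where   sql_id       = '&problem_sql_id'   -- the slow INSERT
                and     status       = 'ACTIVE'
                and     last_call_et > 7                    -- seconds in current call
        ) loop
                dbms_monitor.session_trace_enable(
                        session_id => r.sid,
                        serial_num => r.serial#,
                        waits      => true,
                        binds      => false
                );
        end loop;
end;
/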
Thanks again for your help.
Best regards,
Nikolai
Comment by Nikolai Savvinov — January 31, 2017 @ 9:53 am GMT
Nikolai,
I haven’t worked with this problem recently so things may have changed, but the “recursive abort” statistic didn’t change by much – a single problem event might result in several hundred blocks being manipulated and then discarded (which, by the way, would also generate a spike in redo – roughly 18KB * number of blocks) but it would only count as ONE recursive abort.
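(To put some illustrative numbers on that: if one such event manipulated and discarded, say, 500 blocks, the statistic would go up by just 1 while the redo generated would be roughly 500 * 18KB, i.e. about 9MB – which is why the redo spike is likely to be far more visible than the statistic. The 500 is purely a made-up figure.)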
Comment by Jonathan Lewis — February 1, 2017 @ 10:21 am GMT
Hi Jonathan,
thanks — that agrees with my findings that I shared in my blog yesterday (the entire problem being produced by a single event, e.g. by a single 90-10 split).
I wonder if it’s possible to detect the spike in redo generation. Probably not from AWR data (there’s too much other redo in a 30 minute snapshot anyway), but perhaps by looking at the DELTA…IO… columns in ASH (for the LGWR session). Will try it if I find time.
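(A sketch of the kind of ASH query I have in mind – the delta columns exist from 11gR2 onwards, and the 30-minute window is just a placeholder:)

select  sample_time,
        delta_write_io_requests,
        delta_write_io_bytes
from    v$active_session_history
where   session_id in (
                select sid from v$session where program like '%(LGWR)%'
        )
and     sample_time > systimestamp - interval '30' minute
order by
        sample_time
;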
Best regards,
Nikolai
Comment by savvinov — February 1, 2017 @ 10:28 am GMT
[…] https://jonathanlewis.wordpress.com/2017/01/30/assm-help/ , […]
Pingback by Long-running INSERT – Oracle Diagnostician — January 31, 2017 @ 8:51 pm GMT