Oracle Scratchpad

December 16, 2019

IOT Bug

Filed under: Bugs,IOT,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 3:58 pm GMT Dec 16,2019

Bug fixed by 19.11.0.0 – but new defect found (jump to update)

Here’s a worrying bug that showed up a couple of days ago on the Oracle-L mailing list. It’s a problem that I’ve tested against 12.2.0.1 and 19.3.0.0 – it may be present on earlier versions of Oracle. One of the nastiest things about it is that you might not notice it until you get an “out of space” error from the operating system. You won’t get any wrong results from it, but it may well be adding an undesirable performance overhead.

Basically it seems that (under some circumstances, at least) Oracle is setting the “block guess” component of secondary indexes on Index Organized Tables (IOTs) to point to blocks in the overflow segment instead of blocks in the primary key segment. As a result, when you execute a query that accesses the IOT through the secondary index and has to do reads from disc to satisfy the query – your session goes through the following steps:

  • Identify index entry from secondary index – acquire “block guess”
  • Read indicated block and discover the object number on the block is wrong, and the block type is wrong
  • Record a (silent) ORA-01410 error and write a block dump into the trace file
  • Use the “logical rowid” from the secondary index (i.e. the stored primary key value) to access the primary key index by key value

So your query runs to completion and you get the right result because Oracle eventually gets there using the primary key component stored in the secondary index, but it always starts with the guess [see footnote], and for every block you read into the cache because of the guess you get a dump to the trace file.

Here’s a little code to demonstrate. The problem with this code is that everything appears to work perfectly; you have to be able to find the trace file for your session to see what’s gone wrong. First we create some data – this code is largely copied from the original posting on Oracle-L, with a few minor changes:


rem
rem     Script:         iot_bug_12c.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem
rem     Notes
rem     The original author had tested on 19.5.0.0 to get the same effect, see:
rem     //www.freelists.org/post/oracle-l/IOT-cannot-get-valid-consensus-bug-or-unexplained-behavior
rem

drop table randomload purge;

create table randomload(
        roll number,
        name varchar2(40),
        mark1 number,
        mark2 number,
        mark3 number,
        mark4 number,
        mark5 number,
        mark6 number,
        primary key (roll)
) 
organization index 
including mark3 overflow
;

create index randomload_idx on randomload(mark6);

insert into randomload 
select 
        rownum, 
        dbms_random.string(0,40) name, 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,100)), 
        round(dbms_random.value(0,10000)) 
from 
        dual 
connect by 
        level < 1e5 -- > comment to avoid wordpress format issue
;

commit;

exec dbms_stats.gather_table_stats(null,'randomload', cascade=>true);

prompt  ==================================================
prompt  pct_direct_access should be 100 for randomload_idx
prompt  ==================================================

select 
        table_name, index_name, num_rows, pct_direct_access, iot_redundant_pkey_elim  
from 
        user_indexes
where
        table_name = 'RANDOMLOAD'
;

It should take just a few seconds to build the data set and you should check that the pct_direct_access is 100 for the index called randomload_idx.

The next step is to run a query that will do an index range scan on the secondary index.

 
column mark6 new_value m_6

select 
        mark6, count(*) 
from
        randomload 
group by 
        mark6
order by 
        count(*)
fetch first 5 rows only
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';
set serveroutput off

select avg(mark3) 
from 
        randomload 
where 
        mark6 = &m_6
;

select * from table(dbms_xplan.display_cursor);

alter session set events '10046 trace name context off';
set serveroutput on

I’ve started by selecting one of the least frequently occurring values of mark6 (a column I know to be in the overflow), capturing it in the substitution variable m_6; then I’ve flushed the buffer cache so that any access I make to the data will have to start with disk reads (the original poster suggested restarting the database at this point, but that’s not necessary).

Then I’ve enabled sql_trace to show wait states (to capture details of what blocks were read and which object they belong to), and I’ve run a query for mark3 (a column that is in the primary key (TOP) segment of the IOT) and pulled its execution plan from memory to check that the query did use a range scan of the secondary index. Here’s the plan:

----------------------------------------------------------------------------------------
| Id  | Operation          | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                   |       |       |    11 (100)|          |
|   1 |  SORT AGGREGATE    |                   |     1 |     7 |            |          |
|*  2 |   INDEX UNIQUE SCAN| SYS_IOT_TOP_77298 |    10 |    70 |    11   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN| RANDOMLOAD_IDX    |    10 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("MARK6"=1316)
   3 - access("MARK6"=1316)

As you can see the plan shows what we are hoping to see – an index range scan of the secondary index that lets it follow up with a unique scan of the primary key segment. It’s just a little odd that the access predicate reported for operation 2 (the unique scan of IOT_TOP) suggests that the access is on a column that isn’t in the primary key and isn’t even in the IOT_TOP segment.

So the query works and gives the right answer. But what do we find in the trace directory? If you’re running 12c (possibly only relevant to 12.2), each time the error occurs the following pattern of information will be written to the alert log (it didn’t appear in 19.3):


ORCL(3):Hex dump of (file 22, block 16747) in trace file /u01/app/oracle/diag/rdbms/orcl12c/orcl12c/trace/orcl12c_ora_7888.trc
ORCL(3):
ORCL(3):Corrupt block relative dba: 0x0580416b (file 22, block 16747)
ORCL(3):Bad header found during multiblock buffer read (logical check)
ORCL(3):Data in bad block:
ORCL(3): type: 6 format: 2 rdba: 0x0580416b
ORCL(3): last change scn: 0x0000.0b86.0e36484c seq: 0x1 flg: 0x06
ORCL(3): spare3: 0x0
ORCL(3): consistency value in tail: 0x484c0601
ORCL(3): check value in block header: 0x4408
ORCL(3): computed block checksum: 0x0
ORCL(3):

And the following pattern of information is written to the trace file [Update: a follow-up test on 11.2.0.4 suggests that the basic “wrong block address” error also happens in that version of Oracle, but doesn’t result in a dump to the trace file]:


kcbzibmlt:: encounter logical error ORA-1410, try re-reading from other mirror..
cursor valid? 1 makecr 0 line 15461 ds_blk (22, 16747) bh_blk (22, 16747)
kcbds 0x7ff1ca8c0b30: pdb 3, tsn 8, rdba 0x0580416b, afn 22, objd 135348, cls 1, tidflg 0x8 0x80 0x0
    dsflg 0x108000, dsflg2 0x0, lobid 0x0:0, cnt 0, addr 0x0, exf 0x10a60af0, dx 0x0, ctx 0
    whr: 'qeilwh03: qeilbk'
env [0x7ff1ca8e3e54]: (scn: 0x00000b860e364893   xid: 0x0000.000.00000000  uba: 0x00000000.0000.00  statement num=0  parent xid:  0x0000.000.00000000  st-scn: 0x0000000000000000  hi-scn: 0x0000000000000000  ma-scn: 0x00000b860e364879  flg: 0x00000660)
BH (0xb1fd6278) file#: 22 rdba: 0x0580416b (22/16747) class: 1 ba: 0xb1c34000
  set: 10 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 763,14
  dbwrid: 0 obj: 135348 objn: 135348 tsn: [3/8] afn: 22 hint: f
  hash: [0x9eff0528,0x77cff808] lru: [0xb1fd2578,0x9ff84658]
  ckptq: [NULL] fileq: [NULL]
  objq: [0xb6f654c0,0x9ff84680] objaq: [0xb6f654d0,0x9ff84690]
  use: [0x77b78128,0x77b78128] wait: [NULL]
  st: READING md: EXCL tch: 0
  flags: only_sequential_access
  Printing buffer operation history (latest change first):
  cnt: 5
  01. sid:10 L122:zgb:set:st          02. sid:10 L830:olq1:clr:WRT+CKT
  03. sid:10 L951:zgb:lnk:objq        04. sid:10 L372:zgb:set:MEXCL
  05. sid:10 L123:zgb:no:FEN          06. sid:10 L083:zgb:ent:fn
  07. sid:08 L192:kcbbic2:bic:FBD     08. sid:08 L191:kcbbic2:bic:FBW
  09. sid:08 L604:bic2:bis:REU        10. sid:08 L190:kcbbic2:bic:FAW
  11. sid:08 L602:bic1_int:bis:FWC    12. sid:08 L822:bic1_int:ent:rtn
  13. sid:08 L832:oswmqbg1:clr:WRT    14. sid:08 L930:kubc:sw:mq
  15. sid:08 L913:bxsv:sw:objq        16. sid:08 L608:bxsv:bis:FBW
Hex dump of (file 22, block 16747)

   ... etc.

Corrupt block relative dba: 0x0580416b (file 22, block 16747)
Bad header found during multiblock buffer read (logical check)
Data in bad block:
 type: 6 format: 2 rdba: 0x0580416b
 last change scn: 0x0000.0b86.0e36484c seq: 0x1 flg: 0x06
 spare3: 0x0
 consistency value in tail: 0x484c0601
 check value in block header: 0x4408
 computed block checksum: 0x0
TRCMIR:kcf_reread     :start:  16747:0:/u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf
TRCMIR:kcf_reread     :done :  16747:0:/u01/app/oracle/oradata/orcl12c/orcl/test_8k_assm.dbf

The threat, of course, is the bit I’ve removed and replaced with just “etc.”: it’s a complete block dump (raw and symbolic) which in my example was something like 500 lines and 35KB in size.
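
If you want to find – and keep an eye on the size of – the trace file for your own session while reproducing this, here’s a minimal sketch (assuming a version recent enough to have v$diag_info):

select  name, value
from    v$diag_info
where   name in ('Default Trace File', 'Diag Trace')
;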

It’s not immediately obvious exactly what’s going on and why but the 10046 trace helps a little. From another run of the test (on 19.3.0.0) I got the following combination of details – which is an extract showing the bit of the wait state trace leading into the start of the first block dump:

WAIT #140478118667016: nam='db file scattered read' ela= 108 file#=13 block#=256 blocks=32 obj#=77313 tim=103574529210
WAIT #140478118667016: nam='db file scattered read' ela= 2236 file#=13 block#=640 blocks=32 obj#=77313 tim=103574531549
WAIT #140478118667016: nam='db file scattered read' ela= 534 file#=13 block#=212 blocks=32 obj#=77312 tim=103574532257
kcbzibmlt: encounter logical error ORA-1410, try re-reading from other mirror..
cursor valid? 1 warm_up abort 0 makecr 0 line 16082 ds_blk (13, 212) bh_blk (13, 212)

Object 77313 is the secondary index, object 77312 is the primary key index (IOT_TOP). It may seem a little odd that Oracle is using db file scattered reads of 32 blocks to read the indexes but this is a side effect of flushing the buffer – Oracle may decide to pre-fetch many extra blocks of an object to “warmup” the cache (statistic “physical reads prefetch warmup”) just after an instance startup or a flush of the buffer cache. The thing I want to check, though, is what’s wrong with the blocks that Oracle read from object 77312:


alter system dump datafile 13 block min 212 block max 243;

BH (0xc8f68e68) file#: 13 rdba: 0x034000d4 (13/212) class: 1 ba: 0xc8266000
  set: 10 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xa7fd6c38) file#: 13 rdba: 0x034000d4 (13/212) class: 1 ba: 0xa7c2a000
  set: 12 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xa5f75780) file#: 13 rdba: 0x034000d5 (13/213) class: 0 ba: 0xa5384000
  set: 11 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

BH (0xdafe9220) file#: 13 rdba: 0x034000d5 (13/213) class: 1 ba: 0xdadcc000
  set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77311 objn: 77311 tsn: [3/6] afn: 13 hint: f

...

I’ve reported the first few lines of the symbolic dump for the first few blocks of the resulting trace file. Look at the third line of each group of BH lines: it’s reporting object 77311 (the overflow segment), not 77312 (the IOT_TOP segment). And every single block reported in the db file scattered read of 32 blocks for object 77312 reports itself, when dumped, as being part of object 77311. And that’s possibly the immediate cause of the ORA-01410.
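
As a cross-check you could also ask the data dictionary which segment really owns one of those blocks – a hedged sketch, assuming you have access to dba_extents; it should agree with the obj values reported in the buffer headers:

select  owner, segment_name, segment_type
from    dba_extents
where   file_id = 13
and     212 between block_id and block_id + blocks - 1
;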

We can take the investigation a little further by dumping a leaf block or two from the secondary index.


alter session set events 'immediate trace name treedump level 77313';

----- begin tree dump
branch: 0x3400104 54526212 (0: nrow: 542, level: 1)
   leaf: 0x340010d 54526221 (-1: row:278.278 avs:2479)
   leaf: 0x340075e 54527838 (0: row:132.132 avs:5372)
   leaf: 0x34005fb 54527483 (1: row:41.41 avs:7185)

alter system dump datafile 13 block 1886   -- leaf: 0x340075e

BH (0xd5f5d090) file#: 13 rdba: 0x0340075e (13/1886) class: 1 ba: 0xd5158000
  set: 9 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,15
  dbwrid: 0 obj: 77313 objn: 77313 tsn: [3/6] afn: 13 hint: f
...
row#6[5796] flag: K------, lock: 0, len=18
col 0; len 2; (2):  c1 1d
col 1; len 4; (4):  c3 07 41 5c
tl: 8 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 4]  03 40 05 7c

I’ve done a treedump of the secondary index, picked a leaf block address from the treedump, and dumped that leaf block; from that leaf block I’ve extracted one index entry to show you the three components: the key value (c1 1d), the primary key for the row (c3 07 41 5c), and the block guess (03 40 05 75). Read the block guess as a 4 byte hex number and it translates to file 13, block 1397 – which should belong to the IOT_TOP segment. So the exciting question is – what object does block (13, 1397) think it belongs to?


alter system dump datafile 13 block 1397;

Block header dump:  0x03400575
 Object id on Block? Y
 seg/obj: 0x12dff  csc:  0x00000b860e308c46  itc: 2  flg: E  typ: 1 - DATA
     brn: 0  bdba: 0x3400501 ver: 0x01 opc: 0
     inc: 0  exflg: 0

Converting from Hex to Decimal: obj: 0x12dff = 77311 which is the overflow segment. The secondary index block guess is pointing at a block in the overflow segment.
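
If you’d rather not do the conversions by hand, here’s a minimal sketch that decodes the hex object id and splits the 4-byte block guess into its file and block components (the literals are the values from my trace):

select
        to_number('12dff','xxxxx')                                              object_id,
        dbms_utility.data_block_address_file(to_number('03400575','xxxxxxxx'))  file_no,
        dbms_utility.data_block_address_block(to_number('03400575','xxxxxxxx')) block_no
from
        dual
;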

There are two ways to handle this problem – you could simply rebuild the index (alter index rebuild) or, as the original poster did, use the “update block references” command to correct all the block guesses: “alter index randomload_idx update block references;”. Neither is desirable, but if you’re seeing a lot of large trace files following the pattern above then it may be necessary.
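
For reference, here’s a sketch of the two repair options in SQL, followed by a re-gather of the index stats and a re-check of pct_direct_access (which should come back as 100 once the guesses have been corrected):

alter index randomload_idx update block references;
-- or:  alter index randomload_idx rebuild;

exec dbms_stats.gather_index_stats(null, 'randomload_idx')

select  index_name, pct_direct_access
from    user_indexes
where   table_name = 'RANDOMLOAD'
;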

There was one particular inconsistency in the tests – which I ran many times – occasionally the pct_direct_access for the secondary index would be reported as zero (which, technically, should always happen given the error).  If it did, of course, Oracle wouldn’t follow the guess but would go straight to the step where it used the primary key “logical rowid” – thus bypassing the error and block dump.

tl;dr

In some circumstances the block guesses in the secondary indexes of IOTs may be pointing to the overflow segment instead of the primary key (TOP) segment. If this happens then queries will still run and give the right answers, but whenever they read a “guessed” block from disc they will report an ORA-01410 error and dump a block trace. This will affect performance and may cause space problems at the O/S level.

Footnote

An entry in the secondary index of an Index Organized Table (IOT) consists of three parts, which initially we can think of in the form:

({key-value}, {logical rowid}, {block guess})

Since IOTs don’t have real rowids the “logical rowid” is actually the primary key of the row where the {key value} will be found. As a short cut for efficient execution Oracle includes the block address (4 bytes) where that primary key value was stored when the row was inserted. Because an IOT is an index, “rows” in the IOT can move as new data is inserted and leaf blocks split, so eventually any primary key may move to a different block – this is why we refer to the block address as a guess – a few days, hours, or minutes after you’ve inserted the row the block address may no longer be correct.

To help the runtime engine do the right thing Oracle collects a statistic called pct_direct_access for secondary indexes of IOTs. This is a measure of what percentage of the block guesses are still correct at the time that the statistics are gathered. If this value is high enough the run-time engine will choose to try using the block guesses while executing a query (falling back to using the logical rowid if it turns out that the guess is invalid), but if the value drops too low the optimizer will ignore the block guesses and only use the logical rowid.

Not relevant to this note – but a final point about secondary indexes and logical rowids – if the definition of the index includes some of the columns from the primary key Oracle won’t store those columns twice (in more recent versions, that is) – the code is clever enough to use the values stored in the (key value) component when it needs to use the (logical rowid) component.

 

Update (Jan 2020)

I passed this example on to Oracle, and there are now two (non-visible) bugs recorded for it:

  • Bug 30733525 – ALERT LOG ENTRIES RE BLOCK GUESSES IN THE SECONDARY INDEXES OF IOTS POINTING TO OVERFLOW SEGMENT INSTEAD OF INDEX SEGMENT
  • Bug 30733563 – WRONG GUESS DBA IN INDEX

Update (Nov 2022)

Checking the alert log immediately after starting up an instance of 19.11.0.0 I can see, in the section (Dumping current patch information), the bug number 30733563 reported in the list of bugs fixed by Patch Id: 32545013 (the 19.11.0.0 patch).
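
If you don’t want to bounce an instance just to re-read that section of the alert log, a hedged alternative is to query the patch registry from SQL – it reports the release updates that have been applied, though not the individual bug numbers they fix:

select  patch_id, action, status, description
from    dba_registry_sqlpatch
order by
        action_time
;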

I’ve also repeated the test using 21.3.0.0 (the most recent patch I had at hand), and the results of the test suggested that it didn’t suffer from the same bug. Unfortunately I noticed another oddity (that hadn’t been present in the 19.3/12.2 run) in the trace file, and checking back to the 19.11.0.0 trace file the oddity was there as well.

Here’s the set of waits due to the query, taken from the two trace files, 19.11 first then 21.3:


WAIT #139944562146024: nam='db file sequential read' ela= 26 file#=36 block#=260 blocks=1 obj#=156326 tim=8771776217
WAIT #139944562146024: nam='db file sequential read' ela= 7 file#=36 block#=358 blocks=1 obj#=156326 tim=8771776320
WAIT #139944562146024: nam='db file sequential read' ela= 6 file#=36 block#=1693 blocks=1 obj#=156325 tim=8771776357
WAIT #139944562146024: nam='db file scattered read' ela= 144 file#=36 block#=1312 blocks=32 obj#=156324 tim=8771780673


WAIT #139623483390184: nam='db file sequential read' ela= 21 file#=13 block#=1915 blocks=1 obj#=87942 tim=1529853216
WAIT #139623483390184: nam='db file sequential read' ela= 9 file#=13 block#=9725 blocks=1 obj#=87942 tim=1529853286
WAIT #139623483390184: nam='db file sequential read' ela= 7 file#=13 block#=11083 blocks=1 obj#=87941 tim=1529853337
WAIT #139623483390184: nam='db file scattered read' ela= 78 file#=13 block#=10736 blocks=8 obj#=87940 tim=1529857981

In both cases the object number (obj#=) shows me:

  • Two single block reads from the secondary index
  • One single block read from the primary key (IOT_TOP) index
  • A multi-block read from the overflow

You’ll notice that Oracle is not doing the multi-block (warmup) reads of the two indexes that it had been doing in the older versions; but the multi-block read of the overflow segment is redundant. The query is for a column which, by virtue of the including clause of the table, is in the IOT_TOP segment, so the code should be capable of avoiding the visit to the overflow.
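
To save cross-checking by eye, here’s a hedged helper that translates the obj# values from the 10046 trace into segment names (the values below are the ones from my 21.3 trace):

select  object_id, object_name, object_type
from    dba_objects
where   object_id in (87940, 87941, 87942)
;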

My test case selected just one row from the table, so I thought I’d select for a different value of mark6 to see what happened if I returned more rows scattered randomly through the overflow.  In the 21.3 test the predicate “mark6 = 1312” returned 14 rows – here’s the trace file extract for the disk waits (again preceded by a flush of the buffer cache).


WAIT #139727331718368: nam='SQL*Net message to client' ela= 4 driver id=1952673792 #bytes=1 p3=0 obj#=87942 tim=4367518599
WAIT #139727331718368: nam='db file scattered read' ela= 7938 file#=13 block#=9816 blocks=8 obj#=87941 tim=4367526882
WAIT #139727331718368: nam='db file sequential read' ela= 869 file#=13 block#=9709 blocks=1 obj#=87940 tim=4367528218
WAIT #139727331718368: nam='db file sequential read' ela= 25 file#=13 block#=9821 blocks=1 obj#=87941 tim=4367528479
WAIT #139727331718368: nam='db file sequential read' ela= 812 file#=13 block#=9868 blocks=1 obj#=87940 tim=4367529392
WAIT #139727331718368: nam='db file scattered read' ela= 935 file#=13 block#=9848 blocks=8 obj#=87941 tim=4367530560
WAIT #139727331718368: nam='db file scattered read' ela= 725 file#=13 block#=9869 blocks=3 obj#=87940 tim=4367531477
WAIT #139727331718368: nam='db file scattered read' ela= 1153 file#=13 block#=9776 blocks=8 obj#=87941 tim=4367532887
WAIT #139727331718368: nam='db file sequential read' ela= 926 file#=13 block#=9890 blocks=1 obj#=87940 tim=4367534087
WAIT #139727331718368: nam='db file scattered read' ela= 1073 file#=13 block#=9768 blocks=8 obj#=87941 tim=4367535342
WAIT #139727331718368: nam='db file sequential read' ela= 858 file#=13 block#=9900 blocks=1 obj#=87940 tim=4367536501
WAIT #139727331718368: nam='db file scattered read' ela= 5982 file#=13 block#=10232 blocks=8 obj#=87941 tim=4367542650
WAIT #139727331718368: nam='db file sequential read' ela= 1047 file#=13 block#=9914 blocks=1 obj#=87940 tim=4367544052
WAIT #139727331718368: nam='db file scattered read' ela= 1000 file#=13 block#=10144 blocks=8 obj#=87941 tim=4367545254
WAIT #139727331718368: nam='db file sequential read' ela= 722 file#=13 block#=9926 blocks=1 obj#=87940 tim=4367546176
WAIT #139727331718368: nam='db file scattered read' ela= 943 file#=13 block#=10328 blocks=8 obj#=87941 tim=4367547275
WAIT #139727331718368: nam='db file sequential read' ela= 767 file#=13 block#=9937 blocks=1 obj#=87940 tim=4367548341
WAIT #139727331718368: nam='db file scattered read' ela= 973 file#=13 block#=10272 blocks=8 obj#=87941 tim=4367549561
WAIT #139727331718368: nam='db file sequential read' ela= 642 file#=13 block#=9960 blocks=1 obj#=87940 tim=4367550383
WAIT #139727331718368: nam='db file scattered read' ela= 892 file#=13 block#=10280 blocks=8 obj#=87941 tim=4367551481
WAIT #139727331718368: nam='db file scattered read' ela= 753 file#=13 block#=9961 blocks=7 obj#=87940 tim=4367552395
WAIT #139727331718368: nam='db file scattered read' ela= 1271 file#=13 block#=10600 blocks=8 obj#=87941 tim=4367553876
WAIT #139727331718368: nam='db file sequential read' ela= 1145 file#=13 block#=10731 blocks=1 obj#=87940 tim=4367555305
WAIT #139727331718368: nam='db file scattered read' ela= 779 file#=13 block#=10960 blocks=8 obj#=87941 tim=4367556371
WAIT #139727331718368: nam='db file sequential read' ela= 569 file#=13 block#=10714 blocks=1 obj#=87940 tim=4367557135
WAIT #139727331718368: nam='db file scattered read' ela= 730 file#=13 block#=11080 blocks=8 obj#=87941 tim=4367557962
WAIT #139727331718368: nam='db file sequential read' ela= 384 file#=13 block#=10630 blocks=1 obj#=87940 tim=4367558471
WAIT #139727331718368: nam='db file scattered read' ela= 597 file#=13 block#=11128 blocks=8 obj#=87941 tim=4367559150
WAIT #139727331718368: nam='db file sequential read' ela= 373 file#=13 block#=10664 blocks=1 obj#=87940 tim=4367559606

The multiblock prefetches on the IOT_TOP segment (obj# = 87941) aren’t unreasonable – although they weren’t actually recorded as prefetches or warmups. But the 14 single block reads of the overflow segment are most unreasonable.

I’m not going to dig any deeper into this – but at some stage I (or someone) will have to go back to earlier versions of Oracle to see if Oracle has always been doing redundant reads of overflow segments, or whether this is a new defect introduced in recent versions of Oracle.

Arguably Oracle has to visit the overflow segment because sometimes the columns kept in the IOT_TOP segment are so long that it’s not possible to keep all the included columns there – but the reads should surely be an exception rather than the rule; anything else changes the dynamics (and relevance) of IOTs dramatically.

October 31, 2019

IOT Hash

Filed under: Execution plans,Hash Join,Infrastructure,IOT,Joins,Oracle,Troubleshooting — Jonathan Lewis @ 2:59 pm GMT Oct 31,2019

It’s another of my double-entendre titles. The optimizer can turn a hash join involving an index-organized table into a real performance disaster (though you may have to help it along the way by using a silly definition for your primary key columns). This post was inspired by a question posted on the Oracle Developer Community forum recently so the table and column names I’ve used in my model reflect (almost, with a few corrections) the names used in the post.

We start with a simple requirement expressed through the following SQL:


rem
rem     Script:         iot_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2019
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

insert
        /*+
                qb_name(insert)
        */
into t_iot(
        id, inst_id, nr_time,
        o_time, st, null_col, svname
)
select
        /*+
                qb_name(main)
                unnest(@subq)
                leading(@sel$a93afaed apar@main ob@subq)
                use_hash(@sel$a93afaed ob@subq)
                swap_join_inputs(@sel$a93afaed ob@subq)
                index_ss_asc(@sel$a93afaed ob@subq (t_iot.st t_iot.inst_id t_iot.svname))
        */
        apar.id,
        'UP',
        to_date('2019-10-24','yyyy-mm-dd'),
        to_date('1969-12-31','yyyy-mm-dd'),
        'IDLE',
        null,
        'tkt007.jj.bb.com'
from
        t_base apar
where
        apar.id not in (
                select
                        /*+
                                qb_name(subq)
                        */
                        id
                from
                        t_iot ob
                where
                        inst_id = 'UP'
        )
and     nvl(apar.gp_nm,'UA') = 'UA'
and     rownum <= 5000
/

The requirement is simple – insert into table t_iot a set of values dictated by a subset of the rows in table t_base if they do not already exist in t_iot. To model the issue that appeared I’ve had to hint the SQL to get the following plan (which I pulled from memory after enabling rowsource execution stats):


---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name        | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |             |      1 |        |   296 (100)|      0 |00:00:00.03 |     788 |    148 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL | T_IOT       |      1 |        |            |      0 |00:00:00.03 |     788 |    148 |       |       |          |
|*  2 |   COUNT STOPKEY          |             |      1 |        |            |    100 |00:00:00.03 |      99 |     91 |       |       |          |
|*  3 |    HASH JOIN RIGHT ANTI  |             |      1 |    100 |   296   (2)|    100 |00:00:00.03 |      99 |     91 |    14M|  1843K|   15M (0)|
|*  4 |     INDEX SKIP SCAN      | T_IOT_STATE |      1 |  12614 |   102   (0)|  10000 |00:00:00.01 |      92 |     91 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | T_BASE      |      1 |    100 |     2   (0)|    100 |00:00:00.01 |       7 |      0 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(ROWNUM<=5000)
   3 - access("APAR"."ID"="ID")
   4 - access("INST_ID"='UP')
       filter("INST_ID"='UP')
   5 - filter(NVL("APAR"."GP_NM",'UA')='UA')

The optimizer has unnested (as hinted) the subquery and converted it to an anti-join using a right hash anti-join. Take a look at the Used-Mem for the hash join – would it surprise you to learn that the total size of the (not compressed in any way) IOT, all its indexes, and the t_base table comes to less than 4 MB? Something dramatically awful has happened in the hash join to generate a requirement of 14MB. (In the case of the OP this appeared as an unexpected 5GB written to the temporary tablespace.)

Before I address the source of the high memory usage, take a close look at the Predicate Information, particularly operation 3, and ask yourself what the definition of index t_iot_state might be. The predicate joins t_base.id to t_iot.id, and here’s the code to create both tables and all the indexes.

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(200),
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/

insert into t_iot
select
        sysdate,
        dbms_random.string('l',10),
        'UP',
        sysdate,
        'IDLE',
        null,
        rpad('x',25,'x')
from
        all_objects
where
        rownum <= 1e4 -- > hint to avoid wordpress format issue
/

create index t_iot_state on t_iot(st, inst_id, svname); 
create index idx2        on t_iot(id, inst_id, svname);

create table t_base(
        id              varchar2(400) not null,
        gp_nm           varchar2(200)
)
/

insert into t_base
select
        dbms_random.string('l',10),
        'UA'
from
        all_objects
where
        rownum <= 100 -- > hint to avoid wordpress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_iot',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 't_base',
                cascade     => true,
                method_opt  => 'for all columns size 1'
        );
end;
/


The index t_iot_state that Oracle has used in the hash join is defined on the columns (st, inst_id, svname) – so the predicate is doing a comparison with a column that’s not in the index! At least, it’s not visibly declared in the index; but this is a secondary index on an IOT, and IOTs don’t have “normal” rowids: the rowid in a secondary index is the value of the primary key (plus a “block guess”). So the columns in the index (even though not declared in the index) are: (st, inst_id, svname, {nr_time, id, inst_id, blockguess}). So this index does supply the required id column.

Side note: you’ll see in the list of columns above that inst_id appears twice. In fact (since Oracle 9, I think) the code to handle secondary indexes has been smart enough to avoid this duplication. If the secondary index contains columns from the primary key then the “rowid” doesn’t store those columns; the code knows how to construct the primary key value from the stored PK columns combined with the needed columns from the index entry. This can make IOTs a very nice choice of implementation for “intersection” tables that are used to represent many-to-many joins between two other tables.
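
You don’t have to take the elimination on trust – as a hedged check, the data dictionary reports whether it has been applied to a secondary index:

select  index_name, iot_redundant_pkey_elim
from    user_indexes
where   table_name = 'T_IOT'
;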

Unfortunately this “rowid” is the explanation for the massive memory demand. Take a look at the “Column Projection Information” for my execution plan:


Column Projection Information (identified by operation id):
-----------------------------------------------------------
   2 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200], ROWNUM[8]
   3 - (#keys=1) "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]
   4 - "OB".ROWID[ROWID,1249], "NR_TIME"[TIMESTAMP,11], "ID"[VARCHAR2,1024], "INST_ID"[VARCHAR2,200], "OB".ROWID[ROWID,1249]
   5 - "APAR"."ID"[VARCHAR2,400], "APAR"."GP_NM"[VARCHAR2,200]

The interesting line is operation 4. A hash join takes the rowsource from its first child (the build table) and creates an in-memory hash table (which may spill to disc, of course), so if I see an unreasonable memory allocation (or unexpected spill to disc) a good starting point is to look at what the first child is supplying. In this case the first child seems to be saying that it’s supplying (or allowing for) nearly 3,700 bytes to be passed up to the hash join.

On closer inspection we can see it’s reporting the “rowid” twice, and also reporting the three columns that make up that rowid. I think it’s reasonable to assume that it’s only supplying the rowid once, and maybe it’s not even supplying the other three columns because they are embedded in the rowid. Doing a quick arithmetic check, let’s multiply the size of the rowid by the value of A-rows: 1,249 * 10,000 = 12,490,000. That’s pretty close to the 14MB reported by the hash join in operation 3.

Hypothesis – to get at the id column, Oracle has used this index (actually a very bad choice of those available) to extract the rowid and then passed the rowid up to the parent in a (length padded) fixed format. Oracle has then created a hash table by extracting the id column from the rowid and building the hash table on it but also carrying the length-padded rowid into the hash table.  Possible variants on this theme are that some or all of the other columns in the Column Projection Information are also passed upwards so that the id doesn’t have to be extracted, but if they are they are not padded to their maximum length.

A simple test that this is broadly the right assumption is to re-run the model making the declared length of the rowid much larger to see what happens to the memory allocation. Changing the inst_id declaration from 200 bytes to 1000 bytes (note the stored value is only the 2 bytes needed for the value ‘UP’) the Used-mem jumps to 23 MB (which is an increment very similar to 800 * 10,000).  You’ll note that I chose to experiment with a column that wasn’t the column used in the join. It was a column in the secondary index definition, though, so another test would be to change the nr_time column from a timestamp (11 bytes) to a large varchar2, so I re-ran the test declaring the nr_time as a varchar2(1000) – reverting the inst_id to varchar2(200) – and the Used-mem increased to 25MB.
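
If you want to repeat the experiment, here’s a sketch of the variation – simply recreate the model with one of the primary key columns declared longer (the stored data doesn’t change), then reload the data, recreate the secondary indexes, re-gather stats and re-run the hinted insert to compare the Used-Mem figure:

drop table t_iot purge;

create table t_iot (
        nr_time         timestamp,
        id              varchar2(1024),
        inst_id         varchar2(1000),         -- was varchar2(200)
        o_time          timestamp,
        st              varchar2(200),
        null_col        varchar2(100),
        svname          varchar2(200),
        constraint t_iot_pk primary key(nr_time, id, inst_id)
)
organization index
/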

Preliminary Conclusion

If Oracle uses the contents of the rowid of a secondary index on an IOT in a join then it constructs a fixed format version for the rowid by padding every column in the primary key to its maximum length and concatenating the results. This can have catastrophic side effects on performance if you’ve declared some very long columns “just in case”. Any time you use index organized tables you should remember to check the Column Projection Information in any execution plans that use secondary indexes in case they are passing a large, padded, primary key through the plan to a point where a blocking operation (such as a hash join or merge join) has to accumulate a large number of rows.
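
As a reminder – a sketch, assuming rowsource execution statistics have been enabled as they were for the plan above – the projection section can be pulled from memory for the most recent statement with:

select * from table(dbms_xplan.display_cursor(format => 'allstats last +projection'));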

Footnote

In my test case I had to hint the query heavily to force Oracle into the path I wanted to demonstrate.

It’s surprising that the optimizer should have chosen this path in the OP’s system, given that there’s another secondary index that contains the necessary columns in its definition. (So one thought is that there’s a statistics problem to address, or possibly the “good” index is subject to updates that make it become very inefficient (large) very quickly.)

Another oddity of the OP’s system was that Oracle should have chosen to do a right hash anti-join when it looked as if joining the tables in the opposite order would produce a much smaller memory demand and lower cost – there was an explicit swap_join_inputs() hint in the Outline Information (so copying the outline into the query and changing that to no_swap_join_inputs() might have been another viable workaround). In the end the OP hinted the query to use a nested loop (anti-)join from t_base to t_iot – which is another way to avoid the hash table threat with padded rowids.
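
For clarity, here’s the hint substitution I have in mind – a hedged sketch, re-using the query block and alias names from the hinted SQL earlier in this note:

        swap_join_inputs(@sel$a93afaed ob@subq)         -- as it appeared in the outline
        no_swap_join_inputs(@sel$a93afaed ob@subq)      -- the substitution to test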

 

July 9, 2019

Assumptions

Filed under: Infrastructure,IOT,Oracle,Philosophy — Jonathan Lewis @ 11:47 am BST Jul 9,2019

Over the last few days I’ve been tweeting little extracts from Practical Oracle 8i, and one of the tweets contained the following quote:

This led to the question:

https://twitter.com/PNosko/status/1148425139572412418

Good question! The whole undo/redo infrastructure in Oracle is probably the most astounding technological achievement in the entire code base – so would you test it to see that it was working properly and if you could break it? Probably not – although if you were about to recreate your undo tablespace with a 32KB block size you might test to see if the change would produce any surprise side-effects; or you might wonder if anything funny could happen to the redo generation if you created all your varchar2() columns as 4000 bytes “just in case”; or possibly you’d check for undo or redo anomalies if you were told to create a table with more than 255 columns.

I don’t know quite what I was trying to imply (20 years ago) when I wrote the quoted sentence. Possibly I was trying to avoid saying “new features”, because it’s not just the new features you need to test. I was probably trying to suggest the flavour of “exotic”, “high-tech”, “exciting” – which basically comes down to the things where you think you might be (h/t Martin Widlake) a “thought leader” or ground-breaker.  If very few people have used some feature of Oracle you might be the first person to use that feature in a specific fashion – so if there’s a surprise (or bug) waiting to be found you’ll be the first to find it and you don’t want to find it in production.

Anything in Oracle might have an odd boundary condition, and life (or the project life-cycle) is too short to test everything – but almost any time you feel you may be going beyond “common usage”, it’s worth thinking about what might go wrong.

As a closing item of entertainment – here’s a little demonstration (last run on 19.2):


rem
rem     Script:         assumptions.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2019
rem     Purpose:
rem
rem     Last tested
rem             19.2.0.0        (LiveSQL)
rem             18.3.0.0
rem
rem     Notes:
rem     Add the predicate "where rownum <= 1600" to test on LiveSQL
rem

create table t1
as
select * from all_objects
/

create table pt1(
        OWNER, OBJECT_NAME, SUBOBJECT_NAME,
        OBJECT_ID constraint pt1_pk primary key using index local,
        DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
        TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME, SHARING, EDITIONABLE,
        ORACLE_MAINTAINED, APPLICATION, DEFAULT_COLLATION, DUPLICATED,
        SHARDED, CREATED_APPID, CREATED_VSNID, MODIFIED_APPID, MODIFIED_VSNID
)
partition by hash (object_id) (
        partition p1,
        partition p2,
        partition p3,
        partition p4
)
as
select * from all_objects
/

create table iot1 (
        OWNER, OBJECT_NAME, SUBOBJECT_NAME,
        OBJECT_ID constraint iot1_pk primary key,
        DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
        TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME, SHARING, EDITIONABLE,
        ORACLE_MAINTAINED, APPLICATION, DEFAULT_COLLATION, DUPLICATED,
        SHARDED, CREATED_APPID, CREATED_VSNID, MODIFIED_APPID, MODIFIED_VSNID
)
organization index
as
select * from all_objects
/

create table ptiot1 (
        OWNER, OBJECT_NAME, SUBOBJECT_NAME,
        OBJECT_ID constraint ptiot1_pk primary key,
        DATA_OBJECT_ID, OBJECT_TYPE, CREATED, LAST_DDL_TIME, TIMESTAMP, STATUS,
        TEMPORARY, GENERATED, SECONDARY, NAMESPACE, EDITION_NAME, SHARING, EDITIONABLE,
        ORACLE_MAINTAINED, APPLICATION, DEFAULT_COLLATION, DUPLICATED,
        SHARDED, CREATED_APPID, CREATED_VSNID, MODIFIED_APPID, MODIFIED_VSNID
)
organization index
partition by hash (object_id) (
        partition p1,
        partition p2,
        partition p3,
        partition p4
)
as
select * from all_objects
/

alter table t1 move online;

alter table pt1 move partition p1 online;

alter table iot1 move online;

alter table ptiot1 move partition p1 online;

It’s a simple test. Copying data from view all_objects I’ve created:

  • A simple heap table
  • A hash partitioned heap table – with locally partitioned primary key index
  • A simple index-organized table
  • A hash partitioned index organized table

Then I’ve issued an online move command for each table. I often lose track of which enhancements to features appeared in which version of Oracle, but I think the following is correct:

  • alter table t1 move online – the online option became possible in 12.2
  • alter table pt1 move partition online – the online option became possible in 12.1
  • alter table iot1 move online – the online option (for IOTs) became possible in Oracle 8i (and gets a mention in Practical Oracle 8i)
  • alter table ptiot1 move partition online – any guesses?

In the absence of my bait-and-switch lead-up to the final question I think you could be forgiven for assuming that you would be able to move a partition of a partitioned index-organized table online – but even in 19.2 you’ll end up with the error message: ORA-14808: table does not support ONLINE MOVE PARTITION.

In a vacuum it’s okay to make the mistake – on the other hand if someone suggested changing a partitioned table in your production system into a partitioned IOT it ought to be one of the first things you’d check (on a small model). Sadly I have been in design meetings where weeks of effort have been spent on producing a detailed design that can’t possibly work because no-one checked to see if some critical detail (like online move of IOT partitions) was actually possible – and that’s the background for the statement:

“If you’re going to depend on a technological feature of Oracle, you need to make sure that you have tried to break it, in half a dozen ways, before you use it in production.”

There are many technological features of Oracle that you can assume (safely) have been tested by many other people – when you get to the edge of the known universe your watchword should be: Here be Dragons.

 

March 11, 2019

sys_op_lbid

Filed under: Indexing,Infrastructure,IOT,Oracle,Statistics — Jonathan Lewis @ 1:23 pm GMT Mar 11,2019

I’ve made use of the sys_op_lbid() function a few times in the past, for example in this posting on the dangers of using reverse key indexes, but every time I’ve mentioned it I’ve only been interested in the “leaf blocks per key” option. There are actually at least four different variations of the function, relevant to different types of index and controlled by setting a flag parameter to one of 4 different values.

The call to sys_op_lbid() takes 3 parameters: an index (or index [sub]partition) object id, a flag value, and a table “rowid”, where the flag value can be one of L, R, O, or G. The variations of the call are as follows:

  • L – the function will return the row directory address (i.e. something that looks like a rowid) of the first index entry in the leaf block that holds the index entry for the referenced table rowid. The effect of this is that the number of distinct values returned by calling the function for every row in the table is equal to the number of index leaf blocks which currently hold an active entry. (A sketch of this usage appears just after this list.)
  • R – Relevant only to bitmap indexes; the function will return the row directory address of the bitmap index entry for the referenced table rowid. The effect of this is that the number of distinct values returned by calling the function for every row in the table is equal to the number of index entries in the bitmap index.
  • O – Relevant only to the primary key index of an index organized table with an overflow. The function is used with a non-key column instead of a rowid and returns a rowid that corresponds to the row directory entry in the overflow segment. An interesting detail of the overflow entries is that there is an “nrid” (next rowid) pointer in the primary key index entry that does not get deleted when all the columns in the related overflow entry are set null – so you can delete all the data from the overflow (set every overflow column in every row to null) and the primary key clustering factor would not change.
  • G – Relevant only to secondary indexes on an index organized table. Like the L and R options this function takes a rowid (which is a special case for IOTs) as one of its inputs and uses the block guess from the secondary index to construct a row directory entry for the first entry in the primary key leaf block that corresponds to that block guess. This serves two purposes – it allows Oracle to calculate the clustering factor of the secondary index (as you walk the secondary index in order, how much do you jump around the leaf blocks of the primary key), and it allows Oracle to produce the pct_direct_access figure for the secondary index by joining the secondary index to the primary key index on primary key, and comparing the ‘G’ result for the secondary with the ‘L’ result from the primary, which gives a count of the number of times the guess is correct.
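
As a taster before the script below, here’s a minimal sketch of the ‘L’ usage against the secondary index that the script creates – the count of distinct values is the number of leaf blocks of t1_i1 that currently hold an active entry. The &t1_i1_object_id substitution variable is a placeholder of mine: you have to supply the index’s object_id, which you can find in user_objects.

select  count(distinct sys_op_lbid(&t1_i1_object_id, 'L', t.rowid)) leaf_block_count
from    t1 t
where   v3 is not null
;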

These observations can be confirmed by gathering stats on different structures with trace enabled, and doing a couple of block dumps. For reference the following is just a simple script to create an index organized table with overflow and secondary index:


rem
rem     Script:         sys_op_lbid_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2018
rem

create table t1(
        id      constraint t1_pk primary key,
        v1      ,
        v2      ,
        v3      ,
        padding 
)
organization index
pctthreshold 2
overflow
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum,
        lpad(rownum,30),
        lpad(rownum,30),
        lpad(rownum,40),
        rpad('x',100,'x')
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4 -- > comment to avoid WordPress format issue
;

create index t1_i1 on t1(v3);

alter session set sql_trace true;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter session set sql_trace false;

select
        object_id, data_object_id, object_name
from
        user_objects
order by
        object_id
;

The significance of the query for object_id and data_object_id shows up in the trace file (and subsequent dumps) when Oracle uses one or other of the values in its SQL and rowid construction.

Here are the interesting SQL statements generated as the stats are gathered – cosmetically altered to be reader-friendly. In order they are:

  1. Stats for primary key of IOT: using the ‘L’ option for counting leaf blocks and the ‘O’ option for the clustering_factor into overflow segment.
  2. Stats for secondary index of IOT: using the ‘L’ option for counting leaf blocks and the ‘G’ option for the clustering_factor into the primary key index
  3. Calculate pct_direct_access: the ‘L’ option gives the actual leaf block in the primary key index, the ‘G’ option gives the leaf block guessed by the secondary index

select 
        /*+ index(t,t1_pk) */ 
        count(*) as nrw,
        count(distinct sys_op_lbid(351334,'L',t.rowid)) as nlb,
        null as ndk,
        sys_op_countchg(sys_op_lbid(351334,'O',V1),1) as clf
from
        t1 t 
where 
        id is not null
;


select 
        /*+ index(t,t1_i1) */ 
        count(*) as nrw,
        count(distinct sys_op_lbid(351335,'L',t.rowid)) as nlb,
        null as ndk,
        sys_op_countchg(sys_op_lbid(351335,'G',t.rowid),1) as clf
from
        t1 t 
where 
        v3 is not null
;


select
        case when count(*) = 0
                then 100
                else round(
                        count(
                                case when substr(gdba,7,9)=substr(lbid,7,9)
                                        then 1
                                        else null
                                end
                        )/count(*)*100
                )
        end
from    (
        select
                /*+
                        ordered
                        use_hash(i.t1 t2)
                        index_ffs(t2,t1_pk)
                */
                sys_op_lbid(351334,'L',t2.rowid) lbid,
                gdba
        from (
                select
                        /*+ index_ffs(t1,t1_i1) */
                        sys_op_lbid(351335,'G',t1.rowid) gdba,
                        t1.ID
                from
                        t1 t1
                ) i,
                t1 t2
        where
                i.id = t2.id
        )
;

The strange substr(,7,9) that appears in the join between the primary key index and the secondary index is needed because the ‘G’ option uses the object_id of the table to turn an absolute block guess into a rowid while the ‘L’ option is using the data_object_id of the primary key index to turn its block address into a rowid. (This means there may be variants of this SQL for IOTs using partitioning.)
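
If you want to see why characters 7 to 15 of a (heap table) rowid are the interesting ones, here’s a minimal sketch – the extended rowid is laid out as OOOOOOFFFBBBBBBRRR (data object id, relative file number, block number, row number), so substr(rowid,7,9) picks out the file and block while ignoring the object id and row number:

select
        rowid,
        substr(rowid, 1, 6)     object_part,
        substr(rowid, 7, 3)     file_part,
        substr(rowid,10, 6)     block_part,
        substr(rowid,16, 3)     row_part
from
        dual
;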

 

July 16, 2018

Direct IOT

Filed under: 12c,Infrastructure,IOT,Oracle — Jonathan Lewis @ 1:02 pm BST Jul 16,2018

A recent (automatic?) tweet from Connor McDonald highlighted an article he’d written a couple of years ago about an enhancement introduced in 12c that allowed for direct path loads to index organized tables (IOTs). The article included a demonstration that seemed to suggest that direct path loads to IOTs were of no benefit and ended with the comment (which could be applied to any Oracle feature):

“Direct mode insert is a very cool facility but it doesn’t mean that it’s going to be the best option in every situation.”

Clearly it’s necessary to pose the question – “so when would direct mode insert be a good option for IOTs?” – because if it’s never a good option you have to wonder why it has been implemented. This naturally leads on to thinking about which tests have not yet been done – what aspects of IOTs did Connor not get round to examining in his article. (That’s a standard principle of trouble-shooting or testing or investigation: when someone shows you a test case (or when you think you’ve finished testing) one thing you should do before taking the results as gospel is to ask yourself what possible scenarios have not been covered by the test.)

So if you think “IOT” what are the obvious tests once you’ve got past the initial step of loading the IOT and seeing what happens?

  • First, I think, would be “What if the IOT weren’t empty before the test started”
  • Second would be “IOTs can have overflow segments, what impact might one have?”
  • Third would be “Do secondary indexes introduce any side effects?”
  • Finally “What happens with bitmap indexes and the requirement for a mapping table?”

(Then, of course, you can worry about mixing all the different possibilities together – but for the purposes of this note I’m just going to play with two simple examples: non-empty starting tables, and overflow segments.)

Here’s some code to define a suitable table:

rem
rem     Script:         122_direct_iot.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jun 2018
rem
rem     Last tested 
rem             12.2.0.1

create table t2 
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        3 * rownum                      id,
        lpad(rownum,10,'0')             v1,
        lpad('x',50,'x')                padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
order by
        dbms_random.value
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T2',
                method_opt  => 'for all columns size 1'
        );
end;
/

create table t1(
        id,
        v1,
        padding,
        constraint t1_pk primary key(id)
)
organization index
-- including v1
-- overflow
nologging
as
select * from t2
;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve created a heap table t2 that holds 100,000 rows with an id column that arrives randomly ordered; then I’ve used this table as a source to create an IOT (called t1), with the option to have an overflow segment that contains just the 50 character padding column.

I’ve used 3 * rownum to define the id column for t2 so that when I insert another copy of t2 into t1 I can add 1 (or 2) to the t2.id and interleave the new data with the old data. (That’s another thought about IOT testing – are you loading your data in a pre-existing order that suits the special nature of IOTs or is it arriving in a way that’s badly out of order with respect to the natural ordering of the IOT; and does your data go in above the current high value, or is it spread across the whole range, or do you have a partial overlap with the top end of the range and then run on above it?)

Having created the starting data set, here’s the test:


execute snap_my_stats.start_snap
execute snap_events.start_snap

insert 
        /*  append */
into t1
select
        id + 1, v1, padding
from
        t2
;


execute snap_events.end_snap
execute snap_my_stats.end_snap

All I’m doing is using a couple of my snapshot packages to check the work done and time spent while inserting 100,000 interleaved rows – which are supplied out of order – into the existing table. In the text above the “append” is a comment, not a hint, so I’ll be running the test case a total of 4 times:

  • no overflow, with and without the hint
  • with the overflow, with and without the hint

(Then, of course, I could run the test without the overflow but with an index on v1 – i.e. testing the effect of secondary indexes.)
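
If you don’t have my snapshot packages to hand, a rough (hedged) substitute is to run a query like the following immediately before and after the insert and take the difference between the two sets of values:

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     sn.name in (
                'CPU used by this session', 'DB time',
                'redo entries', 'redo size', 'sorts (rows)'
        )
;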

Here are some summary figures from the tests – first from the test without an overflow segment:

                                      Unhinted       With Append
                                  ============      ============
CPU used when call started                 153               102
CPU used by this session                   153               102
DB time                                    166               139

redo entries                           130,603            42,209
redo size                           78,315,064        65,055,376

sorts (rows)                                30           100,031

You’ll notice that with the /*+ append */ hint in place there’s a noticeable reduction in redo entries and CPU time, but this has been achieved at a cost of sorting the incoming data into order. The reduction in redo (entries and size) is due to an “array insert” effect that Oracle can take advantage of with the delayed index maintenance that takes place when the append hint is legal (See the section labelled Option 4 in this note). So even with an IOT with no overflow there’s a potential benefit to gain from direct path loading that depends on how much the new data overlaps the old data, and there’s a penalty that depends on the amount of sorting you’d have to do.
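
A quick way to check whether the append hint was actually obeyed – a hedged sketch, since the hint is silently ignored in some circumstances – is to pull the plan for the insert from memory immediately after running it and look for LOAD AS SELECT rather than LOAD TABLE CONVENTIONAL:

select * from table(dbms_xplan.display_cursor);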

What happens in my case when I move the big padding column out to an overflow segment – here are the equivalent results:


Headline figures                      Unhinted       With Append
================                  ============      ============
CPU used when call started                 158                52
CPU used by this session                   158                52
DB time                                    163                94
redo entries                           116,669            16,690
redo size                           51,392,748        26,741,868
sorts (memory)                               4                 5
sorts (rows)                                33           100,032

Interestingly, comparing the new unhinted results with the previous unhinted results, there’s little difference in CPU usage between having the padding column in the “TOP” section of the IOT and having it in the overflow segment, though there is a significant reduction in redo (the index entries are still going all over the place one by one and causing leaf block splits, but the overflow blocks are being pinned and packed much more efficiently). The difference between having the append hint and not having it, though, is dramatic: we drop to one third of the CPU time (despite still having 100,000 rows to sort) and half the redo. One of the side effects of the overflow, of course, is that the things being sorted are much shorter (only the id and v1 columns that go into the TOP section, not the whole IOT row).

So, if you already have an overflow segment that caters for a significant percentage of the row it looks as if the benefit you could get from using the /*+ append */ hint could far outweigh the penalty you have to pay in sorting. Of course, an IOT with a large overflow doesn’t look much different from a heap table with index – so perhaps that result isn’t very surprising.

I’ll close by re-iterating Connor’s closing comment:

Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.

Before you dive in and embrace it, or ruthlessly push it to one side, make sure you do some testing that reflects the situations you have to handle.

 

September 29, 2016

IOT limitation

Filed under: Execution plans,Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:17 am BST Sep 29,2016

In the right circumstances Index Organized Tables (IOTs) give us tremendous benefits – provided you use them in the ideal fashion. Like so many features in Oracle, though, you often have to compromise between the benefit you really need and the cost of the side effect that a feature produces.

The fundamental design targets for an IOT are that you have short rows and only want to access them through index range scans of the primary key. The basic price you pay for optimised access is the extra work you have to do as you insert the data. Anything you do outside the two specific targets is likely to lead to increased costs of using the IOT – and there’s one particular threat that I’ve mentioned twice in the past (here and here). I want to mention it one more time with a focus on client code and reporting.

rem
rem     Script:         iot_threat.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2014
rem 

create table iot1 (
        id1     number(7,0),
        id2     number(7,0),
        v1      varchar2(10),
        v2      varchar2(10),
        padding varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum,311)                 id1,
        mod(rownum,337)                 id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum/100))      v2,
        rpad('x',500,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 --> comment to bypass wordpress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'IOT1',
                cascade          => true,
                method_opt       => 'for all columns size 1'
        );
end;
/

alter system flush buffer_cache;

select table_name, blocks from user_tables where table_name = 'IOT1' or table_name like 'SYS_IOT_OVER%';
select index_name, leaf_blocks from user_indexes where table_name = 'IOT1';

set autotrace traceonly
select max(v2) from iot1;
set autotrace off

I’ve created an index organized table with an overflow. The table definition places all columns after the id2 column into the overflow segment. After collecting stats I’ve then queried the table with a query that, for a heap table, would produce a tablescan as the execution plan. But there is no “table”, there is only an index for an IOT. Here’s the output I get (results from 11g and 12c are very similar):

TABLE_NAME               BLOCKS
-------------------- ----------
SYS_IOT_OVER_151543        8074
IOT1

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
IOT1_PK                      504

---------------------------------------------------------------------------------
| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |         |     1 |     4 | 99793   (1)| 00:00:04 |
|   1 |  SORT AGGREGATE       |         |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IOT1_PK |   100K|   390K| 99793   (1)| 00:00:04 |
---------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
     100376  consistent gets
       8052  physical reads

The index segment has 504 leaf blocks; the overflow segment has 8,074 used blocks below the high water mark. The plan claims an index fast full scan of the index segment – but the statistic for physical reads makes it look more like a “tablescan” of the overflow segment. What’s actually happening?

The 100,000+ consistent reads should tell you what’s happening – we really are doing an index fast full scan on the index segment, and for each index entry we go to the overflow segment to find the v2 value. Oracle doesn’t have a mechanism for doing a “tablescan” of just the overflow segment – even though the definition of the IOT looks as if it might be telling Oracle exactly which columns are in the overflow.
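If you want to see the scale of the difference for yourself, a simple comparison (not part of the original test) is to aggregate a column that lives in the IOT_TOP and then one that lives in the overflow, and compare the consistent gets that autotrace reports for the two queries:

set autotrace traceonly statistics

select max(id2) from iot1;      -- id2 is in the IOT_TOP: consistent gets roughly match the index size
select max(v2)  from iot1;      -- v2 is in the overflow: roughly one extra consistent get per row

set autotrace off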

In my particular test Oracle reported a significant number of “db file scattered read” waits against the overflow segment, but these were for “prefetch warmup”; in a normal system with a buffer cache full of other data this wouldn’t have happened. The other interesting statistic that showed up was “table fetch continued row” – which was (close to) 100,000, again highlighting that we weren’t doing a normal full tablescan.

In terms of normal query processing this anomaly of attempted “tablescans” being index driven probably isn’t an issue but, as I pointed out in one of my earlier posts on the topic, when Oracle gathers stats on the “table” using the approximate_ndv mechanism it will do a “full tablescan”. If you have a very large IOT with an overflow segment this could be a very slow process – especially if you’ve engineered the IOT for the right reason, viz: the data arrives in the wrong order relative to the order you want to query it, and you’ve kept the rows in the IOT_TOP short by dumping the rarely used data in the overflow. With this in mind you might want to make sure that you write a bit of special code that gathers stats only on the columns you know to be in the IOT_TOP, and creates representative numbers for the other columns, then locks the stats until the next time you want to refresh them.
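As a sketch of that suggestion, and bearing in mind the Feb 2021 update below which explains why it doesn’t avoid all the work, the special code might gather column stats only for the IOT_TOP columns, fake some representative figures for an overflow column with dbms_stats.set_column_stats(), then lock the stats. The figures supplied below are purely illustrative placeholders.

begin
        -- gather column stats only for the columns known to be in the IOT_TOP
        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'IOT1',
                method_opt      => 'for columns size 1 id1 id2',
                cascade         => true
        );

        -- fake representative figures for a column that lives in the overflow segment
        dbms_stats.set_column_stats(
                ownname         => user,
                tabname         => 'IOT1',
                colname         => 'V2',
                distcnt         => 1000,
                density         => 1/1000,
                nullcnt         => 0,
                avgclen         => 4
        );

        -- stop the automatic stats job from redoing the expensive work
        dbms_stats.lock_table_stats(user, 'IOT1');
end;
/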

Update (Feb 2021)

Whenever I come across pages where I’ve suggested limiting the columns you collect stats on I’ve had to add a link to a note I wrote in September 2018 pointing out that Oracle still counts the number of non-null values for every column even if you’re trying to collect stats on just a couple of columns – so there’s currently no cheap way to “collect stats for the IOT_TOP, fake them for the overflow”.

 

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm GMT Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:


create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)
    )
;

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem: to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function. The calculated values themselves are stored in the index but not in the table, yet it’s important that the optimizer has statistics about those (physically non-existent) column values.
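If you want to see the supporting column for yourself, the data dictionary will show it; the xxx_tab_cols views report hidden columns while xxx_tab_columns filters them out. A sketch, using the table name from the OTN example (the same query works for any table with a function-based index):

select
        column_name, hidden_column, virtual_column, data_default
from
        user_tab_cols
where
        table_name = 'S_RMP_EVALUATION_CSC_MESSAGE'
and     hidden_column = 'YES'
;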

(more…)

November 27, 2012

IOT Load

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 5:15 pm GMT Nov 27,2012

When I introduced Connor McDonald’s blog a few days ago, it was because we had exchanged a couple of email messages (through the Oak Table Network) about how to minimise the resource consumption when copying a load of data from one IOT to another of the same structure. His problem was the way in which the obvious way of copying the data resulted in a massive sort even though, in principle, it should not have been necessary to sort anything since the data could have been extracted in order by walking the existing IOT.

As a suggestion I referenced a comment I had made in the Addenda to Practical Oracle 8i about 12 years ago when I had first solved the problem of loading an IOT with minimal logging and no sorting. At the time I had been loading data from a sorted file into an empty table that was then going to be exchanged into a partitioned IOT – but it crossed my mind that loading from a flat file and loading from a UNIX pipe were pretty much the same thing, so perhaps Connor could workaround his problem by making one session spool to a pipe while another session was reading it. In the end, he simply created a massive temporary tablespace, but I thought I’d modify a test script I wrote a few years ago to demonstrate my idea – and here it is:

(more…)

December 11, 2011

IOT Trap

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 6:04 pm GMT Dec 11,2011

In a recent question on OTN someone asked why Oracle had put some columns into the overflow segment of an IOT when they had specified that they should be in the main index section (the “IOT_TOP”) by using the including clause.

The answer is simple and devious; there’s a little trap hidden in the including clause. It tells Oracle which columns to include, but it gets applied only after Oracle has physically re-arranged the column ordering (internally) to put the primary key columns first. The OP had declared the fifth column of the primary key after several extra columns that he wanted in the index section, but Oracle moved that primary key column up to the fifth position in the internal table definition, so the including clause cut off before it reached the desired extra columns.
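A hedged sketch of the trap (table and column names invented for illustration): the intention below is to keep x1 and x2 in the IOT_TOP, but because Oracle internally reorders the columns to (id1, id2, x1, x2, padding) before applying the including clause, “including id2” leaves only the two primary key columns in the IOT_TOP and sends x1 and x2 to the overflow along with padding.

create table iot_demo (
        id1     number,
        x1      varchar2(10),
        x2      varchar2(10),
        id2     number,
        padding varchar2(100),
        constraint iot_demo_pk primary key (id1, id2)
)
organization index
including id2           -- intended: id1, x1, x2, id2 in the IOT_TOP
overflow                -- actual:   only id1, id2 in the IOT_TOP
;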
(more…)

November 27, 2011

IOT Answer

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:03 pm GMT Nov 27,2011

It was good to see answers accumulating for the Question on IOTs I posted a couple of days ago. The problem posed was simply this: I have two IOTs and I’ve inserted the same data into them with the same “insert as select” statement. Can you explain the cost of a particular query (which is the same for both tables) and the extreme difference in the work actually done? Here’s the query, the critical stats on the primary key indexes, the shared plan, and the critical execution statistics for running the queries.
(more…)

November 25, 2011

Quiz Night

Filed under: Indexing,Infrastructure,IOT,Oracle — Jonathan Lewis @ 5:05 pm GMT Nov 25,2011

Inspired by Martin Widlake’s series on IOTs, I thought I’d throw out this little item. In the following, run against 10.2.0.3, tables t3 and t4 are index organized tables, in the same tablespace, with a primary key of (id1, id2) in that order.
(more…)

November 22, 2011

IOTs

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 9:51 am GMT Nov 22,2011

Updated 2021 with a catalogue of articles that I’ve written about IOTs since I first posted this note.

That’s Index Organized Tables, of course. Searching back through my blog I find that I’ve only written a couple of articles about IOTs although I’m very keen on taking advantage of them and have made a few references to them in other articles. Rather than addressing this oversight myself, I thought I’d direct you to a series on IOTs by Martin Widlake.

Updated Feb 2014 with another worthwhile catalogue of articles

Richard Foote’s array of articles on IOTs

Update Feb 2021 – a catalogue of my own IOT articles in date order:

March 3, 2011

Index Rebuilds

Filed under: Index Rebuilds,Indexing,Infrastructure,IOT,Oracle,Performance — Jonathan Lewis @ 6:43 pm GMT Mar 3,2011

A couple of days ago I found several referrals coming in from a question about indexing on the Russian Oracle Forum. Reading the thread I found a pointer to a comment I’d written for the Oracle-L list server a couple of years ago about Advanced Queueing and why you might find that it was necessary to rebuild the IOTs (index organized tables) that support AQ.

The queue tables are, of course, a perfect example of what I call the “FIFO” index so it’s not a surprise that they might need special consideration. Rather than rewrite the whole note I’ll just link to it from here. (One of the notes in the rest of the Oracle-L thread also points to MOS document 271855.1 which describes the whys and hows of rebuilding AQ tables.)

June 5, 2009

Online Rebuild

Filed under: Index Rebuilds,Indexing,Infrastructure,IOT,Oracle,Troubleshooting — Jonathan Lewis @ 7:44 pm BST Jun 5,2009

Here’s a little oddity that may be waiting to catch you out – but only if you like to create indexes with very long keys.

rem
rem     Script:         index_rebuild.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2009
rem

create table t1(
        v1      varchar2(4000),
        v2      varchar2(2387),
        v3      varchar2(100)
);

create index t1_i1 on t1(v1, v2);

alter index t1_i1 rebuild;
alter index t1_i1 rebuild online;

My key value is at the limit for an 8KB block size in Oracle 9i and later – which is roughly 80% of (block size – 190 bytes). In earlier versions of Oracle (prior to 9i) the limit was roughly half that (i.e. 40% rather than 80%).
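As a rough worked check of that arithmetic for an 8KB block (the 6398 byte figure is the key length limit usually reported by ORA-01450 on an 8KB block size, quoted from memory rather than taken from this post):

0.8 * (8192 - 190) = 6401.6     -- "roughly 80% of (block size - 190 bytes)"
4000 + 2387        = 6387       -- v1 + v2, just inside the commonly quoted 6398 byte limit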

(more…)

October 28, 2008

IOTs and blocksize

Filed under: Block Size,Infrastructure,IOT,Oracle,Performance,Tuning — Jonathan Lewis @ 7:17 pm GMT Oct 28,2008

A question came up on the Oracle database forum a few months ago asking:

What are the benefits and the downside of using IOTs on 16k blocks? Would you recommend it?

I think the best response to the generic question about block sizing came from Greg Rahn in another thread on the forum:

If someone has to ask what block size they need, the answer is always 8KB.[1]

(more…)

