Oracle Scratchpad

July 16, 2018

Direct IOT

Filed under: 12c,Infrastructure,IOT,Oracle — Jonathan Lewis @ 1:02 pm GMT Jul 16,2018

A recent (automatic ?) tweet from Connor McDonald highlighted an article he’d written a couple of years ago about an enhancement introduced in 12c that allowed for direct path inserts to index organized tables (IOTs). The article included a demonstration seemed to suggest that direct path loads to IOTs were of no benefit, and ended with the comment (which could be applied to any Oracle feature): “Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.”

Clearly it’s necessary to pose the question – “so when would direct mode insert be a good option for IOTs?” – because if it’s never a good option you have to wonder why it has been implemented. This naturally leads on to thinking about which tests have not yet been done – what aspects of IOTs did Connor not get round to examining in his article. (That’s a standard principle of trouble-shooting, or testing, or investigation: when someone shows you a test case (or when you think you’ve finished testing) one thing you should do before taking the results as gospel is to ask yourself what possible scenarios have not been covered by the test.)

So if you say IOT what are the obvious tests once you’ve got past the initial step of loading the IOT and seeing what happens. First, I think, would be “What if the IOT weren’t empty before the test started”; second would be “IOTs can have overflow segments, what impact might one have?”; third would be “Do secondary indexes have any effects?”; finally “What happens with bitmap indexes and the requirement for a mapping table?” (Then, of course, you can worry about mixing all the different possibilities together – but for the purposes of this note I’m just going to play with two simple examples: non-empty starting tables, and overflow segments.)

Here’s some code to define a suitable table:

create table t2 
with generator as (
                rownum id
        from dual 
        connect by 
                level <= 1e4 -- > comment to avoid WordPress format issue
	3 * rownum			id,
	lpad(rownum,10,'0')		v1,
	lpad('x',50,'x')		padding
        generator       v1,
        generator       v2
        rownum <= 1e5 -- > comment to avoid WordPress format issue
order by

		ownname     => null,
		tabname     => 'T2',
		method_opt  => 'for all columns size 1'

create table t1(
	constraint t1_pk primary key(id)
organization index
-- including v1
-- overflow
select * from t2

		ownname     => null,
		tabname     => 'T1',
		method_opt  => 'for all columns size 1'

I’ve created a heap table t2 that holds 100,000 rows with an id column that is randomly ordered; then I’ve used this ta1ble as a source to create an IOT, with the option to have an overflow segment that contains just the 100 character padding columns. I’ve used 3 * rownum to define the id column for t2 so that when I insert another copy of t2 into t1 I can add 1 (or 2) to the id and interleave the new data with the old data. (That’s another thought about IOT testing – are you loading your data in a pre-existing order that suits the IOTs or is it arriving in a way that’s badly out of order with respect to the IOT ordering; and does your data go in above the current high value, or spread across the whole range, or have a partial overlap with the top end of the range and then run on above it.)

Have created the starting data set, here’s the test:

execute snap_my_stats.start_snap
execute snap_events.start_snap

	/*  append */
into t1
	id + 1, v1, padding

execute snap_events.end_snap
execute snap_my_stats.end_snap

All I’m doing is using a couple of my snapshot packages to check the work done and time spent while insert 100,000 interleaved rows – which are supplied out of order – into the existing table. As shown the “append” is a comment, not a hint, so I’ll be running the test case a total of 4 times: no overflow, with and without the hint – then with the overflow, with and without the hint. (Then, of course, I could run the test without the overflow but an index on v1).

Here are some summary figures from the tests – first from the test without an overflow segment:

                                      Unhinted       With Append
                                  ============      ============
CPU used when call started                 153               102
CPU used by this session                   153               102
DB time                                    166               139

redo entries                           130,603            42,209
redo size                           78,315,064        65,055,376

sorts (rows)                                30           100,031

You’ll notice that with the /*+ append */ hint in place there’s a noticeable reduction in redo entries and CPU time, but this has been achieved at a cost of sorting the incoming data into order. The reduction in redo (entries and size) is due to an “array insert” effect that Oracle can take advantage of with the delayed index maintenance that takes place when the append hint is legal (See the section labelled Option 4 in this note). So even with an IOT with no overflow there’s a potential benefit to gain from direct path loading that depends on how much the new data overlaps the old data, and there’s a penalty that depends on the amount of sorting you’d have to do.

What happens in my case when I move the big padding column out to an overflow segment – here are the equivalent results:

Headline figures                      Unhinted       With Append
================                  ============      ============
CPU used when call started                 158                52
CPU used by this session                   158                52
DB time                                    163                94
redo entries                           116,669            16,690
redo size                           51,392,748        26,741,868
sorts (memory)                               4                 5
sorts (rows)                                33           100,032

Interestingly, comparing the unhinted results with the previous unhinted results, there’s little difference in the CPU usage between having the padding column in the “TOP” section of the IOT compared to having it in the overflow segment, though there is a significant reduction in redo (the index entries are still going all over the place one by one, but the overflow blocks are being pinned and packed much more efficiently). The difference between having the append hint or not, though, is damatic. One third of the CPU time (despited still having 100,000 rows to sort), and half the redo. One of the side effects of the overflow, of course, is that the things being sorted are much shorted (only the id and v1 columns that go into the TOP section, and not the whole IOT row.

So, if you already have an overflow segment that caters for a significant percentage of the row, it looks as if the benefit you could get from using the /*+ append */ hint would far outweigh the penalty of sorting you have to pay. Of course, an IOT with a large overflow doesn’t look much different from a heap table with index – so perhaps that result isn’t very surprising.

I’ll close by re-iterating Connor’s closing comment:

Direct mode insert is a very cool facility, but it doesn’t mean that it’s going to be the best option in every situation.

Before you dive in and embrace it, or ruthlessly push it to one side, make sure you do some testing that reflects the situations you have to handle.

September 29, 2016

IOT limitation

Filed under: Execution plans,Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:17 am GMT Sep 29,2016

In the right circumstances Index Organized Tables (IOTs) give us tremendous benefits – provided you use them in the ideal fashion. Like so many features in Oracle, though, you often have to compromise between the benefit you really need and the cost of the side effect that a feature produces.

The fundamental design targets for an IOT are that you have short rows and only want to access them through index range scans of the primary key. The basic price you pay for optimised access is the extra work you have to do as you insert the data. Anything you do outside the two specific targets is likely to lead to increased costs of using the IOT – and there’s one particular threat that I’ve mentioned twice in the past (here and here). I want to mention it one more time with a focus on client code and reporting.

create table iot1 (
        id1     number(7.0),
        id2     number(7.0),
        v1      varchar2(10),
        v2      varchar2(10),
        padding varchar2(500),
        constraint iot1_pk primary key(id1, id2)
organization index
including id2

insert into iot1
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
        mod(rownum,311)                 id1,
        mod(rownum,337)                 id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum/100))      v2,
        rpad('x',500,'x')               padding
        generator       v1,
        generator       v2
        rownum <= 1e5 --> comment to bypass wordpress format issue


        dbms_stats.gather_table_stats( ownname => user,
                tabname          => 'IOT1'
                method_opt       => 'for all columns size 1'

alter system flush buffer_cache;

select table_name, blocks from user_tables where table_name = 'IOT1' or table_name like 'SYS_IOT_OVER%';
select index_name, leaf_blocks from user_indexes where table_name = 'IOT1';

set autotrace traceonly
select max(v2) from iot1;
set autotrace off

I’ve created an index organized table with an overflow. The table definition places all columns after the id2 column into the overflow segment. After collecting stats I’ve then queried the table with a query that, for a heap table, would produce a tablescan as the execution plan. But there is no “table”, there is only an index for an IOT. Here’s the output I get (results from 11g and 12c are very similar):

TABLE_NAME               BLOCKS
-------------------- ----------
SYS_IOT_OVER_151543        8074

-------------------- -----------
IOT1_PK                      504

| Id  | Operation             | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT      |         |     1 |     4 | 99793   (1)| 00:00:04 |
|   1 |  SORT AGGREGATE       |         |     1 |     4 |            |          |
|   2 |   INDEX FAST FULL SCAN| IOT1_PK |   100K|   390K| 99793   (1)| 00:00:04 |

     100376  consistent gets
       8052  physical reads

The index segment has 504 leaf blocks, the overflow segment has 8,074 used blocks below the high water mark. The plan claims an index fast full scan of the index segment – but the physical reads statistic looks more like a “table” scan of the overflow segment. What’s actually happening ?

The 100,000+ consistent reads should tell you what’s happening – we really are doing an index fast full scan on the index segment, and for each index entry we go to the overflow segment to find the v2 value. Oracle doesn’t have a mechanism for doing a “tablescan” of just the overflow segment – even though the definition of the IOT ought (apparently) to be telling Oracle exactly which columns are in the overflow.

In my particular test Oracle reported a significant number of “db file scattered read” waits against the overflow segment, but these were for “prefetch warmup”; in a normal system with a buffer cache full of other data this wouldn’t have happened. The other interesting statistic that showed up was “table fetch continued row” – which was (close to) 100,000, again highlighting that we weren’t doing a normal full tablescan.

In terms of normal query processing this anomaly of attempted “tablescans” being index driven probably isn’t an issue but, as I pointed out in one of my earlier posts on the topic, when Oracle gathers stats on the “table” it will do a “full tablescan”. If you have a very large table with an overflow segment it could be a very slow process – especially if you’ve engineered the IOT for the right reason, viz: the data arrives in the wrong order relative to the order you want to query it, and you’ve kept the rows in the IOT_TOP short by dumping the rarely used data in the overflow. With this in mind you might want to make sure that you write a bit of special code that gathers stats only on the columns you know to be in the IOT_TOP, and creates representative numbers for the other columns, then locks the stats until the next time you want to refresh them.


May 9, 2016

RI Locks

Filed under: deadlocks,Indexing,IOT,Locks,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 12:24 pm GMT May 9,2016

RI = Referential Integrity: also known informally as parent/child integrity, and primary (or unique) key/foreign key checking.

I’m on a bit of a roll with things that I must have explained dozens or even hundreds of times in different environments without ever formally explaining them on my blog. Here’s a blog item I could have done with to response to  a question that came up on the OTN database forum over the weekend.

What happens in the following scenario:

-- session 1

create table parent (
        id        number(8,0),
        constraint par_pk primary key(id)

create table child  (
        id_p      number(8,0) not null references parent,
        id_c      number(8,0) not null,
        constraint child_pk primary key(id_p, id_c)

insert into parent values(1);

-- session 2
insert into child values(1,1);

Since the parent row corresponding to the child row doesn’t (yet) seem to exist as far as session 2 is concerned you might expect session 2 to respond immediately with an error message like:

ERROR at line 1:
ORA-02291: integrity constraint (TEST_USER.SYS_C0017926) violated - parent key not found

In fact, although the end-user is not allowed to see the uncommitted parent row, the user’s process can see the uncommitted row and will wait until session 1 commits or rolls back – so if you examine v$lock for the current locks for the two sessions you’d see something like this:

  1  select  sid, type, id1, id2, lmode, request, ctime, block
  2  from    V$lock
  3  where   sid in (select sid from V$session where username = 'TEST_USER')
  4  and     type != 'AE'
  5  order by
  6*         sid, type desc
  7  /

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         3 TX     327709      12584          6          0        283          1
           TM     143734          0          2          0        283          0
           TM     143732          0          3          0        283          0

       250 TX     589829      12877          6          0        240          0
           TX     327709      12584          0          4        240          0
           TM     143734          0          3          0        240          0
           TM     143732          0          3          0        240          0

7 rows selected.

In the above, SID 250 is session 2: it’s holding a transaction lock (TX) in mode 6 because it has acquired an undo segment and has generated some undo, it’s also waiting for a transaction lock in mode 4 (share) and – checking id1 and id2 – we can see that the transaction table entry it’s waiting for is held by session 3 in mode 6 (and we also note that the lock held by session 3 is marked as a blocker).

If session 3 commits (thus releasing the transaction lock) session 250 will continue processing the insert; if session 3 rolls back session 250 will raise error ORA-02291 and roll back its insert statement. (Note: if this were a multi-statement transaction it would only be the insert into child that would be rolled back; that’s another one of those details that is important but often isn’t stated explicitly, leaving people believing that the entire transaction would be rolled back.)

Updates and deletes can produce the same effects. Imagine that we have just created the two tables, and then run the following:

-- session 1
insert into parent values(1);
delete from parent where id = 1;

-- session 2
insert into child values(1,1);

Again session 2 will wait for session 1 to commit or roll back. In this case if session 1 commits session 2 will raise Oracle error ORA-02291, if session 1 rolls back session 2 will continue with the insert.


Whenever you can demonstrate a way of producing a wait chain you can also manage to produce a deadlock. Consider the following (starting, again, from empty tables);

-- (1) session 1
insert into parent values(1);

-- (2) session 2
insert into parent values(2);

-- (3) session 1
insert into child values(2,2);

-- (4)session 2
insert into child values(1,1);

Session 1 will start waiting for session 2 to commit (or rollback) at step 3, then session 2 will start to wait for session 1 at step 4 – with the result that session 1 will recognise the deadlock after about three seconds and rollback its last statement, raising exception ORA-00060 and dumping a trace file. (Note: session 1 will not, as many people think, roll back the entire transaction, it will only roll back the statement that allowed the deadlock to develop). Session 2 will still be waiting for session 1 to commit or rollback its insert into parent. Contrary to the popular claim, Oracle will not “resolve” the deadlock, it will simply break the deadlock leaving one session waiting for the other session to respond appropriately to the deadlock error.

For reference, here’s the deadlock graph (from a 12c trace file) produced by session 1 (SID = 3) for this demo:

Deadlock graph:
                                          ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name                             process session holds waits  process session holds waits
TX-00010017-000026C7-00000000-00000000          6       3     X             33     250           S
TX-000A000D-000026F8-00000000-00000000         33     250     X              6       3           S

session 3: DID 0001-0006-00000004       session 250: DID 0001-0021-00000041
session 250: DID 0001-0021-00000041     session 3: DID 0001-0006-00000004

Rows waited on:
  Session 3: no row
  Session 250: no row

When you see a deadlock graph with TX waits of type S (share, mode 4) it’s a very good bet that the wait has something to do with indexes – which may mean referential integrity as discussed here, but may mean collisions on primary keys, and may mean something to do with simple collisions on index-organized tables. You’ll notice that the “Rows waited on:” section shows no row – unfortunately in earlier versions of Oracle you may find a spurious row entry here because the wait information from some other (block) wait has been left in the relevant columns in v$session.

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm GMT Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:

create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; actual calculated values for the function call are stored in the index but not on the table itself – but it’s important that the optimizer has the statistics about the non-existent column values.


November 27, 2012

IOT Load

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 5:15 pm GMT Nov 27,2012

When I introduced Connor McDonald’s blog a few days ago, it was because we had exchanged a couple of email messages (through the Oak Table Network) about how to minimise the resource consumption when copying a load of data from one IOT to another of the same structure. His problem was the way in which the obvious way of copying the data resulted in a massive sort even though, in principle, it should not have been necessary to sort anything since the data could have been extracted in order by walking the existing IOT.

As a suggestion I referenced a comment I had made in the Addenda to Practical Oracle 8i about 12 years ago when I had first solved the problem of loading an IOT with minimal logging and no sorting. At the time I had been loading data from a sorted file into an empty table that was then going to be exchanged into a partitioned IOT – but it crossed my mind that loading from a flat file and loading from a UNIX pipe were pretty much the same thing, so perhaps Connor could workaround his problem by making one session spool to a pipe while another session was reading it. In the end, he simply created a massive temporary tablespace, but I thought I’d modify a test script I wrote a few years ago to demonstrate my idea – and here it is:


December 11, 2011

IOT Trap

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 6:04 pm GMT Dec 11,2011

In a recent question on OTN someone asked why Oracle had put some columns into the overflow segment of an IOT when they had specified that they should be in the main index section (the “IOT_TOP”) by using the including clause.

The answer is simple and devious; there’s a little trap hidden in the including clause. It tells Oracle which columns to include, but it gets applied only after Oracle has re-arranged the column ordering (internally) to put the primary key columns first. The OP had put the last column of the primary key AFTER the dozen columns in the table that he wanted in the index section, but Oracle moved that column to the fifth position in the internal table definition, so didn’t include the desired 10 extra columns.

November 27, 2011

IOT Answer

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 10:03 pm GMT Nov 27,2011

It was good to see the answers to the last Quiz Night accumulating. The problem posed was simply this: I have two IOTs and I’ve inserted the same data into them with the same “insert as select” statement. Can you explain the cost of a particular query (and it’s the same for both tables) and extreme differences in work actually done. Here’s the query, the critical stats on the primary key indexes, the shared plan, and the critical execution statistic for running the plan.

November 22, 2011


Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 9:51 am GMT Nov 22,2011

That’s Index Organized Tables, of course. Searching back through my blog I find that I’ve only written one article about IOTs– although I’m very keen on taking advantage of them, and have made a few references to them in other articles. Rather than addressing this oversight myself, I thought I’d direct you to a series on IOTs by Martin Widlake.

Updated Feb 2014 with another worthwhile catalogue of articles

Richard Foote’s array of articles on IOTs

October 28, 2008

IOTs and blocksize

Filed under: Block Size,Infrastructure,IOT,Oracle,Performance,Tuning — Jonathan Lewis @ 7:17 pm GMT Oct 28,2008

A question came up on the Oracle database forum a few months ago asking:

What are the benefits and the downside of using IOTs on 16k blocks? Would you recommend it?

I think the best response to the generic question about block sizing came from Greg Rahn in another thread on the forum:

If someone has to ask what block size they need. The answer is always 8KB.***


February 25, 2008

IOTs and 10053

Filed under: CBO,Execution plans,Infrastructure,IOT,Oracle,trace files,Troubleshooting — Jonathan Lewis @ 9:03 pm GMT Feb 25,2008

I’ve probably got a couple of comments about 10053 trace files lurking somewhere on this blog and on my website – and when I mention the 10053 I usually remember to say that it’s a last resort that I only use when I think there may be a bug that needs to be nailed.

So here’s a bit of a 10053 – which I only looked at because I thought it was going to show me a bug.

Powered by