Oracle Scratchpad

October 1, 2014

Shrink Tablespace

Filed under: fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 2:55 pm GMT Oct 1,2014

In a comment on my previous post on shrinking tablespaces Jason Bucata and Karsten Spang both reported problems with small objects that didn’t move to the start of the tablespace. This behaviour is inevitable with dictionary managed tablespaces (regardless of the size of the object), but I don’t think it’s likely to happen with locally managed tablespaces if they’ve been defined with uniform extent sizes. Jason’s comment made me realise, though, that I’d overlooked a feature of system allocated tablespaces that made it much harder to move objects towards the start of file. I’ve created a little demo to illustrate the point.

I created a new tablespace as locally managed, ASSM, and auto-allocate, then created a few tables or various sizes. The following minimal SQL query reports the resulting extents in block_id order, adding in a “boundary_1m” column which subtracts 128 blocks (1MB) from the block_id, then divides by 128 and truncates to show which “User Megabyte” in the file the extent starts in.  (Older versions of Oracle typically have an 8 block space management header, recent versions expanded this from 64KB to 1MB – possibly as a little performance aid to Exadata).


select
        segment_name, block_id, blocks , trunc((block_id - 128)/128) boundary_1M
from
        dba_extents where owner = 'TEST_USER'
order by
        block_id
;

SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T1                              128       1024           0
T1                             1152       1024           8
T2                             2176       1024          16
T2                             3200       1024          24
T3                             4224          8          32
T4                             4232          8          32
T5                             4352        128          33

As you can see t3 and t4 are small tables – 1 extent of 64KB each – and t5, which I created after t4, starts on the next 1MB boundary. This is a feature of auto-allocate: not only are extents (nearly) fixed to a small number of possible extent sizes, the larger extents are restricted to starting on 1MB boundaries and the 64KB extents are used preferentially to fill in odd-sized” holes. To show the impact of this I’m going to drop table t1 (at the start of file) to make some space.


SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T2                             2176       1024          16
T2                             3200       1024          24
T3                             4224          8          32
T4                             4232          8          32
T5                             4352        128          33

Now I’ll move table t3 – hoping that it will move to the start of file and use up some of the space left by t1. However there’s a 1MB area (at boundary 32) which is partially used,  so t3 moves into that space rather than creating a new “partly used” megabyte.


SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T2                             2176       1024          16
T2                             3200       1024          24
T4                             4232          8          32
T3                             4240          8          32
T5                             4352        128          33

It’s a little messy trying to clear up the tiny fragments and make them do what you want. In this case you could, for example, create a dummy table with storage(initial 64K next 64K minextents 14) to use up all the space in the partly used megabyte, then move t3 – which should go to the start of file – then move table t4 – which should go into the first partly-used MB (i.e. start of file) rather than taking up the hole left by t3.

Even for a trivial example it’s messy – imagine how difficult it can get to cycle through building and dropping suitable dummy tables and move objects in the right order when you’ve got objects with several small extents scattered through the file, and objects with a mixture of small extents and large extents.

September 21, 2014

Partitioned Clusters

Filed under: 12c,clusters,Infrastructure,Oracle — Jonathan Lewis @ 7:28 pm GMT Sep 21,2014

In case you hadn’t noticed it, partitioning has finally reached clusters in 12c – specifically 12.1.0.2. They’re limited to hash clusters with range partitioning, but it may be enough to encourage more people to use the technology. Here’s a simple example of the syntax:


create cluster pt_hash_cluster (
        id              number(8,0),
        d_date          date,
        small_vc        varchar2(8),
        padding         varchar2(100)
)
-- single table
hashkeys 10000
hash is id
size 700
partition by range (d_date) (
        partition p2011Jan values less than (to_date('01-Feb-2011','dd-mon-yyyy')),
        partition p2011Feb values less than (to_date('01-Mar-2011','dd-mon-yyyy')),
        partition p2011Mar values less than (to_date('01-Apr-2011','dd-mon-yyyy'))
)
;

I’ve been waiting for them to appear ever since 11.2.0.1 and the TPC-C benchmark that Oracle did with them – they’ve been a long time coming (check the partition dates – that gives you some idea of when I wrote this example).

Just to add choice (a.k.a. confusion) 12.1.0.2 has also introduce attribute clustering so you can cluster data in single tables without creating clusters – but only while doing direct path loads or table moves. The performance intent is similar, though the technology and circumstances of use are different.

September 19, 2014

Shrink Tablespace

Filed under: fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 12:10 pm GMT Sep 19,2014

A recent question on the OTN database forum raised the topic of returning free space in a tablespace to the operating system by rebuilding objects to fill the gaps near the start of files and leave the empty space at the ends of files so that the files could be resized downwards.

This isn’t a process that you’re likely to need frequently, but I have written a couple of notes about it, including a sample query to produce a map of the free and used space in a tablespace. While reading the thread, though, it crossed my mind that recent versions of Oracle introduced a feature that can reduce the amount of work needed to get the job done, so I thought I’d demonstrate the point here.

When you move a table its indexes become unusable and have to be rebuilt; but when an index becomes unusable, the more recent versions of Oracle will drop the segment. Here’s a key point – if the index becomes unusable because the table has been moved the segment is dropped only AFTER the move has completed. Pause a minute for thought and you realise that the smart thing to do before you move a table is to make its indexes unusable so that they release their space BEFORE you move the table. (This strategy is only relevant if you’ve mixed tables and indexes in the same tablespace and if you’re planning to do all your rebuilds into the same tablespace rather than moving everything into a new tablespace.)

Here are some outputs demonstrating the effect in a 12.1.0.2 database. I have created (and loaded) two tables in a tablespace of 1MB uniform extents, 8KB block size; then I’ve created indexes on the two tables. Running my ts_hwm.sql script I get the following results for that tablespace:


FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 TEST_USER  T1              TABLE
                256         383 TEST_USER  T2              TABLE
                384         511 TEST_USER  T1_I1           INDEX
                512         639 TEST_USER  T2_I1           INDEX
                640      65,535 free       free

Notice that it was a nice new tablespace, so I can see the two tables followed by the two indexes at the start of the tablespaces. If I now move table t1 and re-run the script this is what happens:


alter table t1 move;

FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 free       free
                256         383 TEST_USER  T2              TABLE
                384         511 free       free
                512         639 TEST_USER  T2_I1           INDEX
                640         767 TEST_USER  T1              TABLE
                768      65,535 free       free

Table t1 is now situated past the previous tablespace highwater mark and I have two gaps in the tablespace where t1 and the index t1_i1 used to be.

Repeat the experiment from scratch (drop the tables, purge, etc. to empty the tablespace) but this time mark the index unusable before moving the table and this is what happens:


FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 free       free
                256         383 TEST_USER  T2              TABLE
                384         511 TEST_USER  T1              TABLE
                512         639 TEST_USER  T2_I1           INDEX
                640      65,535 free       free

Table t1 has moved into the space vacated by index t1_i1, so the tablespace highwater mark has not moved up.

If you do feel the need to reclaim space from a tablespace by rebuilding objects, you can find that it’s quite hard to decide the order in which the objects should be moved/rebuilt to minimise the work you (or rather, Oracle) has to do; if you remember that any table you move will release its index space anyway and insert a step to mark those indexes unusable before you move the table you may find it’s much easier to work out a good order for moving the tables.

Footnote: I appreciate that some readers may already take advantage of the necessity of rebuilding indexes by dropping indexes before moving tables – but I think it’s a nice feature that we can now make them unusable and get the same benefit without introducing a risk of error when using a script to recreate an index we’ve dropped.

 

September 8, 2014

ASSM Truncate.

Filed under: ASSM,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 11:34 am GMT Sep 8,2014

Here’s one that started off with a tweet from Kevin Closson, heading towards a finish that shows some interesting effects when you truncate large objects that are using ASSM. To demonstrate the problem I’ve set up a tablespace using system allocation of extents and automatic segment space management (ASSM).  It’s the ASSM that causes the problem, but it requires a mixture of circumstances to create a little surprise.


create
	tablespace test_8k_auto_assm
	datafile	-- OMF
	SIZE 1030M
	autoextend off
	blocksize 8k
	extent management local
	autoallocate
	segment space management auto
;

create table t1 (v1 varchar2(100)) pctfree 99 tablespace test_8k_auto_assm storage(initial 1G);

insert into t1 select user from dual;
commit;

alter system flush buffer_cache;

truncate table t1;

I’ve created a table with an initial definition of 1GB, which means that (in a clean tablespace) the autoallocate option will jump straight to extents of 64MB, with 256 table blocks mapped per bitmap block for a total of 32 bitmap blocks in each 64MB extent. Since I’m running this on 11.2.0.4 and haven’t included “segment creation immediate” in the definition I won’t actually see any extents until I insert the first row.

So here’s the big question – when I truncate this table (using the given command) how much work will Oracle have to do ?

Exchanging notes over twitter (140 char at a time) and working from a model of the initial state, it took a little time to get to understand what was (probably) happening and then produce this silly example – but here’s the output from a snapshot of v$session_event for the session across the truncate:


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
local write wait                                    490           0          83.26        .170          13
enq: RO - fast object reuse                           2           0         104.90      52.451         105
db file sequential read                              47           0           0.05        .001           0
db file parallel read                                 8           0           0.90        .112           0
SQL*Net message to client                            10           0           0.00        .000           0
SQL*Net message from client                          10           0           0.67        .067         153
events in waitclass Other                             2           0           0.04        .018         109

The statistic I want to highlight is the number recorded against “local write wait”: truncating a table of one row we wait for 490 blocks to be written! We also have 8 “db file parallel read”  waits which, according to a 10046 trace file, were reading hundreds of blocks. (I think the most significant time in this test – the RO enqueue wait – may have been waiting for the database writer to complete the work needed for an object checkpoint, but I’m not sure of that.)

The blocks written were the space management bitmap blocks for the extent(s) that remained after the truncate – even the ones that referenced extents above the high water mark for the table. Since we had set the tables initial storage to 1GB, we had a lot of bitmap blocks. At 32 per extent and 16 extents (64MB * 16 = 1GB) we might actually expect something closer to 512 blocks, but actually Oracle had formatted the last extent with only 8 space management blocks. and the first extent had an extra 2 to cater for the level 2 bitmap lock and segment header block giving: 32 * 15 + 8 + 2 = 490.

As you may have seen above, the impact on the test that Kevin was doing was quite dramatic – he had set the initial storage to 128GB (lots of bitmap blocks), partitioned the table (more bitmap blocks) and was running RAC (so the reads were running into waits for global cache grants).

I had assumed that this type of behaviour happened only with the “reuse storage” option of the truncate command: and I hadn’t noticed before that it also appeared even if you didn’t reuse storage – but that’s probably because the effect applies only to the bit you keep, which may typically mean a relatively small first extent. It’s possible, then, that in most cases this is an effect that isn’t going to be particularly visible in production systems – but if it is, can you work around it ? Fortunately another tweeter asked the question “What happens if you ‘drop all storage?'” Here’s the result from adding that clause to my test case:


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: RO - fast object reuse                           1           0           0.08        .079           0
log file sync                                         1           0           0.03        .031           0
db file sequential read                              51           0           0.06        .001           0
SQL*Net message to client                            10           0           0.00        .000           0
SQL*Net message from client                          10           0           0.56        .056         123
events in waitclass Other                             3           0           0.87        .289         186


Looking good – if you don’t keep any extents you don’t need to make sure that their bitmaps are clean. (The “db file sequential read” waits are almost all about the data dictionary, following on from my “flush buffer cache”).

Footnote 1: the same effect appears in 12.1.0.2
Footnote 2: it’s interesting to note that the RO enqueue wait time seems to parallel the local write wait time: perhaps a hint that there’s some double counting going on. (To be investigated, one day).

August 19, 2014

LOB Length

Filed under: Infrastructure,LOBs,Oracle,Performance — Jonathan Lewis @ 6:06 pm GMT Aug 19,2014

It’s funny how you can make little savings in work all over the place in Oracle if you’re prepared to look a little closely at what’s going on. Here’s a quirky little example with LOBs and function calls that might just have some greater relevance in other situations. Here’s a little data set, and two queries that I might run against it:


create table tbl(
	c1      clob
)
lob (c1) store as c_lob(
	disable storage in row
	nocache nologging
)
;

begin
	for i in 1..128 loop
		insert into tbl values(rpad('x',4000));
		commit;
	end loop;
end;
/

commit;

-- collect stats, prime dictionary cache and library cache
-- run both queries twice and check stats on second run

select
	round(sum(ceil(len/8100)) * 8/1024,0)    used_mb
from
	(
	select
		/*+ no_merge */
		dbms_lob.getlength(c1) len
	from
		tbl
	)
where
	len > 3960
;

select
	round(sum(ceil(len/8100)) * 8/1024,0)    used_mb
from
	(
	select
		rownum rn, dbms_lob.getlength(c1) len
	from
		tbl
	)
where
	len > 3960
;

The question that you might ask yourselves when you see these queries is: will they do similar amounts of work. Of course, I wouldn’t be asking the question if the answer were yes. Despite the no_merge() hint, which you might think would have the same effect as the rownum approach, Oracle seems to execute the call to dbms_lob.getlength() twice for each row in the first query, but only once per row for the second query. Here are the stats (from autotrace) on the second run of the two queries when autotrace is enabled:


Statistics (for no_merge)
----------------------------------------------------------
         40  recursive calls
          0  db block gets
        271  consistent gets
          0  physical reads
          0  redo size
        541  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Statistics (for rownum)
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        131  consistent gets
          0  physical reads
          0  redo size
        541  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

As you can see, the consistent gets for the no_merge() approach is roughly double that for the rownum approach – and since we have 128 rows/LOBs in the table that looks suspiciously like 2 gets vs. 1 get per LOB depending on the approach – which suggests two calls to the function. This is further corroborated by the execution plans, and especially by the predicate sections (how often have I said “always check the predicates”) which show that the predicate has been pushed inside the view that’s been hinted to be non-mergeable, but it hasn’t been pushed inside the view that uses the rownum instantion trick:


Execution Plan for no_merge()
----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    13 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    13 |            |          |
|   2 |   VIEW              |      |     6 |    78 |     2   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| TBL  |     6 |   522 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DBMS_LOB"."GETLENGTH"("C1")>3960)

Execution Plan for rownum
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |    13 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE      |      |     1 |    13 |            |          |
|*  2 |   VIEW               |      |   128 |  1664 |     2   (0)| 00:00:01 |
|   3 |    COUNT             |      |       |       |            |          |
|   4 |     TABLE ACCESS FULL| TBL  |   128 | 11136 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("LEN">3960)

My first thought on seeing this difference was to apply the /*+ no_push_pred */ hint to block predicate pushing – but then I remembered that the hint is about join predicate pushing and this is a simple filter push. A quick search of the hidden parameters, though, revealed this:

_optimizer_filter_pushdown : enable/disable filter predicate pushdown

Setting this parameter to false – either through a call to ‘alter session’ or through an /*+ opt_param( opt_param(‘_optimizer_filter_pushdown’ , ‘false’) */ hint – allowed the no_merge approach to produce the same plan and resource usage as the rownum approach. Of course, for a production system, I’d probably use the rownum approach rather than mess around with hidden parameters.

Footnote:

I don’t know why the code with the no_merge() approach reported 40 recursive calls (on its first execution with autotrace). A couple of variations on the experiment suggested that it had something to do with the number of rows (or consequential buffer visits) that survived the predicate call – for a sufficiently small number of rows the recursive call count happened to drop to zero; but the phenomenon needs further investigation.

July 13, 2014

Deferrable RI – 2

Filed under: Infrastructure,Locks,Oracle,Troubleshooting — Jonathan Lewis @ 7:46 pm GMT Jul 13,2014

A question came up on Oracle-L recently about possible locking anomalies with deferrable referential integrity constraints.

An update by primary key is taking a long time; the update sets several columns, one of which is the child end of a referential integrity constraint. A check on v$active_session_history shows lots of waits for “enq: TX – row lock contention” in mode 4 (share), and many of these waits also identify the current object as the index that has been created to avoid the “foreign key locking” problem on this constraint (though many of the waits show the current_obj# as -1). A possible key feature of the issue is that foreign key constraint is defined as “deferrable initially deferred”. The question is, could such a constraint result in TX/4 waits.

My initial thought was that if the constraint was deferrable it was unlikely, there would have to be other features coming into play.

Of course, when the foreign key is NOT deferrable it’s easy to set up cases where a TX/4 appears: for example you insert a new parent value without issuing a commit then I insert a new matching child, at that point my session will wait for your session to commit or rollback. If you commit my insert succeeds if you rollback my session raises an error (ORA-02291: integrity constraint (schema_name.constraint_name) violated – parent key not found). But if the foreign key is deferred the non-existence (or potential existence, or not) of the parent should matter.  If the constraint is deferrable, though, the first guess would be that you could get away with things like this so long as you fixed up the data in time for the commit.

I was wrong. Here’s a little example:


create table parent (
	id	number(4),
	name	varchar2(10),
	constraint par_pk primary key (id)
)
;

create table child(
	id_p	number(4)
		constraint chi_fk_par
		references parent
		deferrable initially deferred,
	id	number(4),
	name	varchar2(10),
	constraint chi_pk primary key (id_p, id)
)
;

insert into parent values (1,'Smith');
insert into parent values (2,'Jones');

insert into child values(1,1,'Simon');
insert into child values(1,2,'Sally');

insert into child values(2,1,'Jack');
insert into child values(2,2,'Jill');

commit;

begin
	dbms_stats.gather_table_stats(user,'parent');
	dbms_stats.gather_table_stats(user,'child');
end;
/

pause Press return

update child set id_p = 3 where id_p = 2 and id = 2;

If you don’t do anything after the pause and before the insert then the update will succeed – but fail on a subsequent commit unless you insert parent 3 before committing. But if you take advantage of the pause to use another session to insert parent 3 first, the update will then hang waiting for the parent insert to commit or rollback – and what happens next may surprise you. Basically the deferrability doesn’t protect you from the side effects of conflicting transactions.

The variations on what can happen next (insert the parent elsewhere, commit or rollback) are interesting and left as an exercise.

I was slightly surprised to find that I had had a conversation about this sort of thing some time ago, triggered by a comment to an earlier post. If you want to read a more thorough investigation of the things that can happen and how deferrable RI works then there’s a good article at this URL.

 

June 17, 2014

Cluster Nulls

Filed under: clusters,Indexing,Infrastructure,NULL,Oracle — Jonathan Lewis @ 7:39 am GMT Jun 17,2014

Yesterday’s posting was a reminder that bitmap indexes, unlike B-tree indexes in Oracle,  do store entries where every column in the index is null. The same is true for cluster indexes – which are implemented as basic B-tree indexes. Here’s a test case I wrote to demonstrate the point.

drop table tc1;
drop cluster c including tables;

purge recyclebin;

create cluster c(val number);
create index c_idx on cluster c;
create table tc1 (val number, n1 number, padding varchar2(100)) cluster c(val);

insert into tc1
select
	decode(rownum,1,to_number(null),rownum), rownum, rpad('x',100)
from
	all_objects
where
	rownum <= 100
;

insert into tc1 select * from tc1;
insert into tc1 select * from tc1;
insert into tc1 select * from tc1;
insert into tc1 select * from tc1;
insert into tc1 select * from tc1;
commit;

analyze cluster c compute statistics;
execute dbms_stats.gather_table_stats(user,'tc1');

set autotrace traceonly explain

select * from tc1 where val = 2;
select * from tc1 where val is null;

set autotrace off

Here are the two execution plans from the above queries:


------------------------------------------------------------------------------
| Id  | Operation            | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |    32 |  3424 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS CLUSTER| TC1   |    32 |  3424 |     1   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN  | C_IDX |     1 |       |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("VAL"=2)

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |    32 |  3424 |    14   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TC1  |    32 |  3424 |    14   (0)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("VAL" IS NULL)

Spot the problem: the second query doesn’t use the index. So, despite the fact that I said that fully null index entries are stored in cluster indexes you might (as the first obvious question) wonder whether or not I was right. So here’s a piece of the symbolic dump of the index.


kdxconro 100                -- Ed: 100 entries (rows) in the leaf block

row#0[8012] flag: -----, lock: 0, data:(8):  02 00 02 0b 00 00 01 00
col 0; len 2; (2):  c1 03

row#1[7999] flag: -----, lock: 0, data:(8):  02 00 02 0c 00 00 01 00
col 0; len 2; (2):  c1 04

...

row#98[6738] flag: -----, lock: 2, data:(8):  02 00 02 6d 00 00 01 00
col 0; len 2; (2):  c2 02

row#99[8025] flag: -----, lock: 0, data:(8):  02 00 02 0a 00 00 01 00
col 0; NULL

The NULL entry is right there – sorted as the high value – just as it is in a bitmap index. But the optimizer won’t use it, even if you hint it.

Of course, Oracle uses it internally when inserting rows (how else, otherwise, would it rapidly find which the heap block needed for the next NULL insertion) – but it won’ use it to retrieve the data, it always uses a full table (cluster) scan. That’s really a little annoying if you’ve made the mistake of allowing nulls into your cluster.

There is a workaround, of course – in this case (depending on version) you could create a virtual column with index or a function-based index that (for example) uses the nvl2() function to convert nulls to a know value and all non-null entries to null; this would give you an index on just the original nulls which would be as small as possible and also have a very good clustering_factor. You’d have to change the code from “column is not null” to a predicate that matched the index, though. For example:


create index tc1_null on tc1(nvl2(val,null,0));

execute dbms_stats.gather_table_stats(user,'tc1', method_opt=>'for all hidden columns size 1');

select * from tc1 where nvl2(val,null,0) = 0;

----------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |    32 |  3456 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TC1      |    32 |  3456 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TC1_NULL |    32 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(NVL2("VAL",NULL,0)=0)

I’ve included the compress keyword (which implicitly means “all columns” for a non-unique index) in the definition since you only need 4 repetitions of the single-byte entry that is the internal representation of zero before you start to win on the storage trade-off – but since it’s likely to be a small index anyway that’s not particularly important. (Reminder: although Oracle collects index stats on index creation, you still need to collect column stats on the hidden column underlying the function-based columns on the index to give the optimizer full information.)

June 16, 2014

Bitmap Nulls

Filed under: bitmaps,Indexing,Infrastructure,NULL,Oracle — Jonathan Lewis @ 9:08 am GMT Jun 16,2014

It’s fairly well known that in Oracle B-tree indexes on heap tables don’t hold entries where all the indexed columns are all null, but that bitmap indexes will hold such entries and execution plans can for predicates like “column is null” can use bitmap indexes. Here’s a little test case to demonstrate the point (I ran this last on 12.1.0.1):


create table t1 (val number, n1 number, padding varchar2(100));

insert into t1
select
	decode(rownum,1,to_number(null),rownum), rownum, rpad('x',100)
from
	all_objects
where
	rownum <= 1000
;

insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;
insert into t1 select * from t1;

commit;

create bitmap index t1_b1 on t1(val);

execute dbms_stats.gather_table_stats(user,'t1');

set autotrace traceonly explain

select * from t1 where val is null;

set autotrace off

The execution plan for this query, in the system I happened to be using, looked like this:


Execution Plan
----------------------------------------------------------
Plan hash value: 1201576309

---------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |    32 |  3488 |     8   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |    32 |  3488 |     8   (0)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |       |       |            |          |
|*  3 |    BITMAP INDEX SINGLE VALUE        | T1_B1 |       |       |            |          |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("VAL" IS NULL)

Note that the predicate section shows us that we used the “column is null” predicate as an access predicate into the index.

Of course, this is a silly little example – I’ve only published it to make a point and to act as a reference case if you ever need to support a claim. Normally we don’t expect a single bitmap index to be a useful way to access a table, we tend to use combinations of bitmap indexes to combine a number of predicates so that we can minimise the trips to a table as efficiently as possible. (And we certainly DON’T create a bitmap index on an OLTP system because it lets us access NULLs by index — OLTP and bitmaps don’t get on well together.)

If you do a symbolic block dump, by the way, you’ll find that the NULL is represented by the special “length byte” of 0xFF with no following data.

April 29, 2014

Bigfiles

Filed under: Bugs,Infrastructure,Oracle — Jonathan Lewis @ 2:12 pm GMT Apr 29,2014

It’s always the combinations that catch you out.

Bigfile tablespaces have their uses – especially in big systems

Materialized views have their users – especially in big systems

There’s absolutely no reason why the two technologies should interfere with each other … until you find a bug !

Running an example, stripped to the bare minimum, and doing a couple of things that I personally don’t like doing, on 11.2.0.4:

drop materialized view t1_mv;
drop table t1;

create table t1 (
	id	number constraint t1_pk primary key,
	n1	number,
	n2	number
)
tablespace bigfile_ts
;

create materialized view log on t1
with
	rowid		-- ugh!
;

create materialized view t1_mv
refresh fast on demand
with rowid		-- ugh!
enable query rewrite
as
select
	id, n1
from
	t1
where
	n2 is not null
;

-- Another ugh coming up !
insert into t1 values(1,null,null);
update t1 set n1 = 2, n2 = 1 where id = 1;
commit;

begin
        dbms_mview.refresh(
                list           => 'T1_MV',
                method         => 'F',
                atomic_refresh  => true
        );
end;
/

Things I don’t like:

  • Doing anything that is strongly dependent on rowid – I’d rather do my materialized view stuff by primary key … but, as Mick Jagger once told us: “You can’t always get what you want”.
  • Processes that insert an “empty” row and then update it – it’s very inefficient, generates excessive undo and redo, and often leads to row migration

Most significantly I don’t like operations that have worked for ages suddenly crashing when someone adds a new piece of code to the system.

If you run the fragment about, the refresh will fail with the follow string of errors:


ERROR at line 1:
ORA-12008: error in materialized view refresh path
ORA-00001: unique constraint (TEST_USER.I_SNAP$_T1_MV) violated
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2809
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 3025
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2994
ORA-06512: at line 2

When you try to find out what the unique constraint is, it’s based on a hidden column (M_ROW$$) of type rowid that has been added to the materialized view to emulate a primary key; and, for some reason, if your base table is in a bigfile tablespace and you insert a row and then update it before you’ve run a refresh then you can’t do a fast refresh again until you clean up the mess (e.g. with a complete refresh).

Note – if you complete all your inserts, then refresh, you can update a row as many times as you like without causing subsequent refresh problems – it looks like it’s just “insert it, update it, refresh” (or “insert it, delete it, refresh”) that has a problem because the only way to hit the issue is to get a correct rowid (from an update or delete) and an incorrect rowid (from an insert) for the same row in the log at the same time.

Workaround: if you move the base table to a smallfile tablespace you might just find you get what you need.

This shows up on MoS as bug 17603987, fixed in 12.2.

If you want to dig in a little bit more, you can start by comparing the contents of t1 and mlog$_t1 (the materialized view log table). Critically:

SQL> select rowid from t1;

ROWID
------------------
AAAXN2AAAAAAAJEAAA

1 row selected.

SQL> select m_row$$, dmltype$$ from mlog$_t1;

M_ROW$$                          D
-------------------------------- -
AAAXN2AQAAAAAJEAAA               I
AAAXN2AAAAAAAJEAAA               U

2 rows selected.

SQL> select rowidtochar(m_row$$) m_row$$, dmltype$$ from mlog$_t1;

M_ROW$$                          D
-------------------------------- -
AAAXN2AAAAAAAJEAAA               I
AAAXN2AAAAAAAJEAAA               U

2 rows selected.

If you compare the rowid from t1 with the m_row$$ column (which should hold the rowids of rows from t1) from mlog$_t1 you’ll notice that the insert and the update have been given different values for the source row. Somehow the logging code for the insert has generated the wrong rowid value; however, when you apply a rowidtochar() to the wrong value the return value is the right value – and the merge statement that transfers modified data from the source table into the materialized view uses the rowidtochar() function to join the materialized view log back to the source table, with the result that the same rowid can be inserted twice – except the attempt results in a “duplicate key” error.

 

March 20, 2014

RLS bug

Filed under: Bugs,Infrastructure,Oracle — Jonathan Lewis @ 1:21 pm GMT Mar 20,2014

RLS – row level security, aka VPD (virtual private database) or FGAC (fine grained access control) has a critical bug in 11g. The bug is unpublished, but gets mentioned in various other documents, so can be identified as: Bug: 7828323 “SYS_CONTEXTS RETURNS WRONG VALUE WITH SHARED_CONTEXT_SENSITIVE”

The title tells you nearly everything you need to know – if you’ve declared a security policy as context_sensitive or shared_context_sensitive then a change to the context ought to result in the associated predicate function being called to generate a new security predicate the next time the policy becomes relevant. Thanks to bug 7828323 this doesn’t always happen – so queries can return the wrong set of results.

There are some patches for older versions (11.1.0.7 and 11.2.0.2 as far as I’ve checked), but if you don’t have, or can’t get, a patch the “workaround” is to change any relevant policies to dynamic; unfortunately the consequence of this is that the predicate function will then be called for every execution of any statement against any objects protected by that policy.

Depending on how your application has been written and how many queries are likely to invoke security policies this could easily increase your CPU usage by several percent (and if it’s a badly written application maybe a lot more).

Footnote:

It has occurred to me to wonder what happens if you have use a (normal) pl/sql function in a select list, and the function executes a statement against a table, and the table is protected by a context_sensitive security policy – and you decide to use the pl/sql result cache on the function. How long is an item supposed to stay in the result cache, and if it’s longer than a single execution of a statement will the result cache be invalidated if your context changes in a way that invalidates the current security predicate ? No time to check or test at present, though, but I’d be very cautious about putting RLS predicate functions into the result cache until I’ve played around with that idea for a bit.

March 17, 2014

Temporary Segments

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:25 pm GMT Mar 17,2014

A question came up on oracle-l recently about identifying which temporary segment in v$tempseg_usage is associated with which global temporary table. Here’s my answer:

v$tempseg_usage is a synonym for v$sort_usage – which is the view that I still tend to think of first.

v$sort_usage sits on top of x$ktsso – and in x$ktsso the column ktssoobjn is the object number for the definition of the global temporary table (ditto for any indexes on the table). [Addendum: I didn't point it out in the reply, but if you have indexes on the GTT the v$lock will show a TM lock for the table object_id, and TO locks that use the same object id for the TO locks representing the table and its indexes.]

If an internal GTT has appeared because of subquery factoring this is a little broken, but if you look in the execution plan (v$sql_plan / display_cursor) for the query you will find lines like:

|  63 |    HASH JOIN                          |                            |
|  64 |     VIEW                              |                            |
|  65 |      TABLE ACCESS FULL                | SYS_TEMP_0FD9D667C_74A306D |
|  66 |     VIEW                              |                            |
|  67 |      TABLE ACCESS FULL                | SYS_TEMP_0FD9D667B_74A306D |

We will see corresponding TO locks in v$lock (note the negative value):

ADDR             KADDR                   SID TY        ID1        ID2
---------------- ---------------- ---------- -- ---------- ----------
000000008ED8EC68 000000008ED8ECC0        143 TO  -40016261          1
000000008ED8F540 000000008ED8F598        143 TO  -40016260          1

And we note large values for ktssoobjn in x$ktsso:

  KTSSOBNO  KTSSOEXTS  KTSSOBLKS  KTSSORFNO  KTSSOOBJD  KTSSOOBJN KTSSOTSNUM KTSSOSQLID
---------- ---------- ---------- ---------- ---------- ---------- ---------- -------------
     49792          1        128          1    4244096 4254951035          3 anb4mscz7wz71
     49152          1        128          1    4243456 4254951036          3 anb4mscz7wz71

Note the middle of the SYS_TEMP name:

0x0FD9D667C = 4254951036

and then spot the arithmetic:

4254951036 + 40016260 = 4294967296 = 0x100000000 = power(2,32)

v$lock.id1 is a 16 bit wrap of x$ktsso.ktssoobjn, so add it to power(2,32) and you can look for it (for the relevant session) in x$ktsso.

For things like sorts we can check v$sql_workarea_active against v$sort_usage / v$tempseg_usage (though you might stick with x$ktsso to be consistent) since both hold the relative file and block number for the segment.

Follow-up

Shortly after posting a reply to the original question I received an email from Stefan Koehler pointing out that in 11.2.0.2 the object x$ktsso was enhanced to include the SQL_ID of the statement that caused a temporary object to come into existence. (If you check the definition of v$sort_usage/v$tempseg_usage you’ll see that the SQL Id that it’s reporting is the prev_sql_id from V$session).  Stefan also gave me a link to a note that he had written on the topic.

March 14, 2014

12c Temporary

Filed under: 12c,Infrastructure,Oracle,undo — Jonathan Lewis @ 6:41 pm GMT Mar 14,2014

Just one of those little snippets to cover something new and remind of something old. A single session can now have three (or more) temporary tablespaces in use at the same time for different reasons.

  • In 12c you can set parameter temp_undo_enabled to true, at which point the undo relating to global temporary tables (GTTs) will be written into the database default temporary tablespace, which means it won’t generate redo. As an interesting side effect this also means that you can do DML on temporary objects in a physical standby database. Currently the undo segment shows as type “UNDEFINED” in v$sort_usage. A detail to watch out for, though – it looks as if each session may get its own temporary undo segment – so be careful about specifying the extent size and tempfile size for the database default temporary tablespace.
  • In 11g you can specify a tablespace (though not a tablespace group) in the ‘create global temporary table’ statement, which means you keep activity about GTTs separated from the I/O resulting from sorts and hash joins etc. spilling to disc; in fact you could specify a different temporary tablespace for every GTT if you wanted to – and I could imagine a case for having a couple of different temporary tablespaces to accommodate GTTs with very different usage characteristics. (Unfortunately you still can’t specify a tablespace in the dbms_lob.create_temporary() function).  If you don’t specify a tablespace for a GTT it will go into the default temporary tablespace of the user who is using it (not the database default, and not the default for the definer of the GTT). If you create indexes on a GTT they will automatically go into the same tablespace as the table.
  • Finally, of course, there’s the default temporary tablespace for the user and this is where GTTs will go if they don’t have a tablespace specified, and where all the scratch data (sorts, hash tables, factored subqueries et. al.) will go.

This combination means, of course, that you could manage to do a single “insert as select” writing a GTT to one temporary tablespace, with its undo going to a second temporary tablespace, and the spill from a sort or hash join in the select going to a third. The flexibility probably won’t make much difference to performance (for most people), but it’s possible that it will make it easier to monitor where the work is coming from if you’r ever in the position where your single temporary tablespace is subject to a lot of I/O.

Footnote:

In the past I’ve advised DBAs to set up a small number of tablespaces (or tablespace groups) so that they can allocate different classes of users – typically grouped by business function – to different temporary tablespaces. The ability to allocate GTTs to temporary tablespaces allows a further degree of refinement to this strategy.

March 13, 2014

Shrink Space

Filed under: Bugs,deadlocks,fragmentation,Index Rebuilds,Indexing,Locks,Oracle — Jonathan Lewis @ 7:08 am GMT Mar 13,2014

Here’s a lovely effect looking at v$lock (on 11.2.0.4)

select
        sid, type, id1, id2, lmode, request, ctime, block
from
        V$lock
where   sid in (
                select  sid
                from    V$session
                where   username = 'TEST_USER'
        )
order by
        sid, type desc
;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
       145 TX     589833       7518          6          0        489          0
           TM      92275          0          2          6        489          1
           SK          7   25165955          6          0        489          0
           AE        100          0          4          0        582          0

       148 TX     524308       7383          6          0        490          0
           TM      92275          0          2          6        490          1
           SK          7   25173379          6          0        490          0
           AE        100          0          4          0        998          0

You’ll notice I’ve got two sessions holding a TM lock on the same table (object 92275) in mode 2 (sub-share) and waiting for an exclusive lock on the same table. Both sessions are recording the fact that they are blocking something else. You’ll have trust me when I say there are no other user sessions on the system at this point, and none of the background sessions is doing anything with that table.

The clue to what’s happening is the SK lock – it’s the “segment shrink” lock. I had two sessions start an “alter index I_n shrink space” (two different indexes on the same table) at the same time. The problem is that “shrink space” without the “compact” tries to drop the highwater mark on the index’ allocated space after completing the shrink phase – and there’s a defect in the internal code that tries to get the required exclusive lock on the underlying table: it doesn’t seem to allow for all the possible ways you can fail to get the lock. If you look at v$session_wait_history for either of these sessions, you’ll see something like the following:


  SID  SEQ#     EVENT# EVENT                            P1         P2         P3  WAIT_TIME WAIT_TIME_MICRO TIME_SINCE_LAST_WAIT_MICRO
----- ----- ---------- ------------------------ ---------- ---------- ---------- ---------- --------------- --------------------------
  145     1        798 Wait for shrink lock              0          0          0         10           99954                      70137
          2        235 enq: TM - contention     1414332422      92275          0        600         6002022                        101
          3        798 Wait for shrink lock              0          0          0         10          100723                      69335
          4        235 enq: TM - contention     1414332422      92275          0        600         6001589                         81
          5        798 Wait for shrink lock              0          0          0         10          100500                      69542
          6        235 enq: TM - contention     1414332422      92275          0        600         6002352                         86
          7        798 Wait for shrink lock              0          0          0         10          100618                      69145
          8        235 enq: TM - contention     1414332422      92275          0        600         6001545                        144
          9        798 Wait for shrink lock              0          0          0         10          100996                      69445
         10        235 enq: TM - contention     1414332422      92275          0        600         6002744                        310

The attempt to acquire the TM enqueue (lock) times out every three seconds – and I think the session then releases and re-acquires the SK lock before trying to re-acquire the TM lock – and it’s never going to get it.

The alert log, by the way, looked like this:


Wed Mar 12 12:53:27 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:30 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
Wed Mar 12 12:53:40 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:43 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.

I’ve said it before, and I keep repeating it when people say “Oracle resolves deadlocks automatically”: Oracle does NOT resolve deadlocks automatically – one of the sessions will rollback its last DML statement to clear the deadlock, but the other session will (almost invariably) still be waiting. It’s up to the application to do something sensible to resolve the deadlock after it receives the ORA-00060 error.

Don’t ask about the trace files – but they had both reached 400MB by the time I finished this note.

Strategy Note

If you are going to shrink objects, it’s probably best to do it in two steps: “shrink space compact, followed by “shrink space”.
If you’re going to try to use multiple sessions to shrink several indexes as quickly as possible, make sure there’s no way that two sessions can try to “shrink space” on the indexes on the same table at the same time.

Footnote:

This article was prompted by the recent appearance of bug 18388128 on MoS.

March 10, 2014

Duplicate database

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:19 pm GMT Mar 10,2014

One of the people attending my seminar in Munich last week has emailed me some details about a nasty little surprise you may get if you’re thinking about TSPITR (tablespace point in time recovery), and happen to have a few materialized views in your database.

You might have wanted to use the “duplicate target database”  from rman with the “skip tablespace” option to create a duplicate database, excluding tablespaces that you’re not interested in, if you needed to recover a particular tablespace (or set of tablespaces) to a specific point in time.  (Once you’ve done that you can extract the data that you want and then discard the duplicate database). But there’s a requirement for recovery is that the set of tablespaces should be “self-contained”; but what does “self-contained” mean ?

Amongst other things it means that none of the tablespace you skip should contain materialized views. Note carefully, that’s not saying the schema you’re interested in shouldn’t have created any materialized view, or the tablespace you want to duplicate contains a base table for a materialized view in another table; it really does mean – if you’ve got ANY materialized view ANYWHERE in the database, you have to duplicate those tablespaces as part of the process.

Here’s the restriction note from MoS (note the exclamation mark – maybe the Oracle analyst was surprised too):

Restriction Note: 
You MUST NOT exclude 
- SYS-owned objects 
- or tablespaces with rollback segments, 
- nor tablespaces containing “MATERIALIZED VIEWS”! 

Implementation suggestion – always put materialized views (and materialized view logs, and indexes on materialized views) in their own tablespace(s), just in case one day you want to do a tablespace point in time recovery and find you’ve got a few bits of materialized views scattered all around your database.

Footnote:

When I first heard this comment I didn’t believe it (but take a look at MoS document ID: 1287276.1 if you’re finding it hard to believe). Naturally my engine of doom went into overdrive immediately after I was convinced and made me wonder what would happen in a 12c container database with several plugged databases; obviously the existence of a materialized view in one plugged database shouldn’t have any impact on TSPITR for another pluggable database – but I wouldn’t mind if someone tested the hypothesis and reported back what they found.

Update:

It’s just occurred to me that this type of TSPITR problem simply won’t exist if you’re using Delphix as part of your working environment.

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm GMT Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:


create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)
    )
;

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; actual calculated values for the function call are stored in the index but not on the table itself – but it’s important that the optimizer has the statistics about the non-existent column values.

So what happens when Oracle collects table statistics – if you’ve enable the approximate NDV feature Oracle does a 100% sample, which means it has to call the function for every single row in the table. You will appreciate that the decode(instr()) function on the LOB column is going to read every single LOB in turn from the table – it’s not surprising that the time taken to calculate stats on the table jumped from a few minutes to a couple of hours. What did surprise me was that my call to dbms_lob.getlength() also seemed to read every lob in my example rather than reading the “LOB Locator” data that’s stored in the row – one day I’ll take a look into why that happened.

Looking at these examples it’s probably safe to conclude that if you really need to index some very small piece of “flag” information from a LOB it’s probably best to store it as a real column on the table – perhaps populated through a trigger so you don’t have to trust every single piece of front-end code to keep it up to date. (It would be quite nice if Oracle gave us the option for a “derived” column – i.e. one that could be defined in the same sort of way as a virtual column, with the difference that it should be stored in the table.)

So virtual columns based on LOBs can create a performance problem for the approximate NDV mechanism;  but the story doesn’t stop there because there’s another “less commonly used” feature of Oracle that introduces a different threat – with no workaround – it’s the index organized table (IOT). Here’s a basic example:

create table iot1 (
        id1	number(7,0),
	id2	number(7,0),
	v1	varchar2(10),
	v2	varchar2(10),
	padding	varchar2(500),
        constraint iot1_pk primary key(id1, id2)
)
organization index
including id2
overflow
;

insert into iot1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
)
select
        mod(rownum,20)                  id1,
        trunc(rownum,100)               id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum,100))      v2,
        rpad('x',500,'x')               padding
from
	generator	v1,
	generator	v2
where
	rownum <= 1e5
;

commit;

alter system flush buffer_cache;

alter session set events '10046 trace name context forever';

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 =>'IOT1',
		method_opt	 => 'for all columns size 1'
	);
end;
/

alter session set events '10046 trace name context off';

You’ll notice I’ve created the table then inserted the data – if I did a “create table as select” Oracle would have sorted the data before inserting it, and that would have helped to hide the problem I’m trying to demonstrate. As it is my overflow segment is very badly ordered relative to the “top” (i.e. index) segment – in fact I can see after I’ve collected stats on the table that the clustering_factor on the index is 100,000 – an exact match for the rows in the table.

Running 11.2.0.4, with a 1MB uniform extent, freelist management, and 8KB block size the index segment held 279 leaf blocks, the overflow segment (reported in view user_tables as SYS_IOT_OVER_81594) held 7,144 data blocks.

So what interesting things do we find in a 10046 trace file after gathering stats – here are the key details from the tkprof results:

SQL ID: 7ak95sy9m1s4f Plan Hash: 1508788224

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
  no_substrb_pad  */to_char(count("ID1")),to_char(substrb(dump(min("ID1"),16,
  0,32),1,120)),to_char(substrb(dump(max("ID1"),16,0,32),1,120)),
  to_char(count("ID2")),to_char(substrb(dump(min("ID2"),16,0,32),1,120)),
  to_char(substrb(dump(max("ID2"),16,0,32),1,120)),to_char(count("V1")),
  to_char(substrb(dump(min("V1"),16,0,32),1,120)),
  to_char(substrb(dump(max("V1"),16,0,32),1,120)),to_char(count("V2")),
  to_char(substrb(dump(min("V2"),16,0,32),1,120)),
  to_char(substrb(dump(max("V2"),16,0,32),1,120)),to_char(count("PADDING")),
  to_char(substrb(dump(min("PADDING"),16,0,32),1,120)),
  to_char(substrb(dump(max("PADDING"),16,0,32),1,120))
from
 "TEST_USER"."IOT1" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
  NIL,NIL*/

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.37       0.37       7423     107705          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.37       0.37       7423     107705          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=107705 pr=7423 pw=0 time=377008 us)
    100000     100000     100000   APPROXIMATE NDV AGGREGATE (cr=107705 pr=7423 pw=0 time=426437 us cost=10 size=23944 card=82)
    100000     100000     100000    INDEX FAST FULL SCAN IOT1_PK (cr=107705 pr=7423 pw=0 time=298380 us cost=10 size=23944 card=82)(object id 85913)

********************************************************************************

SQL ID: 1ca2ug8s3mm5z Plan Hash: 2571749554

select /*+  no_parallel_index(t, "IOT1_PK")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad
  no_expand index(t,"IOT1_PK") */ count(*) as nrw,count(distinct
  sys_op_lbid(85913,'L',t.rowid)) as nlb,null as ndk,
  sys_op_countchg(sys_op_lbid(85913,'O',"V1"),1) as clf
from
 "TEST_USER"."IOT1" t where "ID1" is not null or "ID2" is not null

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.16       0.16          0     100280          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.16       0.16          0     100280          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT GROUP BY (cr=100280 pr=0 pw=0 time=162739 us)
    100000     100000     100000   INDEX FULL SCAN IOT1_PK (cr=100280 pr=0 pw=0 time=164597 us cost=6 size=5900000 card=100000)(object id 85913)

The first query collects table and column stats, and we can see that the approximate NDV method has been used because of the trailing text: /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/. In this statement the hint /*+ full(t) */ has been interpreted to mean an index fast full scan, which is what we see in the execution plan. Although there are only 279 blocks in the index and 7,144 blocks in the overflow we’ve done a little over 100,000 buffer visits because for every index entry in the IOT top we’ve done a “fetch by rowid” into the overflow segment (the session stats records these as “table fetch continued row”). Luckily I had a small table so all those visits were buffer gets; on a very large table it’s quite possible that a significant fraction of those buffer gets will turn into single block physical reads.

Not only have we done one buffer visit per row to allow us to calculate the approximate NDV for the table columns, we’ve done the same all over again so that we can calculate the clustering_factor of the index. This is a little surprising since the “rowid” for an item in the overflow section is stored in the index segment but (as you can see in the second query in the tkprof output) Oracle has used column v1 (the first in the overflow segment) in the call to the sys_op_countchg() function where the equivalent call for an ordinary index would use t.rowid so, presumably, the code HAS to access the overflow segment. The really strange thing about this is that the same SQL statement has a call to sys_op_lbid() which uses the (not supposed to exist in IOTs) rowid – so it looks as if it ought to be possible for sys_op_countchg() to do the same.

So – big warning on upgrading to 11g: if you’ve got IOTs with overflows and you switch to auto_sample_size and enable approximate NDV then the time taken to gather stats on those IOTs may (depending to a large extent on the data clustering) take much longer than it used to.

Next Page »

Theme: Rubric. Get a free blog at WordPress.com

Follow

Get every new post delivered to your Inbox.

Join 4,298 other followers