Oracle Scratchpad

March 20, 2014

RLS bug

Filed under: Bugs,Infrastructure,Oracle — Jonathan Lewis @ 1:21 pm BST Mar 20,2014

RLS – row level security, aka VPD (virtual private database) or FGAC (fine grained access control) has a critical bug in 11g. The bug is unpublished, but gets mentioned in various other documents, so can be identified as: Bug: 7828323 “SYS_CONTEXTS RETURNS WRONG VALUE WITH SHARED_CONTEXT_SENSITIVE”

The title tells you nearly everything you need to know – if you’ve declared a security policy as context_sensitive or shared_context_sensitive then a change to the context ought to result in the associated predicate function being called to generate a new security predicate the next time the policy becomes relevant. Thanks to bug 7828323 this doesn’t always happen – so queries can return the wrong set of results.

There are some patches for older versions ( and as far as I’ve checked), but if you don’t have, or can’t get, a patch the “workaround” is to change any relevant policies to dynamic; unfortunately the consequence of this is that the predicate function will then be called for every execution of any statement against any objects protected by that policy.

Depending on how your application has been written and how many queries are likely to invoke security policies this could easily increase your CPU usage by several percent (and if it’s a badly written application maybe a lot more).


It has occurred to me to wonder what happens if you have use a (normal) pl/sql function in a select list, and the function executes a statement against a table, and the table is protected by a context_sensitive security policy – and you decide to use the pl/sql result cache on the function. How long is an item supposed to stay in the result cache, and if it’s longer than a single execution of a statement will the result cache be invalidated if your context changes in a way that invalidates the current security predicate ? No time to check or test at present, though, but I’d be very cautious about putting RLS predicate functions into the result cache until I’ve played around with that idea for a bit.

March 17, 2014

Temporary Segments

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:25 pm BST Mar 17,2014

A question came up on oracle-l recently about identifying which temporary segment in v$tempseg_usage is associated with which global temporary table. Here’s my answer:

v$tempseg_usage is a synonym for v$sort_usage – which is the view that I still tend to think of first.

v$sort_usage sits on top of x$ktsso – and in x$ktsso the column ktssoobjn is the object number for the definition of the global temporary table (ditto for any indexes on the table). [Addendum: I didn't point it out in the reply, but if you have indexes on the GTT the v$lock will show a TM lock for the table object_id, and TO locks that use the same object id for the TO locks representing the table and its indexes.]

If an internal GTT has appeared because of subquery factoring this is a little broken, but if you look in the execution plan (v$sql_plan / display_cursor) for the query you will find lines like:

|  63 |    HASH JOIN                          |                            |
|  64 |     VIEW                              |                            |
|  65 |      TABLE ACCESS FULL                | SYS_TEMP_0FD9D667C_74A306D |
|  66 |     VIEW                              |                            |
|  67 |      TABLE ACCESS FULL                | SYS_TEMP_0FD9D667B_74A306D |

We will see corresponding TO locks in v$lock (note the negative value):

ADDR             KADDR                   SID TY        ID1        ID2
---------------- ---------------- ---------- -- ---------- ----------
000000008ED8EC68 000000008ED8ECC0        143 TO  -40016261          1
000000008ED8F540 000000008ED8F598        143 TO  -40016260          1

And we note large values for ktssoobjn in x$ktsso:

---------- ---------- ---------- ---------- ---------- ---------- ---------- -------------
     49792          1        128          1    4244096 4254951035          3 anb4mscz7wz71
     49152          1        128          1    4243456 4254951036          3 anb4mscz7wz71

Note the middle of the SYS_TEMP name:

0x0FD9D667C = 4254951036

and then spot the arithmetic:

4254951036 + 40016260 = 4294967296 = 0x100000000 = power(2,32)

v$lock.id1 is a 16 bit wrap of x$ktsso.ktssoobjn, so add it to power(2,32) and you can look for it (for the relevant session) in x$ktsso.

For things like sorts we can check v$sql_workarea_active against v$sort_usage / v$tempseg_usage (though you might stick with x$ktsso to be consistent) since both hold the relative file and block number for the segment.


Shortly after posting a reply to the original question I received an email from Stefan Koehler pointing out that in the object x$ktsso was enhanced to include the SQL_ID of the statement that caused a temporary object to come into existence. (If you check the definition of v$sort_usage/v$tempseg_usage you’ll see that the SQL Id that it’s reporting is the prev_sql_id from V$session).  Stefan also gave me a link to a note that he had written on the topic.

March 14, 2014

12c Temporary

Filed under: 12c,Infrastructure,Oracle,undo — Jonathan Lewis @ 6:41 pm BST Mar 14,2014

Just one of those little snippets to cover something new and remind of something old. A single session can now have three (or more) temporary tablespaces in use at the same time for different reasons.

  • In 12c you can set parameter temp_undo_enabled to true, at which point the undo relating to global temporary tables (GTTs) will be written into the database default temporary tablespace, which means it won’t generate redo. As an interesting side effect this also means that you can do DML on temporary objects in a physical standby database. Currently the undo segment shows as type “UNDEFINED” in v$sort_usage. A detail to watch out for, though – it looks as if each session may get its own temporary undo segment – so be careful about specifying the extent size and tempfile size for the database default temporary tablespace.
  • In 11g you can specify a tablespace (though not a tablespace group) in the ‘create global temporary table’ statement, which means you keep activity about GTTs separated from the I/O resulting from sorts and hash joins etc. spilling to disc; in fact you could specify a different temporary tablespace for every GTT if you wanted to – and I could imagine a case for having a couple of different temporary tablespaces to accommodate GTTs with very different usage characteristics. (Unfortunately you still can’t specify a tablespace in the dbms_lob.create_temporary() function).  If you don’t specify a tablespace for a GTT it will go into the default temporary tablespace of the user who is using it (not the database default, and not the default for the definer of the GTT). If you create indexes on a GTT they will automatically go into the same tablespace as the table.
  • Finally, of course, there’s the default temporary tablespace for the user and this is where GTTs will go if they don’t have a tablespace specified, and where all the scratch data (sorts, hash tables, factored subqueries et. al.) will go.

This combination means, of course, that you could manage to do a single “insert as select” writing a GTT to one temporary tablespace, with its undo going to a second temporary tablespace, and the spill from a sort or hash join in the select going to a third. The flexibility probably won’t make much difference to performance (for most people), but it’s possible that it will make it easier to monitor where the work is coming from if you’r ever in the position where your single temporary tablespace is subject to a lot of I/O.


In the past I’ve advised DBAs to set up a small number of tablespaces (or tablespace groups) so that they can allocate different classes of users – typically grouped by business function – to different temporary tablespaces. The ability to allocate GTTs to temporary tablespaces allows a further degree of refinement to this strategy.

March 13, 2014

Shrink Space

Filed under: Bugs,deadlocks,fragmentation,Index Rebuilds,Indexing,Locks,Oracle — Jonathan Lewis @ 7:08 am BST Mar 13,2014

Here’s a lovely effect looking at v$lock (on

        sid, type, id1, id2, lmode, request, ctime, block
where   sid in (
                select  sid
                from    V$session
                where   username = 'TEST_USER'
order by
        sid, type desc

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- -- ---------- ---------- ---------- ---------- ---------- ----------
       145 TX     589833       7518          6          0        489          0
           TM      92275          0          2          6        489          1
           SK          7   25165955          6          0        489          0
           AE        100          0          4          0        582          0

       148 TX     524308       7383          6          0        490          0
           TM      92275          0          2          6        490          1
           SK          7   25173379          6          0        490          0
           AE        100          0          4          0        998          0

You’ll notice I’ve got two sessions holding a TM lock on the same table (object 92275) in mode 2 (sub-share) and waiting for an exclusive lock on the same table. Both sessions are recording the fact that they are blocking something else. You’ll have trust me when I say there are no other user sessions on the system at this point, and none of the background sessions is doing anything with that table.

The clue to what’s happening is the SK lock – it’s the “segment shrink” lock. I had two sessions start an “alter index I_n shrink space” (two different indexes on the same table) at the same time. The problem is that “shrink space” without the “compact” tries to drop the highwater mark on the index’ allocated space after completing the shrink phase – and there’s a defect in the internal code that tries to get the required exclusive lock on the underlying table: it doesn’t seem to allow for all the possible ways you can fail to get the lock. If you look at v$session_wait_history for either of these sessions, you’ll see something like the following:

  SID  SEQ#     EVENT# EVENT                            P1         P2         P3  WAIT_TIME WAIT_TIME_MICRO TIME_SINCE_LAST_WAIT_MICRO
----- ----- ---------- ------------------------ ---------- ---------- ---------- ---------- --------------- --------------------------
  145     1        798 Wait for shrink lock              0          0          0         10           99954                      70137
          2        235 enq: TM - contention     1414332422      92275          0        600         6002022                        101
          3        798 Wait for shrink lock              0          0          0         10          100723                      69335
          4        235 enq: TM - contention     1414332422      92275          0        600         6001589                         81
          5        798 Wait for shrink lock              0          0          0         10          100500                      69542
          6        235 enq: TM - contention     1414332422      92275          0        600         6002352                         86
          7        798 Wait for shrink lock              0          0          0         10          100618                      69145
          8        235 enq: TM - contention     1414332422      92275          0        600         6001545                        144
          9        798 Wait for shrink lock              0          0          0         10          100996                      69445
         10        235 enq: TM - contention     1414332422      92275          0        600         6002744                        310

The attempt to acquire the TM enqueue (lock) times out every three seconds – and I think the session then releases and re-acquires the SK lock before trying to re-acquire the TM lock – and it’s never going to get it.

The alert log, by the way, looked like this:

Wed Mar 12 12:53:27 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:30 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
Wed Mar 12 12:53:40 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.
Wed Mar 12 12:53:43 2014
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_1832.trc.
ORA-00060: Deadlock detected. More info in file /u01/app/oracle/diag/rdbms/test/test/trace/test_ora_2242.trc.

I’ve said it before, and I keep repeating it when people say “Oracle resolves deadlocks automatically”: Oracle does NOT resolve deadlocks automatically – one of the sessions will rollback its last DML statement to clear the deadlock, but the other session will (almost invariably) still be waiting. It’s up to the application to do something sensible to resolve the deadlock after it receives the ORA-00060 error.

Don’t ask about the trace files – but they had both reached 400MB by the time I finished this note.

Strategy Note

If you are going to shrink objects, it’s probably best to do it in two steps: “shrink space compact, followed by “shrink space”.
If you’re going to try to use multiple sessions to shrink several indexes as quickly as possible, make sure there’s no way that two sessions can try to “shrink space” on the indexes on the same table at the same time.


This article was prompted by the recent appearance of bug 18388128 on MoS.

March 10, 2014

Duplicate database

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:19 pm BST Mar 10,2014

One of the people attending my seminar in Munich last week has emailed me some details about a nasty little surprise you may get if you’re thinking about TSPITR (tablespace point in time recovery), and happen to have a few materialized views in your database.

You might have wanted to use the “duplicate target database”  from rman with the “skip tablespace” option to create a duplicate database, excluding tablespaces that you’re not interested in, if you needed to recover a particular tablespace (or set of tablespaces) to a specific point in time.  (Once you’ve done that you can extract the data that you want and then discard the duplicate database). But there’s a requirement for recovery is that the set of tablespaces should be “self-contained”; but what does “self-contained” mean ?

Amongst other things it means that none of the tablespace you skip should contain materialized views. Note carefully, that’s not saying the schema you’re interested in shouldn’t have created any materialized view, or the tablespace you want to duplicate contains a base table for a materialized view in another table; it really does mean – if you’ve got ANY materialized view ANYWHERE in the database, you have to duplicate those tablespaces as part of the process.

Here’s the restriction note from MoS (note the exclamation mark – maybe the Oracle analyst was surprised too):

Restriction Note: 
You MUST NOT exclude 
- SYS-owned objects 
- or tablespaces with rollback segments, 
- nor tablespaces containing “MATERIALIZED VIEWS”! 

Implementation suggestion – always put materialized views (and materialized view logs, and indexes on materialized views) in their own tablespace(s), just in case one day you want to do a tablespace point in time recovery and find you’ve got a few bits of materialized views scattered all around your database.


When I first heard this comment I didn’t believe it (but take a look at MoS document ID: 1287276.1 if you’re finding it hard to believe). Naturally my engine of doom went into overdrive immediately after I was convinced and made me wonder what would happen in a 12c container database with several plugged databases; obviously the existence of a materialized view in one plugged database shouldn’t have any impact on TSPITR for another pluggable database – but I wouldn’t mind if someone tested the hypothesis and reported back what they found.


It’s just occurred to me that this type of TSPITR problem simply won’t exist if you’re using Delphix as part of your working environment.

March 2, 2014

Auto Sample Size

Filed under: Function based indexes,Indexing,Infrastructure,IOT,LOBs,Oracle,Statistics — Jonathan Lewis @ 6:38 pm BST Mar 2,2014

In the past I have enthused mightily about the benefits of the approximate NDV mechanism and the benefit of using auto_sample_size to collect statistics in 11g; however, as so often happens with Oracle features, there’s a down-side or boundary condition, or edge case. I’ve already picked this up once as an addendum to an earlier blog note on virtual stats, which linked to an article on OTN describing how the time taken to collect stats on a table increased dramatically after the addition of an index – where the index had this definition:

create bitmap index i_s_rmp_eval_csc_msg_actions on
    s_rmp_evaluation_csc_message (
        decode(instr(xml_message_text,' '),0,0,1)

As you might guess from the column name, this is an index based on an XML column, which is stored as a CLOB.

In a similar vein, I showed you a few days ago an old example I had of indexing a CLOB column with a call to dbms_lob.getlength(). Both index examples suffer from the same problem – to support the index Oracle creates a hidden (virtual) column on the table that can be used to hold statistics about the values of the function; actual calculated values for the function call are stored in the index but not on the table itself – but it’s important that the optimizer has the statistics about the non-existent column values.

So what happens when Oracle collects table statistics – if you’ve enable the approximate NDV feature Oracle does a 100% sample, which means it has to call the function for every single row in the table. You will appreciate that the decode(instr()) function on the LOB column is going to read every single LOB in turn from the table – it’s not surprising that the time taken to calculate stats on the table jumped from a few minutes to a couple of hours. What did surprise me was that my call to dbms_lob.getlength() also seemed to read every lob in my example rather than reading the “LOB Locator” data that’s stored in the row – one day I’ll take a look into why that happened.

Looking at these examples it’s probably safe to conclude that if you really need to index some very small piece of “flag” information from a LOB it’s probably best to store it as a real column on the table – perhaps populated through a trigger so you don’t have to trust every single piece of front-end code to keep it up to date. (It would be quite nice if Oracle gave us the option for a “derived” column – i.e. one that could be defined in the same sort of way as a virtual column, with the difference that it should be stored in the table.)

So virtual columns based on LOBs can create a performance problem for the approximate NDV mechanism;  but the story doesn’t stop there because there’s another “less commonly used” feature of Oracle that introduces a different threat – with no workaround – it’s the index organized table (IOT). Here’s a basic example:

create table iot1 (
        id1	number(7,0),
	id2	number(7,0),
	v1	varchar2(10),
	v2	varchar2(10),
	padding	varchar2(500),
        constraint iot1_pk primary key(id1, id2)
organization index
including id2

insert into iot1
with generator as (
	select	--+ materialize
		rownum id
	from dual
	connect by
		level <= 1e4
        mod(rownum,20)                  id1,
        trunc(rownum,100)               id2,
        to_char(mod(rownum,20))         v1,
        to_char(trunc(rownum,100))      v2,
        rpad('x',500,'x')               padding
	generator	v1,
	generator	v2
	rownum <= 1e5


alter system flush buffer_cache;

alter session set events '10046 trace name context forever';

		ownname		 => user,
		tabname		 =>'IOT1',
		method_opt	 => 'for all columns size 1'

alter session set events '10046 trace name context off';

You’ll notice I’ve created the table then inserted the data – if I did a “create table as select” Oracle would have sorted the data before inserting it, and that would have helped to hide the problem I’m trying to demonstrate. As it is my overflow segment is very badly ordered relative to the “top” (i.e. index) segment – in fact I can see after I’ve collected stats on the table that the clustering_factor on the index is 100,000 – an exact match for the rows in the table.

Running, with a 1MB uniform extent, freelist management, and 8KB block size the index segment held 279 leaf blocks, the overflow segment (reported in view user_tables as SYS_IOT_OVER_81594) held 7,144 data blocks.

So what interesting things do we find in a 10046 trace file after gathering stats – here are the key details from the tkprof results:

SQL ID: 7ak95sy9m1s4f Plan Hash: 1508788224

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
  no_substrb_pad  */to_char(count("ID1")),to_char(substrb(dump(min("ID1"),16,

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.37       0.37       7423     107705          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.37       0.37       7423     107705          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=107705 pr=7423 pw=0 time=377008 us)
    100000     100000     100000   APPROXIMATE NDV AGGREGATE (cr=107705 pr=7423 pw=0 time=426437 us cost=10 size=23944 card=82)
    100000     100000     100000    INDEX FAST FULL SCAN IOT1_PK (cr=107705 pr=7423 pw=0 time=298380 us cost=10 size=23944 card=82)(object id 85913)


SQL ID: 1ca2ug8s3mm5z Plan Hash: 2571749554

select /*+  no_parallel_index(t, "IOT1_PK")  dbms_stats cursor_sharing_exact
  use_weak_name_resl dynamic_sampling(0) no_monitoring no_substrb_pad
  no_expand index(t,"IOT1_PK") */ count(*) as nrw,count(distinct
  sys_op_lbid(85913,'L',t.rowid)) as nlb,null as ndk,
  sys_op_countchg(sys_op_lbid(85913,'O',"V1"),1) as clf
 "TEST_USER"."IOT1" t where "ID1" is not null or "ID2" is not null

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      0.16       0.16          0     100280          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      0.16       0.16          0     100280          0           1

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 62     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT GROUP BY (cr=100280 pr=0 pw=0 time=162739 us)
    100000     100000     100000   INDEX FULL SCAN IOT1_PK (cr=100280 pr=0 pw=0 time=164597 us cost=6 size=5900000 card=100000)(object id 85913)

The first query collects table and column stats, and we can see that the approximate NDV method has been used because of the trailing text: /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL*/. In this statement the hint /*+ full(t) */ has been interpreted to mean an index fast full scan, which is what we see in the execution plan. Although there are only 279 blocks in the index and 7,144 blocks in the overflow we’ve done a little over 100,000 buffer visits because for every index entry in the IOT top we’ve done a “fetch by rowid” into the overflow segment (the session stats records these as “table fetch continued row”). Luckily I had a small table so all those visits were buffer gets; on a very large table it’s quite possible that a significant fraction of those buffer gets will turn into single block physical reads.

Not only have we done one buffer visit per row to allow us to calculate the approximate NDV for the table columns, we’ve done the same all over again so that we can calculate the clustering_factor of the index. This is a little surprising since the “rowid” for an item in the overflow section is stored in the index segment but (as you can see in the second query in the tkprof output) Oracle has used column v1 (the first in the overflow segment) in the call to the sys_op_countchg() function where the equivalent call for an ordinary index would use t.rowid so, presumably, the code HAS to access the overflow segment. The really strange thing about this is that the same SQL statement has a call to sys_op_lbid() which uses the (not supposed to exist in IOTs) rowid – so it looks as if it ought to be possible for sys_op_countchg() to do the same.

So – big warning on upgrading to 11g: if you’ve got IOTs with overflows and you switch to auto_sample_size and enable approximate NDV then the time taken to gather stats on those IOTs may (depending to a large extent on the data clustering) take much longer than it used to.

February 21, 2014

Indexing LOBs

Filed under: Function based indexes,Indexing,Infrastructure,LOBs,Oracle — Jonathan Lewis @ 6:42 pm BST Feb 21,2014

Many years ago, possibly when most sites were still using Oracle 8i, a possible solution to a particular customer problem was to create a function-based index on a CLOB column using the dbms_lob.getlength() function call. I can’t find the notes explaining why this was necessary (I usually have some sort of clue – such as the client name – in the script, but in this case all I had was a comment that “the manuals say you can’t do this, but it works provided you wrap the dbms_lob call inside a deterministic function”).

I never worked out why the dbms_lob.getlength() function wasn’t declared as deterministic – especially since it came complete with a most restrictive restricts_references pragma – so I had just assumed there was probably some good reason based on strange side effects when national language charactersets came into play. But here’s a little detail I noticed recently about the dbms_lob.getlength() function: it became deterministic in 11g, so if the client decided to implement my suggestion (which included the usual sorts of warnings) it’s now legal !

Footnote – the length() function has been deterministic and usable with LOBs for a long time, certainly since late 9i, but in 8i length(lob_col) will produce Oracle error “ORA-00932: inconsistent datatypes”

Index Compression – aargh

Filed under: Bugs,compression,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 7:57 am BST Feb 21,2014

The problem with telling people that some feature of Oracle is a “good thing” is that some of those people will go ahead and use it; and if enough people use it some of them will discover a hitherto undiscovered defect. Almost inevitably the bug will turn out to be one of those “combinations” bugs that leaves you thinking: “Why the {insert preferred expression of disbelief here} should {feature X} have anything to do with {feature Y}”.

Here – based on index compression, as you may have guessed from the title – is one such bug. I got it first on, but it’s still there on and

create table t1 (
	id1	number not null,
	id2	number not null,
	n1	number,
	v1	varchar2(10)

alter table t1 add constraint t1_pk primary key(id1, id2) using index (
	create index t1_pk on t1(id1, id2) compress 1

create table t2(
	id1	number not null,
	id2	number not null,
	id3	number not null,
	n1	number,
	v1	varchar2(10)

alter table t2 add constraint t2_fk_t1 foreign key(id1, id2) references t1;

It’s quite simple – I’ve got a multi-column primary key, and it’s worth compressing on the first column because that column is fairly repetitive. Then I’ve created another table that has a foreign key constraint referencing my first table. Because I’ve got some replication going on and want to enable parallelism I’ve enabled rowdependencies all over the place. So let’s insert a couple of rows and see what happens – the next bit of text is cut-n-pasted from an SQL*Plus session running a script after a call to set echo on:

SQL> insert into t1 values(1,1,1,'x');

1 row created.

SQL> commit;

Commit complete.

SQL> insert into t2 values(1,1,1,1,'x');
insert into t2 values(1,1,1,1,'x')
ERROR at line 1:
ORA-00600: internal error code, arguments: [25027], [5], [394178], [], [], [], [], [], [], [], [], []

For further details, and before you get completely thrilled at the possibility of compressing lots of indexes, keep an eye on:


“Inserting into a table with a foreign key where the base table has a primary key using index key compression and the table also has row dependencies enabled. Stack will include kdsgrds and kdiexi0 (in 12) / kdiexi (in 11, 10)”

As implied by that note from the bug, it also affects 10g.  The bug note reports it as fixed in 12.2.

February 10, 2014

Row Migration

Filed under: Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 6:55 pm BST Feb 10,2014

At one of the presentations I attended at RMOUG this year the presenter claimed that if a row kept increasing in size and had to migrate from block to block as a consequence then each migration of that row would leave a pointer in the previous block so that an indexed access to the row would start at the original table block and have to follow an ever growing chain of pointers to reach the data.

This is not correct, and it’s worth making a little fuss about the error since it’s the sort of thing that can easily become an urban legend that results in people rebuilding tables “for performance” when they don’t need to.

Oracle behaves quite intelligently with migrated rows. First, the migrated row has a pointer back to the original location and if the row has to migrate a second time the first place that Oracle checks for space is the original block, so the row might “de-migrate” itself; however, even if it can’t migrate back to the original block, it will still revisit the original block to change the pointer in that block to refer to the block it has moved on to – so the row is never more than one step away from its original location. As a quick demonstration, here’s some code to generate and manipulate some data:

create table t1 (
id	number(6,0),
v1	varchar2(1200)
pctfree 0

prompt	==========================================
prompt	The following code fits 74 rows to a block
prompt	==========================================

insert into t1 select rownum - 1, rpad('x',100) from all_objects where rownum <= 75;

prompt	======================================
prompt	Make the first row migrate and dump it
prompt	======================================

update t1 set v1 = rpad('x',400) where id = 0;

alter system flush buffer_cache;
execute dump_seg('t1',2)

prompt	===========================================================
prompt	Fill the block the long row is now in, force it to migrate,
prompt	then dump it again.
prompt	===========================================================

insert into t1 select rownum + 75, rpad('x',100) from all_objects where rownum <= 75;

update t1 set v1 = rpad('x',800) where id = 0;

alter system flush buffer_cache;
execute dump_seg('t1',3)

prompt	========================================================
prompt	Fill the block the long row is now in and shrink the row
prompt	to see if it returns to its original block. (It won't.)
prompt	========================================================

insert into t1 select rownum + 150, rpad('x',100) from all_objects where rownum <= 75;

update t1 set v1 = rpad('x',50) where id = 0;

alter system flush buffer_cache;
execute dump_seg('t1',3)

prompt	========================================================
prompt	Make a lot of space in the first block and force the row
prompt	to migrate again to see if it migrates back. (It does.)
prompt	========================================================

delete from t1 where id between 1 and 20;

update t1 set v1 = rpad('x',1200) where id = 0;

alter system flush buffer_cache;
execute dump_seg('t1',3)

My test database was using 8KB blocks (hence the 74 rows per block), and 1MB uniform extents with freelist management. The procedure dump_seg() takes a segment name as its first parameter and a number of blocks as the second (then the segment type and starting block as the third and fourth) and dumps the first N data blocks of the segment. To demonstrate what goes on, I’ve extracted the content of the first row (id = 0) after each of the four dumps:

After the first update - the column count (cc) is zero and the "next rowid" (nrid) is row 1 of block 0x0140000b

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x0140000b.1

After the second update - the next rowid is row 7 of block 0x0140000c

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x1  cc: 0
nrid:  0x0140000c.7

After the third update (shrinking the row) the row hasn't moved from block 0x0140000c

tab 0, row 0, @0xb3
tl: 9 fb: --H----- lb: 0x2  cc: 0
nrid:  0x0140000c.7

After the fourth update (making space, and growing the row too much) the row moves back home

tab 0, row 0, @0x4c1
tl: 1208 fb: --H-FL-- lb: 0x2  cc: 2
col  0: [ 1]  80
col  1: [1200]
78 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

My calls to dump blocks included the blocks where the row migrated to, so we’ll have a look at the target locations (as given by the original row location’s nrid) in those blocks over time. First we check block 0x0140000b, row 1 after the first two migrations:

tab 0, row 1, @0x1d7f
tl: 414 fb: ----FL-- lb: 0x2  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [400]

tab 0, row 1, @0x1d7f
tl: 2 fb: ---DFL-- lb: 0x1

After the first migration (the row arrives here) we have a “head rowid” (hrid) pointer telling us where the row came from. After the second migration, when the row has moved on, we simply have a typical “deleted stub” – two bytes reserving the row directory entry until the commit has been done and cleaned out.

Then we can examine the second target (0x140000c, row 7) on the second and third and fourth updates:

tab 0, row 7, @0x1966
tl: 814 fb: ----FL-- lb: 0x2  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [800]

tab 0, row 7, @0xb1
tl: 62 fb: ----FL-- lb: 0x1  cc: 2
hrid: 0x0140000a.0
col  0: [ 1]  80
col  1: [50]

tab 0, row 7, @0xb1
tl: 2 fb: ---DFL-- lb: 0x2

As you can see, on arrival this location gets the original rowid as its “head rowid” (hrid), and it knows nothing about the intermediate block where the row was briefly in transit. I’ve copied the length byte (in square brackets) of column 1 in the dumps so that you can see that the row stayed put as it shrank. We can then see on the last update that we are left with a deleted stub in this block as the row migrates back to its original location when we try to extend it beyond the free space in this block.

Migrated rows are only ever one step away from home. It’s not nice to have too many of them, but it’s not necessarily a disaster.

February 9, 2014

IOT Catalogue

Filed under: Infrastructure,IOT,Oracle — Jonathan Lewis @ 6:52 pm BST Feb 9,2014

I’ve just added a catalogue of Richard Foote’s articles on IOTs to the list I made a couple of years ago of Martin Widlake’s articles, so this is just a temporary note to point people to the updated list.

February 2, 2014


Filed under: clusters,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 10:27 pm BST Feb 2,2014

It doesn’t matter which bit of Oracle technology you want to use, eventually someone, somewhere, runs into the special case where something nasty happens. Here’s an edge case for people using (index) clusters – Oracle Bug  17866999 ora-1499 for cluster following rman convert

It comes from a conversation on Oracle-L where Jack van Zanen reported a problem of inconsistent results after migrating data between platforms using rman to converts some tablespaces containing index clusters. This is the starting post where he shows a query that is clearly getting the wrong answer (select where channel_number = 503 obviously shouldn’t return data with channel_number 501).

Unfortunately the “audit-trail” on this problem is incomplete because I accidentally started sending email to Jack instead of the list (reply, rather than reply all) – but given that the problem column was the cluster key of an index cluster the obvious guess was that something had gone wrong as the data for this key value had grown and chained into a second (or third) block. To confirm this I asked Jack to select the rowids reported for the cluster and use the dbms_rowid package to convert them into distinct (file_number, block_number) values:

	dbms_rowid.rowid_relative_fno(rowid)	file_no,
	dbms_rowid.rowid_block_number(rowid)	block_no,
	count(*)				rows_in_block
	&m_table	t1
group by
order by

With a few file and block numbers in hand, we then needed to see the results from dumping the blocks. Jack sent me the trace files, and I found that my guess was correct – and also discovered that I had been on a private discussion, so I posted the results back to the list. As you can see from that posting, the cluster key (indicated by the initial K in the flag byte, and by its presence in “tab 0″ in the block) is 4 columns long, and I’ve got entries from two blocks where the cluster keys are pointing to each other – but the cluster keys don’t hold the same value.

It’s an extraordinary corruption – the data structures appear to be totally self-consistent and perfectly clean, and yet the data content is wrong.  (Usually I find that a corruption means the data becomes unusable).

Oracle eventually tracked this down and reproduced it; it’s a problem that can arise if you are using index clusters and use rman to do a cross-platform conversion – and it’s not going to be easy to fix it. I don’t suppose many people are likely to see this bug (it’s been around for years, apparently, but there aren’t many people who use clusters, and of those perhaps very few have migrated between platforms using rman) – but at present the only workaround is to export and reimport the data; which is not nice when you’re talking about terabytes of clusters.

January 14, 2014

Single block reads

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 6:52 pm BST Jan 14,2014

When a “cache read” tablescan (or index fast full scan) takes place we generally expect to see waits on “db file scattered read” as Oracle performs multi-block reads to do the scan. But we all know that Oracle will skip over blocks that are already in the cache and can therefore end up doing multi-block reads of many different sizes, even down to the point where it does single block reads (waiting for “db file sequential read”).

A quirky little question came up on OTN a little while ago: “for a large table we expect multiblock reads to be positioned at the end of the LRU for prompt re-use; but when Oracle does a single block read as part of a tablescan does it go to the end of the LRU (because it’s part of a large tablescan) or does it go to the mid-point of the LRU (because it’s a single block read)?”

The description of how blocks are treated in a tablescan has been simplified, of course, but the question is still valid – so what’s the answer, and how (without going into an extreme level of detail) would you demonstrate it ?


January 13, 2014


Filed under: Infrastructure,Oracle — Jonathan Lewis @ 10:09 am BST Jan 13,2014

I see Tanel has just started a short series of articles on where the X$ data come from so, for those with time on their hands (?anyone?) here’s a little script I wrote a few years ago when I wanted to check which v$ objects corresponded to the same internal structure in the SGA: (as v$session and v$session_wait used to)

It starts with a function that has to be created in the SYS schema – so no doing this on production systems. The it’s a pipelined function so that we can treat its output like a table, which means I need to create an object type and a table type before creating the function.  In the function I select each x$ name from the list of x$ names (x$kqfta) and for each table I construct a dynamic SQL statement selecting the first row from the table.

Over the versions of Oracle, various bugs have resulted in deadlocks and crashes when selecting a row from particular X$ objects, so there a short list of exceptions that has grown gradually over time.  The code has some calls to dbms_output(), which you can choose to hide by setting serveroutput off.  Depending on your system you may see a very large number of X$ with no data in them.

create type jpl_row as
	object(x_address raw(8), x_name varchar2(30), x_indx number);

create type jpl_table as
	table of jpl_row;

create or replace function x_first
return jpl_table pipelined
/*	Created by x_link.sql	*/
--	Pipeline function to examine addresses of data in x$ objects
--	Uses types jpl_row and jpl_table to implement the pipeline
	dg_broker_problem exception;
	pragma EXCEPTION_INIT(dg_broker_problem, -16525);

	logminer_problem exception;
	pragma EXCEPTION_INIT(logminer_problem, -1306);

	logminer_problem2 exception;
	pragma EXCEPTION_INIT(logminer_problem2, -1307);

	mem_advisor_problem exception;
	pragma exception_init(mem_advisor_problem, -1007);

	deadlock exception;
	pragma EXCEPTION_INIT(deadlock, -60);

	no_table exception;
	pragma EXCEPTION_INIT(no_table, -942);

	m_addr		x$kqfta.addr%type;
	m_indx		x$kqfta.indx%type;
	m_table		varchar2(30);

	for c1 in (select kqftanam, indx from x$kqfta) loop
			execute immediate
				' select indx, addr, ''' || c1.kqftanam ||
				''' from ' || c1.kqftanam ||
				' where rownum = 1'
				into m_indx, m_addr, m_table;
				dbms_output.put_line(m_addr || ' ' || m_table || ' ' || c1.indx);
				pipe row (jpl_row(m_addr, m_table, c1.indx));
			when no_data_found then
				dbms_output.put_line('Warning: No data-' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when no_table then
				dbms_output.put_line('Warning: No table-' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when logminer_problem then
				dbms_output.put_line('Warning: logminer issue' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when logminer_problem2 then
				dbms_output.put_line('Warning: logminer issue' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when dg_broker_problem then
				dbms_output.put_line('Warning: DataGuard issue' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when mem_advisor_problem then
				dbms_output.put_line('Warning: Memory Advisor' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when deadlock then
				dbms_output.put_line('Warning: deadlock' || ' ' || c1.kqftanam || ' ' || c1.indx);
			when others then
				dbms_output.put_line('Warning: other' || ' ' || c1.kqftanam || ' ' || c1.indx);
	end loop;

show errors

Once the function has been created it’s a simple matter to call it, ordering the output by (for example) x_addr.

order by
	x_address, x_indx

X_ADDRESS        X_NAME                             X_INDX
---------------- ------------------------------ ----------
0000000000000000 X$KSUTM                                41
                 X$KSUSGIF                              42
                 X$KTTVS                               477
                 X$KDXHS                               487
                 X$KZSRO                               528
00000000099EC500 X$KGSKCP                              139
00000000099EC560 X$KGSKPP                              140
00000000099EC5D0 X$KGSKASP                             141
000000000BEE9560 X$KSFDFTYP                            119
000000000C0F5B1C X$KQDPG                               495
000000000C0F7660 X$KQFTVRTTST0                         863

0000000060000000 X$KSMMEM                               71
0000000060001190 X$KSMSP_DSNEW                          82
00000000600077CC X$KGICS                               513
00000000AAE1B588 X$KSQST                                32
00000000AAE35598 X$KSRPCIOS                            118
00007FFB03E92478 X$DBGDIREXT                           820
00007FFB04274F50 X$KSMSP                                75
00007FFB045D4E28 X$ACTIVECKPT                          270
00007FFB093A7B48 X$KXFPSST                             569
00007FFB093A9D48 X$KOCST                               638

473 rows selected.

This was 64-bit Oracle version – and I’ve only recorded data in 473 of the possible 950 x$ structures.
As far as Tanel’s series is concerned a key detail here is the way in which you can break the range into 4 chunks:

  • The zeros – x$ which don’t take any memory but simply return the result of a real-time function call.
  • The low range (up to C0F7660) which is a fixed data segment (whose technical name in x86-speak I forget) in the SGA
  • The mid range (60000000 to AAE35598) which is the SGA heap
  • The high range (from 7FFB03E9247 upwards) which is user space (PGA/UGA addressing)

As Tanel’s first post explains, and as you can infer from the sample listing above, when you query x$ksmsp, you are querying a structure in user space..



January 8, 2014

CR Trivia

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 6:44 am BST Jan 8,2014

Everybody “knows” that when you do a tablescan of a table that it starts with two buffer gets on the segment header, though older versions (ca. 8i and earlier) of Oracle used to do 4 buffer gets on the segment header. The upshot of this is that many people still say that if you create a table and insert a single row then you’re going to get 3 buffer gets when you tablescan a table: two for the segment header and one for the data block:

So here’s a test, with the second set of autotrace stats which, for reasons I’ll describe shortly, may not be immediately reproducible on your system:

create table t1 (n1 number);
insert into t1 values(1);
execute dbms_stats.gather_table_stats(user,'t1');

set autotrace traceonly statistics
select * from t1;

          1  recursive calls
          0  db block gets
         22  consistent gets
          0  physical reads
          0  redo size
        540  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

That really is 22 buffer gets to find the one row in the table. It gets worse; I used another session to insert (and commit) a second row in the table and the tablescan to return the two rows took 36 buffer gets.

If you want to get the same results you have to be a bit careful and a bit lucky. I was using ASSM (automatic segment space management) on a tablespace with an 8MB uniform extent size; when I inserted the single row into the table Oracle formatted a batch of 16 consecutive blocks in the first extent, deciding which range of blocks to format based on my process id. When I inserted my row, the block I inserted into was again dictated by my process id – this happened to be the 10th block in the formatted range.

When I ran the tablescan I did 6 gets on the segment’s space management blocks to work out what parts of the segment were formatted, then 10 gets to find the first row in the 10th block of the range, then a further 6 gets to scan the rest of the formatted range. I do not know why it takes 6 gets to read the space management blocks – but it may be two each for the segment header block, L2 bitmap block, and L1 bitmap block – it may be four on the segment header and one each on the other two blocks.

When I inserted a second row from another session (which I had set up very carefully), Oracle decided to format another batch of 16 blocks. This highlighted a little variation on what I’ve said so far. Because I had 8MB uniform extents the first 18 blocks of the first extent were all space management blocks (segment header, one L2 bitmap, and 16 L1 bitmaps – in the opposite order to that description). Oracle formats on boundaries of 16 blocks in the extent and in this case formatted the 14 blocks that would take it from block 19 to block 32 of the extent. That’s basically why my tablescan after the second insert took an extra 14 gets.

If you really do need to scan a “very small” table extremely frequently (and you shouldn’t, really) then it might be a good idea to check how many blocks have been formatted compared to the number of blocks used (dbms_rowid.rowid_block_number() will help with that check) and do an “alter table move” because in this special case Oracle won’t use the “format 16 blocks somewhere” strategy, it will format only as many blocks as needed starting from the first available block in the segment. For some reason the number of gets on space management blocks is also minimised in this case so that tablescan of a “single block” takes only 2 + number of blocks in table.

For ASSM segments Oracle maintains a Low High Water Mark (LHWM) and a High High Water Mark (HHWM). Every block below the LHWM is guaranteed to be formatted, blocks between the LHWM and HHWM will be formatted in batches of 16; so when doing a segment scan Oracle uses the largest multiblock-read it can from the start of the segment to the LHWM, then keeps checking the space management blocks to identify the batches of 16 that it can then read (and it can read consecutive formatted batches in a single read, so the reads are not necessarily limited to 16 blocks at a time).

One little side thought – because the choice of block and block range is dictated by the process id, the pattern of data insertion and contention using shared servers can be very different from the pattern produced by dedicated servers.


January 6, 2014

LOB changes

Filed under: Infrastructure,LOBs,Oracle,Troubleshooting — Jonathan Lewis @ 7:10 pm BST Jan 6,2014

It’s always useful to collect baseline information – especially when it helps you notice that the baseline has moved in a way that might explain the next performance problem you see. Here’s an example demonstrating the benefit.

I have a table with a LOB column that is stored out of line. Many years ago I decided that I wanted to compare how the redo generation varied as I change the LOB from cached to nocache (with nologging). So here was one of my simplest test scripts (stripped to a minimum):

create table t1 (
	id		number,
	cached_lob	clob
lob (cached_lob) store as text_lob_cache(
	disable storage in row 
--	nocache logging	

insert into t1
	rownum <= 1000

When I first ran this test (twice, one for each option) I noticed that cached LOBs recorded just the redo needed to describe the data that had been written into them (so at 1,000 rows of 4KB each that would be in the ballpark of 4MB with some overheads), but the nocache option created redo for the entire block (so nearer 8MB, plus overheads). There were, of course, all sorts of other side effects of the change – some of them very nasty – but this was one of the more important differences to note.

A little while ago I was prompted to repeat the test – so I ran the whole thing three times on,, and; it’s fascinating to see the way the key redo stats change:
Name                                                                     Value
----                                                                     -----
redo entries                                                             5,588
redo size                                                            5,288,836

redo entries                                                             5,578
redo size                                                            9,407,824
redo entries                                                             5,223
redo size                                                            5,277,520

redo entries                                                             5,220
redo size                                                            1,212,992
redo size for direct writes                                             52,000
redo entries                                                             5,231
redo size                                                            9,343,444

redo entries                                                             5,291
redo size                                                            1,301,700
redo size for direct writes                                             60,000

As you can see the results mirror the first results I got – the redo size is consistent with the volume of data when the LOBs are cached, and matches whole block dumps when they’re not cached – but something very clever seems to appear in the results – the redo for the lob component of nocache LOBs seems to have disappeared – there’s a drop of (roughly) 8,192,000 bytes from the 10g figures: is this a clever trick (filesystem-like logging ?) or is it a reporting error. Then 12c is even stranger – the redo for the cached LOBs now matches block logging (8MB + overheads) instead of data logging – and the redo for the nocache LOBs is still “missing”.

To be investigated further – possibly by dumping redo log files, but initially simply by watching when log file switches take place – to see if the changes are reporting errors or genuine changes.

Next Page »

The Rubric Theme Blog at


Get every new post delivered to your Inbox.

Join 3,453 other followers