Oracle Scratchpad

February 19, 2015

255 columns

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 12:45 am GMT Feb 19,2015

Here’s a quick note, written and some strange time in (my) morning in Hong Kong airport as I wait for my next flight – all spelling, grammar, and factual errors will be attributed to jet-lag or something.

And a happy new year to my Chinese readers.

You all know that having more than 255 columns in a table is a Bad Thing ™ – and surprisingly you don’t even have to get to 255 to hit the first bad thing about wide tables. If you’ve ever wondered what sorts of problems you can have, here are a few:

  • If you’re still running 10g and gather stats on a table with more than roughly 165 columns then the query Oracle uses to collect the stats will only handle about 165 of them at a time; so you end up doing multiple (possibly sampled) tablescans to gather the stats. The reason why I can’t give you an exact figure for the number of columns is that it depends on the type and nullity of the columns – Oracle knows that some column types are fixed length (e.g. date types, char() types) and if any columns are declared not null then Oracle doesn’t have to worry about counting nulls – so for some of the table columns Oracle will be able to eliminate one or two of the related columns it normally includes in the stats-gathering SQL statement – which means it can gather stats on a few more table columns.  The 165-ish limit doesn’t apply in 11g – though I haven’t checked to see if there’s a larger limit before the same thing happens.
  • If you have more than 255 columns in a row Oracle will split it into multiple row pieces of 255 columns each plus one row piece for “the rest”; but the split counts from the end, so if you have a table with 256 columns the first row-piece has one column the second row-piece has 255 columns. This is bad news for all sorts of operations because Oracle will have to expend extra CPU chasing the the row pieces to make use of any column not in the first row piece. The optimists among you might have expected “the rest” to be in the last row piece. If you want to be reminded how bad row-chaining can get for wide tables, just have a look at an earlier blog note of mine (starting at this comment).
  • A particularly nasty side effect of the row split comes with direct path tablescans – and that’s what Oracle does automatically when the table is large. In many cases all the row pieces for a row will be in the same block; but they might not be, and if a continuation row-piece is in a different block Oracle will do a “db file sequential read” to read that block into the buffer cache.  As an indication of how badly this can affect performance, the results I got at a recent client site showed “select count(col1) from wide_table” taking 10  minutes while “select count(column40) from wide_table” took 22 minutes because roughly one row in a hundred required a single block read to follow the chain. An important side effect of the split point is that you really need to put the columns you’re going to index near the start of the table to minimise the risk of this row chaining overhead when you create or rebuild an index.
  • On top of everything else, of course, it takes a surprisingly large amount of extra CPU to load a large table if the rows are chained. Another client test reported 140 CPU seconds to load 5M rows of 256 columns, but only 20 CPU seconds to load 255.

If you are going to have tables with more than 255 columns, think very carefully about column order – if you can get all the columns that are almost always null at the end of the row you may get lucky and find that you never need to create a secondary row piece. A recent client had about 290 columns in one table of 16M rows, and 150 columns were null for all 16M rows – unfortunately they had a mandatory “date_inserted” column at the end of the row, but with a little column re-arrangement they eliminated row chaining and saved (more than) 150 bytes storage per row.  Of course, if they have to add and back-fill a non-null column to the table they’re going to have to rebuild the table to insert the column “in the middle”, otherwise all new data will be chained and wasting 150 bytes per row, and any old data that gets updated will suffer a row migration/chain catastrophe.

January 26, 2015

In-memory DB

Filed under: 12c,in-memory,Infrastructure,Oracle — Jonathan Lewis @ 8:18 am GMT Jan 26,2015

A recent thread on the OTN database forum supplied some code that seemed to show that In-memory DB made no difference to performance when compared with the traditional row-store mechanism and asked why not.  (It looked as if the answer was that almost all the time for the tests was spent returning the 3M row result set to the SQL*Plus client 15 rows at a time.)

The responses on the thread led to the question:  Why would the in-memory (column-store) database be faster than simply having the (row-store) data fully cached in the buffer cache ?

Maria Colgan has addressed this question in part 3 of her series on In-Memory Database (see catalogue below), but I thought I’d repeat the basic ideas with a few imaginative numbers thrown in to give a better flavour of what’s going on. So imagine you have a table sized at 100GB, with 100 columns of data where every column holds data of a similar size and pattern; and want to execute a query of the form: select {list of columns} from big_table where colX >=  {some constant}.

Traditional Tablescan (approximation) with table fully cached

For each block of the 100GB, Oracle has to acquire the “cache buffers chains” latch, pin the block, drop the latch, and scan the block testing each row, then acquire the latch, unpin the block, and drop the latch.  Scanning the block requires a walk through the row directory and, for each row pointer, jumping to the correct location in the block for the row, stepping along the row one column at a time to get to the correct row, and then checking the column  value. If the column matches the predicate extract, format and return the required columns from that row.

It’s a lot of memory to scan, in a large number of small steps, involving a lot of latching and pinning – which translates into a lot of CPU. On the plus side, although it’s very expensive to identify the required rows, it’s very cheap to construct and return a row once you’ve identified it.

In-memory scan (approximation)

  1. Given the way I’ve described the table (100GB, 100 similar columns), Oracle can recreate it in memory as 100 lists of 1GB each; so we can identify the rows we want by scanning one of those lists and applying the predicate – so only 1GB of (fairly contigious) memory to scan, rather than 100GB, and virtually no latching and pinning to find that memory, and no jumping around following pointers and counting along rows.
  2. But it’s probably NOT 1GB of memory to scan, because Oracle has some compression/deduplication methods that it can apply to the data that often reduces the memory requirement by a factor of four of five – so perhaps it’s only 250 MB of memory to scan.
  3. But Oracle breaks large lists into chunks, so rather than 250MB of contiguous memory, it’s likely to be 250 chunks of 1MB; and as part of the description of each chunk Oracle records the lowest and highest value in the chunk; [ed:  Christian Antognini says that the metadata list every distinct value for the chunk] so it can check the predicate against the boundary values on the chunk and be in a position to discard entire chunks without looking at their content. So, depending on the actual content and pattern of the data, we may examine only a handful of chunks in detail, dropping the scan from 250MB to, perhaps, 10MB.
  4. And we still haven’t finished there, because the code that handles the column-store is able to take advantage of the SIMD (Single Instruction,  Multiple Data) operations in the CPU to check the predicate against 4, or possibly even 8, values simultaneously at a speed matching a single column comparison that has to be used for the traditional cached row-store. (Given the way that Oracle  (probably) handles the compression, I suspect that this final advantage is only significant for range-based predicates – but that’s purely conjectural).

So the huge benefit you get from In-Memory column store, compared to “fully cached row-store”  is that you are likely to scan far less memory to identify the rows that match your predicate, and do it with far less “infrastructure” activity like latching and pinning. The potential saving in CPU usage is huge.

There is, of course, a penalty to pay. As you identify the rows of interest you can (in effect) construct a bitmap representing the position of those rows in the table (and if you have predicates on more than 1 column you can use bitmap operations on the individual column bitmaps to identify the rows you want in the final result) but then you have to construct the row that goes into the result set. If your query is interested in just 5 columns that means using the bitmap to locate the correct entry from each of 5 separate column lists; if your query is interested in 99 column that means extracting the correct entry from each of 99 separate column lists. Identifying the rows you want can be very  quick, building the final result may be relatively slow.

Soundbite summary

  • Using the In-memory Database, you can identify the rows you want very quickly but it’s relatively slow to reconstruct them.
  • Using a fully cached traditional row-store, it’s relatively slow to identify the rows you want, but once you’ve found them you spend no time reconstructing them.

Bear in mind that this is an extremely simplified analysis and ignores all sorts of details about read-consistency, the probability of physical reads, the probability of scanning blocks instead of scanning chunks, and so on; my focus is only on the type of activity that differentiates row-store handling from column-store handling when all the data is in memory so that you can have some appreciation of why the benefits available from In-memory DB can vary with the pattern of the data and the way you use it.

Catalogue of blog posts by Maria Colgan and Andy Rivenes:

Other articles on In-memory DB:

 

January 21, 2015

LOB Space

Filed under: ASSM,Infrastructure,LOBs,Oracle — Jonathan Lewis @ 1:26 pm GMT Jan 21,2015

Following on from a recent “check the space” posting, here’s another case of the code not reporting what you thought it would, prompted by a question on the OTN database forum about a huge space discrepancy in LOBs.

There’s a fairly well-known package called dbms_space that can give you a fairly good idea of the space used by a segment stored in a tablespace that’s using automatic segment space management. But what can you think when a piece of code (written by Tom Kyte, no less) reports the following stats about your biggest LOB segment:


Unformatted Blocks .....................             107
FS1 Blocks (0-25)  .....................               0
FS2 Blocks (25-50) .....................               0
FS3 Blocks (50-75) .....................               0
FS4 Blocks (75-100).....................               0
Full Blocks        .....................         859,438
Total Blocks............................       1,746,304
Total Bytes.............................  14,305,722,368

Of the available 1.7M blocks approximately 890,000 seem to have gone missing!

I tend to think that the first thing to do when puzzled by unexpected numbers is to check for patterns in the arithmetic. First (though not particularly interesting) the LOB segment seems to be using the standard 8KB blocksize: 1,746,304 * 8192 = 14,305,722,368; more interestingly, although only approximately true, the number of full blocks is pretty close to half the total blocks – does this give you a hint about doing a little test.


create table test_lobs (
        id              number(5),
        text_content    clob
)
lob (text_content) store as text_lob(
        disable storage in row
        chunk 32K
        tablespace test_8k_1m_assm
)
;

begin
        for i in 1..1000 loop
                insert into test_lobs values(
                        i, 'x'
                );
                commit;
        end loop;
end;
/

I’ve created a table with a LOB segment, storing LOBs out of row with a chunk size of 32KB in a tablespace which (using my naming convention) is locally managed, 8KB blocksize, uniform 1MB extents, using ASSM. So what do I see if I try to check the space usage through calls to the dbms_space package ? (There’s some sample code to do this in the comments of the blog I linked to earlier, but I’ve used some code of my own for the following – the first figure reported is the blocks, the second the bytes):


====
ASSM
====

Unformatted                   :      119 /    3,899,392
Freespace 1 (  0 -  25% free) :        0 /            0
Freespace 2 ( 25 -  50% free) :        0 /            0
Freespace 3 ( 50 -  75% free) :        0 /            0
Freespace 4 ( 75 - 100% free) :        0 /            0
Full                          :    1,000 /   32,768,000

=======
Generic
=======
Segment Total blocks: 4224
Object Unused blocks: 0

Apparently the segment has allocated 4,224 blocks, but we’ve only used 1,000 of them, with 119 unformatted and 3,105 “missing”; strangely, though, the 1,000 “Full blocks” are simultaneously reported as 32,768,000 bytes … and suddenly the light dawns. The dbms_space package is NOT counting blocks, it’s counting chunks; more specifically it’s counting “bits” in the bitmap space management blocks for the LOB segment and (I think I’ve written this somewhere, possibly as far back as Practical Oracle 8i) the bits in a LOB segment represent chunks, not blocks.

Conclusion:

The OP has set a 16KB chunksize with a 8KB block size. His numbers look fairly self-consistent: 1,746,304 –  (2 * 859,438) – 107 = 27,321; the difference is about 1.6% of the total allocation, which is in the right ballpark for the space management blocks, especially if the segment is in a tablespace using with 1MB uniform extents.

 

January 16, 2015

Spatial space

Filed under: Infrastructure,Oracle,Statistics,Troubleshooting — Jonathan Lewis @ 1:00 pm GMT Jan 16,2015

One thing you (ought to) learn very early on in an Oracle career is that there are always cases you haven’t previously considered. It’s a feature that is frequently the downfall of “I found it on the internet” SQL.  Here’s one (heavily paraphrased) example that appeared on the OTN database forum a few days ago:

select table_name,round((blocks*8),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

select table_name,round((num_rows*avg_row_len/1024),2)||’kb’ “size” from user_tables where table_name = ‘MYTABLE';

The result from the first query is 704 kb,  the result from the second is 25.4 kb … fragmentation, rebuild, CTAS etc. etc.

The two queries are perfectly reasonable approximations (for an 8KB block size, with pctfree of zero) for the allocated space and actual data size for a basic heap table – and since the two values here don’t come close to matching it’s perfectly reasonable to consider doing something like a rebuild or shrink space to reclaim space and (perhaps) to improve performance.

In this case it doesn’t look as if the space reclaimed is likely to be huge (less than 1MB), on the other hand it’s probably not going to take much time to rebuild such a tiny table; it doesn’t seem likely that the rebuild could make a significant difference to performance (though apparently it did), but the act of rebuilding might cause execution plans to change for the better because new statistics might appear as the rebuild took place. The figures came from a test system, though, so maybe the table on the production system was much larger and the impact would be greater.

Being cautious about wasting time and introducing risk, I made a few comments about the question –  and learned that one of the columns was of type SDO_GEOMETRY. This makes a big difference about what to do next, because dbms_stats.gather_table_stats() doesn’t process such columns correctly, which results in a massive under-estimate for the avg_row_len (which is basically the sum of avg_col_len for the table). Here’s an example (run on 12c, based on some code taken from the 10gR2 manuals):


drop table cola_markets purge;

CREATE TABLE cola_markets (
  mkt_id NUMBER,
  name VARCHAR2(32),
  shape SDO_GEOMETRY);

INSERT INTO cola_markets VALUES(
  1,
  'cola_a',
  SDO_GEOMETRY(
    2003,  -- two-dimensional polygon
    NULL,
    NULL,
    SDO_ELEM_INFO_ARRAY(1,1003,3), -- one rectangle (1003 = exterior)
    SDO_ORDINATE_ARRAY(1,1, 5,7) -- only 2 points needed to
          -- define rectangle (lower left and upper right) with
          -- Cartesian-coordinate data
  )
);

insert into cola_markets select * from cola_markets;
/
/
/
/
/
/
/
/
/

execute dbms_stats.gather_table_stats(user,'cola_markets')
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

analyze table cola_markets compute statistics;
select
	avg_row_len, num_rows, blocks,
	round(avg_row_len * num_rows / 7200,0) expected_blocks
from user_tables where table_name = 'COLA_MARKETS';

If you care to count the number of times I execute the “insert as select” it’s 10, so the table ends up with 2^10 = 1024 rows. The 7,200 in the calculated column converts bytes to approximate blocks on the assumption of 8KB blocks and pctfree = 10. Here are the results following the two different methods for generating object statistics:


PL/SQL procedure successfully completed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
         14       1024        124               2

Table analyzed.

AVG_ROW_LEN   NUM_ROWS     BLOCKS EXPECTED_BLOCKS
----------- ---------- ---------- ---------------
        109       1024        124              16

Where does the difference in Expected_blocks come from ? (The Blocks figures is 124 because I’ve used 1MB uniform extents – 128 block – under ASSM (which means 4 space management blocks at the start of the first extent.)

Here are the column lengths after the call to dbms_stats: as you can see the avg_row_len is the sum of avg_col_len.


select column_name, data_type, avg_col_len
from   user_tab_cols
where  table_name = 'COLA_MARKETS'
order by
        column_id
;

COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             3
NAME                 VARCHAR2                           7
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             0
SYS_NC00004$         NUMBER                             4
SYS_NC00005$         NUMBER                             0
SYS_NC00006$         NUMBER                             0
SYS_NC00007$         NUMBER                             0
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

The figures from the analyze command are only slightly different, but fortunately the analyze command uses the row directory pointers to calculate the actual row allocation, so picks up information about the impact of inline varrays, LOBs, etc. that the dbms_stats call might not be able to handle.


COLUMN_NAME          DATA_TYPE                AVG_COL_LEN
-------------------- ------------------------ -----------
MKT_ID               NUMBER                             2
NAME                 VARCHAR2                           6
SYS_NC00010$         SDO_ORDINATE_ARRAY
SHAPE                SDO_GEOMETRY
SYS_NC00008$         NUMBER                             1
SYS_NC00004$         NUMBER                             3
SYS_NC00005$         NUMBER                             1
SYS_NC00006$         NUMBER                             1
SYS_NC00007$         NUMBER                             1
SYS_NC00009$         SDO_ELEM_INFO_ARRAY

As a basic reminder – whenever you do anything slightly non-trivial (e.g. something you couldn’t have done in v5, say) then remember that all those dinky little script things you find on the Internet might not actually cover your particular case.

December 23, 2014

Just in case

Filed under: Infrastructure,Oracle,Performance,redo — Jonathan Lewis @ 10:37 am GMT Dec 23,2014

For those who don’t read Oracle-l and haven’t found Nikolay Savvinov’s blog, here’s a little note pulling together a recent question on Oracle-L and a relevant (and probably unexpected) observation from the blog. The question (paraphrased) was:

The developers/data modelers are creating all the tables with varchar2(4000) as standard by default “Just in case we need it”. What do you think of this idea?

The general answer was that it’s a bad idea (and unnecessary, anyway) and one specific threat that got mentioned was the problem of creating indexes and Oracle error ORA-01450; but coincidentally Nikolay Savvinov had written about a performance-related “bug” in August this year, which turned out, only last week, to be expected behaviour. You can read his articles for the details, but since he used a 4KB block size to demonstrate it I thought I’d repeat the exercise using an 8KB block size.

 



drop table t1 purge;
create table t1 (id number(6), v1 varchar(40), v2 varchar2(40), v3 varchar2(40));
create unique index t1_i1 on t1(id);

execute snap_redo.start_snap

insert into t1 
select	object_id, object_name, object_name, object_name 
from	all_objects
where	rownum <= 10000
;

execute snap_redo.end_snap


drop table t1 purge;
create table t1 (id number(6), v1 varchar(4000), v2 varchar2(4000), v3 varchar2(4000));
create unique index t1_i1 on t1(id);

execute snap_redo.start_snap

insert into t1 
select	object_id, object_name, object_name, object_name 
from	all_objects
where	rownum <= 10000
;

execute snap_redo.end_snap

I’ve dropped and created the same table twice, once with varchar2(40) columns and once with varchar2(4000) columns.

I’ve created an index on a (non-character) column – the specific results vary depending on whether the index is unique or non-unique, and whether or not you have the index, and whether or not the table already holds data, and the effective clustering on the index columns etc. etc. but the key difference between the two sets of results doesn’t go away.

I’ve inserted object_name values (maximum usage 32 bytes) into the varchar2() columns, inserting 10,000 rows.

The snap_redo package is one of my simple pre/post packages that calculates changes in values in some dynamic performance view – in this case it’s looking at v$sysstat (system statistics) for statistics relating to redo generation, which means you need to run this test on an otherwise idle instance. Here are the two sets of results from an instance of 11.2.0.4:


Name                                                                     Value
----                                                                     -----
messages sent                                                               11
messages received                                                           11
calls to kcmgcs                                                            313
calls to kcmgas                                                             37
calls to get snapshot scn: kcmgss                                           74
redo entries                                                               769
redo size                                                            1,317,008
redo wastage                                                             3,888
redo writes                                                                 11
redo blocks written                                                      2,664
redo write time                                                             10
redo blocks checksummed by FG (exclusive)                                2,242
redo ordering marks                                                          1
redo subscn max counts                                                       1
redo synch time                                                              7
redo synch time (usec)                                                  88,875
redo synch time overhead (usec)                                          1,810
redo synch time overhead count (<2 msec)                                    11
redo synch writes                                                           11
redo write info find                                                        11
undo change vector size                                                261,136
rollback changes - undo records applied                                      2
IMU undo allocation size                                                17,184


Name                                                                     Value
----                                                                     -----
messages sent                                                                8
messages received                                                            8
calls to kcmgcs                                                            222
calls to kcmgas                                                             56
calls to get snapshot scn: kcmgss                                           52
redo entries                                                            20,409
redo size                                                            5,606,872
redo buffer allocation retries                                               1
redo wastage                                                             1,248
redo writes                                                                  6
redo blocks written                                                     11,324
redo write time                                                             26
redo blocks checksummed by FG (exclusive)                                  571
redo ordering marks                                                         32
redo subscn max counts                                                       1
redo synch time                                                              6
redo synch time (usec)                                                  60,230
redo synch time overhead (usec)                                            159
redo synch time overhead count (<2 msec)                                     1
redo synch writes                                                            1
redo write info find                                                         1
undo change vector size                                              1,590,520
IMU undo allocation size                                                   144

Notice, particularly, the great change in the number of redo entries and the total redo size when the character columns are defined at varchar2(4000). Note particularly that the number of redo entries is roughly “2 * number of rows inserted” – for each row that’s one for the row and one for the index entry. You can check the redo log content by dump the log file, of course (and Nikolay did), or you can take my word for it that Oracle is doing the equivalent of single row processing in the varchar2(4000) case and array processing in the varchar2(40) case.

When Oracle calculates that the row length is larger than the block size it falls back to single row processing; this can increase your redo generation significantly, and since the rate at which you can pump out redo is the ultimate rate at which you can load data this could have a significant impact on your data loading times. Declaring character columns as varchar2(4000) “just in case” is a bad idea.

 

November 14, 2014

Shrink Tablespace

Filed under: fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 6:16 pm GMT Nov 14,2014

If you start moving objects around to try and reclaim space in a tablespace there are all sorts of little traps that make it harder than you might hope to get the maximum benefit with the minimum effort.  I’ve written a couple of notes in the past about how to proceed and, more recently, one of the difficulties involved. This is just a brief note about a couple of ideas to make life a little easier.

  • Check that you’ve emptied the recyclebin before you start
  • Before you try moving/rebuilding an object check that the total free space “below” that object is greater than  the size of the object or you’ll find that parts of the object move “up” the tablespace.
  • Before moving a table, mark its indexes as unusable. If you do this then (in recent versions of Oracle) the default behaviour is for the index space be freed as the segment vanishes and you may find that the extents of the table can move further “down” the tablespace.  (If you’ve kept tables and indexes in different tablespaces this is irrelevant, of course).
  • When you move an object think a little carefully about whether specifying an minimum initial extent size would help or hinder the move.
  • Don’t assume that moving the “highest” object first is the best strategy – work out where you expect the final tablespace HWM to be and you may find that it makes more sense to move other objects that are above the point first.
  • Moving objects with several small (64KB) extents first may allow you to free up larger (1MB, 8MB) gaps that can be used by other larger objects”””””
  • Creating a new tablespace and moving objects to it from the old tablespace “top down” may be the quickest way to proceed.  Work towards dropping the old tablespace HWM regularly.

 

November 7, 2014

Quiz night

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 6:37 pm GMT Nov 7,2014

Prompted by an email from Yves Colin (who’ll be presenting on the Tuesday of UKOUG Tech14) I was prompted to dig out a little script I wrote some years ago and re-run an old test, leading to this simple question:  what’s the largest size array insert that Oracle will handle ?

If you’re tempted to answer, watch out – it’s not exactly a trick question, but there is a bit of a catch.

Answer:

There is an internal limit of 255 on the size of array that Oracle can insert as a unit. I leave it as an exercise to the reader to decide whether or not this makes any significant difference to performance, since the effects of row size, number of indexes maintained, and disk I/O requirements may make the effects of the limit virtually undetectable. To demonstrate the limit all we need do is insert a few hundred rows into a table and do a block dump, optionally followed by a dump of the redo log.


create table t1 (n1 number not null) segment creation immediate;

alter system switch logfile;

insert into t1 select rownum from all_objects where rownum <= 1000;
commit;

execute dump_log

start rowid_count t1

/*
   FILE_NO   BLOCK_NO ROWS_IN_BLOCK
---------- ---------- -------------
         5        180           660
         5        184           340

2 rows selected.


ROWS_IN_BLOCK     BLOCKS
------------- ----------
          340          1
          660          1

2 rows selected.
*/

alter system flush buffer_cache;

-- SQL> alter system dump datafile 5 block 180;

The dump_log procedure is simply a pl/sql wrapper for a call to ‘alter system dump logfile {current log}'; I the script rowid_count.sql extracts the file and block numbers from rowids in the given table and aggregates them in different ways.  The reason for running the script is to find a table block with a lot of rows in it; the block I dumped actually held the first 660 rows of the insert. Here’s a tiny extract from the block dump (with one little comment added):


tab 0, row 0, @0x1904			-- 6,400 dec
tl: 6 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [ 2]  c1 02
tab 0, row 1, @0x190a
tl: 6 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [ 2]  c1 03
tab 0, row 2, @0x1910
tl: 6 fb: --H-FL-- lb: 0x1  cc: 1
col  0: [ 2]  c1 04
tab 0, row 3, @0x1916
tl: 6 fb: --H-FL-- lb: 0x1  cc: 1

This shows the first four rows inserted, and you can see that the offset to the first row is 6,400 bytes, and then each subsequent row is 6 bytes further down the block (when the numbers get larger the row lengths will increase to 7 bytes). The positioning of these rows is, at first sight, a little odd – you might have guessed that they would either start at offset “zero” and work down to offset “8K”, or start at “8K” and work backwards up to zero – why is the first row 3/4 of the way down the block ?

Rather than show you more row dumps, I’ll give you a carefully selected extract from the row directory:


0x12:pri[0]     offs=0x1904
0x14:pri[1]     offs=0x190a

...

0x20c:pri[253]  offs=0x1f8a
0x20e:pri[254]  offs=0x1f91		End of first 255 rows
0x210:pri[255]  offs=0x120e		Start of second 255 rows
0x212:pri[256]  offs=0x1215

...

0x40a:pri[508]  offs=0x18f6
0x40c:pri[509]  offs=0x18fd		End of second 255 rows
0x40e:pri[510]  offs=0xdf5		Start of last 150 rows
0x410:pri[511]  offs=0xdfc

...

0x536:pri[658]  offs=0x1200		End of last 150 rows
0x538:pri[659]  offs=0x1207

The first 255 rows inserted are stacked at the bottom of the block at offsets 0x1904 to 0x1f91.
The second 255 rows inserted are stacked above them at offsets 0x120e to 0x18fd (note 0x18fd + 6 = 0x1903)
The last 150 rows inserted are stack above them at offsets 0xdf5 to 0x1207 (note 0x1207 + 6 = 0x120d)

No matter how large your attempted array insert, the maximum number of rows (or index entries) Oracle can insert into a block in a single internal array operation is 255.

Further corroboration comes from the redo log dump – here’s a tiny bit of a single change vector (i.e. a single atomic change to a single Oracle block) from the redo generated while this insert was going on:


CHANGE #18 CON_ID:0 TYP:0 CLS:1 AFN:5 DBA:0x014000c7 OBJ:95876 SCN:0x0000.007528dd SEQ:1 OP:11.11 ENC:0 RBL:0 FLG:0x0000
KTB Redo
op: 0x01  ver: 0x01
compat bit: 4 (post-11) padding: 1
op: F  xid:  0x0007.007.000009f2    uba: 0x01000bda.02a9.14
KDO Op code: QMI row dependencies Disabled
  xtype: XA flags: 0x00000000  bdba: 0x014000c7  hdba: 0x01400083
itli: 1  ispac: 0  maxfr: 4858
tabn: 0 lock: 1 nrow: 255
slot[0]: 0
tl: 6 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 2]  c1 02
slot[1]: 1
tl: 6 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 2]  c1 03
slot[2]: 2
tl: 6 fb: --H-FL-- lb: 0x0  cc: 1
col  0: [ 2]  c1 04

Although the CON_ID in the line 1 tells you this dump is from 12c the same limit holds across all (modern) versions of Oracle. The operation (OP 11.11 – in line 1) is “Insert Multiple Rows”, and at line 9 you can see: “nrow: 255″. I haven’t included the rest of the change vector, but all it does is show the remaining 252 rows.

October 13, 2014

Memory

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 5:24 pm GMT Oct 13,2014

On a client site recently, experimenting with a T5-2 – fortunately a test system – we decided to restart an instance with a larger SGA. It had been 100GB, but with 1TB of memory and 256 threads (2 sockets, 16 cores per socket, 8 threads per core) it seemed reasonable to crank this up to 400GB for the work we wanted to do.

It took about 15 minutes for the instance to start; worse, it took 10 to 15 seconds for a command-line call to SQL*Plus on the server to get a response; worse still, if I ran a simple “ps -ef” to check what processes were running the output started to appear almost instantly but stopped after about 3 lines and hung for about 10 to 15 seconds before continuing. The fact that the first process name to show after the “hang” was one of the Oracle background processes was a bit of hint, though.

Using truss on both the SQL*Plus call and on the ps call, I found that almost all the elapsed time was spent in a call to shmatt (shared memory attach); a quick check with “ipcs – ma” told me (as you might guess) that the chunk of shared memory identified by truss was one of the chunks allocated to Oracle’s SGA. Using pmap on the pmon process to take a closer look at the memory I found that it consisted of a few hundred pages sized at 256MB and a lot of pages sized at 8KB; this was a little strange since the alert log had told me that the instance was allocating about 1,600 memory pages of 256MB (i.e. 400GB) and 3 pages of 8KB – clearly a case of good intentions failing.

It wasn’t obvious what my next steps should be – so I bounced the case off the Oak Table … and got the advice to reboot the machine. (What! – it’s not my Windows PC, it’s a T5-2.) The suggestion worked: the instance came up in a few seconds, with shared memory consisting of a few 2GB pages, a fair number of 256MB pages, and a few pages of other sizes (including 8KB, 64KB and 2MB).

There was a little more to the advice than just rebooting, of course; and there was an explanation that fitted the circumstances. The machine was using ZFS and, in the absence of a set limit, the file system cache had at one point managed to acquire 896 GB of memory. In fact when we first tried to start the instance at with a 400GB SGA Oracle couldn’t start up at all until the system administrator had reduced the filesystem cache and freed up most of the memory; even then so much of the memory had been allocated originally in 8KB pages that Oracle had made a complete mess of building a 400GB memory map.

I hadn’t passed all these details to the Oak Table but the justification for the suggested course of action (which came from Tanel Poder) sounded like a perfect description of what had been happening up to that point. In total his advice was:

  • limit the ZFS ARC cache (with two possible strategies suggested)
  • use sga_target instead of memory_target (to avoid a similar problem on memory resize operations)
  • start the instance immediately after the reboot

Maxim: Sometimes the solution you produce after careful analysis of the facts looks exactly like the solution you produce when you can’t think of anything else to do.

October 1, 2014

Shrink Tablespace

Filed under: fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 2:55 pm GMT Oct 1,2014

In a comment on my previous post on shrinking tablespaces Jason Bucata and Karsten Spang both reported problems with small objects that didn’t move to the start of the tablespace. This behaviour is inevitable with dictionary managed tablespaces (regardless of the size of the object), but I don’t think it’s likely to happen with locally managed tablespaces if they’ve been defined with uniform extent sizes. Jason’s comment made me realise, though, that I’d overlooked a feature of system allocated tablespaces that made it much harder to move objects towards the start of file. I’ve created a little demo to illustrate the point.

I created a new tablespace as locally managed, ASSM, and auto-allocate, then created a few tables or various sizes. The following minimal SQL query reports the resulting extents in block_id order, adding in a “boundary_1m” column which subtracts 128 blocks (1MB) from the block_id, then divides by 128 and truncates to show which “User Megabyte” in the file the extent starts in.  (Older versions of Oracle typically have an 8 block space management header, recent versions expanded this from 64KB to 1MB – possibly as a little performance aid to Exadata).


select
        segment_name, block_id, blocks , trunc((block_id - 128)/128) boundary_1M
from
        dba_extents where owner = 'TEST_USER'
order by
        block_id
;

SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T1                              128       1024           0
T1                             1152       1024           8
T2                             2176       1024          16
T2                             3200       1024          24
T3                             4224          8          32
T4                             4232          8          32
T5                             4352        128          33

As you can see t3 and t4 are small tables – 1 extent of 64KB each – and t5, which I created after t4, starts on the next 1MB boundary. This is a feature of auto-allocate: not only are extents (nearly) fixed to a small number of possible extent sizes, the larger extents are restricted to starting on 1MB boundaries and the 64KB extents are used preferentially to fill in odd-sized” holes. To show the impact of this I’m going to drop table t1 (at the start of file) to make some space.


SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T2                             2176       1024          16
T2                             3200       1024          24
T3                             4224          8          32
T4                             4232          8          32
T5                             4352        128          33

Now I’ll move table t3 – hoping that it will move to the start of file and use up some of the space left by t1. However there’s a 1MB area (at boundary 32) which is partially used,  so t3 moves into that space rather than creating a new “partly used” megabyte.


SEGMENT_NAME               BLOCK_ID     BLOCKS BOUNDARY_1M
------------------------ ---------- ---------- -----------
T2                             2176       1024          16
T2                             3200       1024          24
T4                             4232          8          32
T3                             4240          8          32
T5                             4352        128          33

It’s a little messy trying to clear up the tiny fragments and make them do what you want. In this case you could, for example, create a dummy table with storage(initial 64K next 64K minextents 14) to use up all the space in the partly used megabyte, then move t3 – which should go to the start of file – then move table t4 – which should go into the first partly-used MB (i.e. start of file) rather than taking up the hole left by t3.

Even for a trivial example it’s messy – imagine how difficult it can get to cycle through building and dropping suitable dummy tables and move objects in the right order when you’ve got objects with several small extents scattered through the file, and objects with a mixture of small extents and large extents.

September 21, 2014

Partitioned Clusters

Filed under: 12c,clusters,Infrastructure,Oracle — Jonathan Lewis @ 7:28 pm GMT Sep 21,2014

In case you hadn’t noticed it, partitioning has finally reached clusters in 12c – specifically 12.1.0.2. They’re limited to hash clusters with range partitioning, but it may be enough to encourage more people to use the technology. Here’s a simple example of the syntax:


create cluster pt_hash_cluster (
        id              number(8,0),
        d_date          date,
        small_vc        varchar2(8),
        padding         varchar2(100)
)
-- single table
hashkeys 10000
hash is id
size 700
partition by range (d_date) (
        partition p2011Jan values less than (to_date('01-Feb-2011','dd-mon-yyyy')),
        partition p2011Feb values less than (to_date('01-Mar-2011','dd-mon-yyyy')),
        partition p2011Mar values less than (to_date('01-Apr-2011','dd-mon-yyyy'))
)
;

I’ve been waiting for them to appear ever since 11.2.0.1 and the TPC-C benchmark that Oracle did with them – they’ve been a long time coming (check the partition dates – that gives you some idea of when I wrote this example).

Just to add choice (a.k.a. confusion) 12.1.0.2 has also introduce attribute clustering so you can cluster data in single tables without creating clusters – but only while doing direct path loads or table moves. The performance intent is similar, though the technology and circumstances of use are different.

September 19, 2014

Shrink Tablespace

Filed under: fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 12:10 pm GMT Sep 19,2014

A recent question on the OTN database forum raised the topic of returning free space in a tablespace to the operating system by rebuilding objects to fill the gaps near the start of files and leave the empty space at the ends of files so that the files could be resized downwards.

This isn’t a process that you’re likely to need frequently, but I have written a couple of notes about it, including a sample query to produce a map of the free and used space in a tablespace. While reading the thread, though, it crossed my mind that recent versions of Oracle introduced a feature that can reduce the amount of work needed to get the job done, so I thought I’d demonstrate the point here.

When you move a table its indexes become unusable and have to be rebuilt; but when an index becomes unusable, the more recent versions of Oracle will drop the segment. Here’s a key point – if the index becomes unusable because the table has been moved the segment is dropped only AFTER the move has completed. Pause a minute for thought and you realise that the smart thing to do before you move a table is to make its indexes unusable so that they release their space BEFORE you move the table. (This strategy is only relevant if you’ve mixed tables and indexes in the same tablespace and if you’re planning to do all your rebuilds into the same tablespace rather than moving everything into a new tablespace.)

Here are some outputs demonstrating the effect in a 12.1.0.2 database. I have created (and loaded) two tables in a tablespace of 1MB uniform extents, 8KB block size; then I’ve created indexes on the two tables. Running my ts_hwm.sql script I get the following results for that tablespace:


FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 TEST_USER  T1              TABLE
                256         383 TEST_USER  T2              TABLE
                384         511 TEST_USER  T1_I1           INDEX
                512         639 TEST_USER  T2_I1           INDEX
                640      65,535 free       free

Notice that it was a nice new tablespace, so I can see the two tables followed by the two indexes at the start of the tablespaces. If I now move table t1 and re-run the script this is what happens:


alter table t1 move;

FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 free       free
                256         383 TEST_USER  T2              TABLE
                384         511 free       free
                512         639 TEST_USER  T2_I1           INDEX
                640         767 TEST_USER  T1              TABLE
                768      65,535 free       free

Table t1 is now situated past the previous tablespace highwater mark and I have two gaps in the tablespace where t1 and the index t1_i1 used to be.

Repeat the experiment from scratch (drop the tables, purge, etc. to empty the tablespace) but this time mark the index unusable before moving the table and this is what happens:


FILE_ID    BLOCK_ID   END_BLOCK OWNER      SEGMENT_NAME    SEGMENT_TYPE
------- ----------- ----------- ---------- --------------- ------------------
      5         128         255 free       free
                256         383 TEST_USER  T2              TABLE
                384         511 TEST_USER  T1              TABLE
                512         639 TEST_USER  T2_I1           INDEX
                640      65,535 free       free

Table t1 has moved into the space vacated by index t1_i1, so the tablespace highwater mark has not moved up.

If you do feel the need to reclaim space from a tablespace by rebuilding objects, you can find that it’s quite hard to decide the order in which the objects should be moved/rebuilt to minimise the work you (or rather, Oracle) has to do; if you remember that any table you move will release its index space anyway and insert a step to mark those indexes unusable before you move the table you may find it’s much easier to work out a good order for moving the tables.

Footnote: I appreciate that some readers may already take advantage of the necessity of rebuilding indexes by dropping indexes before moving tables – but I think it’s a nice feature that we can now make them unusable and get the same benefit without introducing a risk of error when using a script to recreate an index we’ve dropped.

 

September 8, 2014

ASSM Truncate.

Filed under: ASSM,Infrastructure,Oracle,Troubleshooting — Jonathan Lewis @ 11:34 am GMT Sep 8,2014

Here’s one that started off with a tweet from Kevin Closson, heading towards a finish that shows some interesting effects when you truncate large objects that are using ASSM. To demonstrate the problem I’ve set up a tablespace using system allocation of extents and automatic segment space management (ASSM).  It’s the ASSM that causes the problem, but it requires a mixture of circumstances to create a little surprise.


create
	tablespace test_8k_auto_assm
	datafile	-- OMF
	SIZE 1030M
	autoextend off
	blocksize 8k
	extent management local
	autoallocate
	segment space management auto
;

create table t1 (v1 varchar2(100)) pctfree 99 tablespace test_8k_auto_assm storage(initial 1G);

insert into t1 select user from dual;
commit;

alter system flush buffer_cache;

truncate table t1;

I’ve created a table with an initial definition of 1GB, which means that (in a clean tablespace) the autoallocate option will jump straight to extents of 64MB, with 256 table blocks mapped per bitmap block for a total of 32 bitmap blocks in each 64MB extent. Since I’m running this on 11.2.0.4 and haven’t included “segment creation immediate” in the definition I won’t actually see any extents until I insert the first row.

So here’s the big question – when I truncate this table (using the given command) how much work will Oracle have to do ?

Exchanging notes over twitter (140 char at a time) and working from a model of the initial state, it took a little time to get to understand what was (probably) happening and then produce this silly example – but here’s the output from a snapshot of v$session_event for the session across the truncate:


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
local write wait                                    490           0          83.26        .170          13
enq: RO - fast object reuse                           2           0         104.90      52.451         105
db file sequential read                              47           0           0.05        .001           0
db file parallel read                                 8           0           0.90        .112           0
SQL*Net message to client                            10           0           0.00        .000           0
SQL*Net message from client                          10           0           0.67        .067         153
events in waitclass Other                             2           0           0.04        .018         109

The statistic I want to highlight is the number recorded against “local write wait”: truncating a table of one row we wait for 490 blocks to be written! We also have 8 “db file parallel read”  waits which, according to a 10046 trace file, were reading hundreds of blocks. (I think the most significant time in this test – the RO enqueue wait – may have been waiting for the database writer to complete the work needed for an object checkpoint, but I’m not sure of that.)

The blocks written were the space management bitmap blocks for the extent(s) that remained after the truncate – even the ones that referenced extents above the high water mark for the table. Since we had set the tables initial storage to 1GB, we had a lot of bitmap blocks. At 32 per extent and 16 extents (64MB * 16 = 1GB) we might actually expect something closer to 512 blocks, but actually Oracle had formatted the last extent with only 8 space management blocks. and the first extent had an extra 2 to cater for the level 2 bitmap lock and segment header block giving: 32 * 15 + 8 + 2 = 490.

As you may have seen above, the impact on the test that Kevin was doing was quite dramatic – he had set the initial storage to 128GB (lots of bitmap blocks), partitioned the table (more bitmap blocks) and was running RAC (so the reads were running into waits for global cache grants).

I had assumed that this type of behaviour happened only with the “reuse storage” option of the truncate command: and I hadn’t noticed before that it also appeared even if you didn’t reuse storage – but that’s probably because the effect applies only to the bit you keep, which may typically mean a relatively small first extent. It’s possible, then, that in most cases this is an effect that isn’t going to be particularly visible in production systems – but if it is, can you work around it ? Fortunately another tweeter asked the question “What happens if you ‘drop all storage?'” Here’s the result from adding that clause to my test case:


Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: RO - fast object reuse                           1           0           0.08        .079           0
log file sync                                         1           0           0.03        .031           0
db file sequential read                              51           0           0.06        .001           0
SQL*Net message to client                            10           0           0.00        .000           0
SQL*Net message from client                          10           0           0.56        .056         123
events in waitclass Other                             3           0           0.87        .289         186


Looking good – if you don’t keep any extents you don’t need to make sure that their bitmaps are clean. (The “db file sequential read” waits are almost all about the data dictionary, following on from my “flush buffer cache”).

Footnote 1: the same effect appears in 12.1.0.2
Footnote 2: it’s interesting to note that the RO enqueue wait time seems to parallel the local write wait time: perhaps a hint that there’s some double counting going on. (To be investigated, one day).

August 19, 2014

LOB Length

Filed under: Infrastructure,LOBs,Oracle,Performance — Jonathan Lewis @ 6:06 pm GMT Aug 19,2014

It’s funny how you can make little savings in work all over the place in Oracle if you’re prepared to look a little closely at what’s going on. Here’s a quirky little example with LOBs and function calls that might just have some greater relevance in other situations. Here’s a little data set, and two queries that I might run against it:


create table tbl(
	c1      clob
)
lob (c1) store as c_lob(
	disable storage in row
	nocache nologging
)
;

begin
	for i in 1..128 loop
		insert into tbl values(rpad('x',4000));
		commit;
	end loop;
end;
/

commit;

-- collect stats, prime dictionary cache and library cache
-- run both queries twice and check stats on second run

select
	round(sum(ceil(len/8100)) * 8/1024,0)    used_mb
from
	(
	select
		/*+ no_merge */
		dbms_lob.getlength(c1) len
	from
		tbl
	)
where
	len > 3960
;

select
	round(sum(ceil(len/8100)) * 8/1024,0)    used_mb
from
	(
	select
		rownum rn, dbms_lob.getlength(c1) len
	from
		tbl
	)
where
	len > 3960
;

The question that you might ask yourselves when you see these queries is: will they do similar amounts of work. Of course, I wouldn’t be asking the question if the answer were yes. Despite the no_merge() hint, which you might think would have the same effect as the rownum approach, Oracle seems to execute the call to dbms_lob.getlength() twice for each row in the first query, but only once per row for the second query. Here are the stats (from autotrace) on the second run of the two queries when autotrace is enabled:


Statistics (for no_merge)
----------------------------------------------------------
         40  recursive calls
          0  db block gets
        271  consistent gets
          0  physical reads
          0  redo size
        541  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Statistics (for rownum)
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        131  consistent gets
          0  physical reads
          0  redo size
        541  bytes sent via SQL*Net to client
        544  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

As you can see, the consistent gets for the no_merge() approach is roughly double that for the rownum approach – and since we have 128 rows/LOBs in the table that looks suspiciously like 2 gets vs. 1 get per LOB depending on the approach – which suggests two calls to the function. This is further corroborated by the execution plans, and especially by the predicate sections (how often have I said “always check the predicates”) which show that the predicate has been pushed inside the view that’s been hinted to be non-mergeable, but it hasn’t been pushed inside the view that uses the rownum instantion trick:


Execution Plan for no_merge()
----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    13 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    13 |            |          |
|   2 |   VIEW              |      |     6 |    78 |     2   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL| TBL  |     6 |   522 |     2   (0)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DBMS_LOB"."GETLENGTH"("C1")>3960)

Execution Plan for rownum
-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |    13 |     2   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE      |      |     1 |    13 |            |          |
|*  2 |   VIEW               |      |   128 |  1664 |     2   (0)| 00:00:01 |
|   3 |    COUNT             |      |       |       |            |          |
|   4 |     TABLE ACCESS FULL| TBL  |   128 | 11136 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("LEN">3960)

My first thought on seeing this difference was to apply the /*+ no_push_pred */ hint to block predicate pushing – but then I remembered that the hint is about join predicate pushing and this is a simple filter push. A quick search of the hidden parameters, though, revealed this:

_optimizer_filter_pushdown : enable/disable filter predicate pushdown

Setting this parameter to false – either through a call to ‘alter session’ or through an /*+ opt_param( opt_param(‘_optimizer_filter_pushdown’ , ‘false’) */ hint – allowed the no_merge approach to produce the same plan and resource usage as the rownum approach. Of course, for a production system, I’d probably use the rownum approach rather than mess around with hidden parameters.

Footnote:

I don’t know why the code with the no_merge() approach reported 40 recursive calls (on its first execution with autotrace). A couple of variations on the experiment suggested that it had something to do with the number of rows (or consequential buffer visits) that survived the predicate call – for a sufficiently small number of rows the recursive call count happened to drop to zero; but the phenomenon needs further investigation.

July 13, 2014

Deferrable RI – 2

Filed under: Infrastructure,Locks,Oracle,Troubleshooting — Jonathan Lewis @ 7:46 pm GMT Jul 13,2014

A question came up on Oracle-L recently about possible locking anomalies with deferrable referential integrity constraints.

An update by primary key is taking a long time; the update sets several columns, one of which is the child end of a referential integrity constraint. A check on v$active_session_history shows lots of waits for “enq: TX – row lock contention” in mode 4 (share), and many of these waits also identify the current object as the index that has been created to avoid the “foreign key locking” problem on this constraint (though many of the waits show the current_obj# as -1). A possible key feature of the issue is that foreign key constraint is defined as “deferrable initially deferred”. The question is, could such a constraint result in TX/4 waits.

My initial thought was that if the constraint was deferrable it was unlikely, there would have to be other features coming into play.

Of course, when the foreign key is NOT deferrable it’s easy to set up cases where a TX/4 appears: for example you insert a new parent value without issuing a commit then I insert a new matching child, at that point my session will wait for your session to commit or rollback. If you commit my insert succeeds if you rollback my session raises an error (ORA-02291: integrity constraint (schema_name.constraint_name) violated – parent key not found). But if the foreign key is deferred the non-existence (or potential existence, or not) of the parent should matter.  If the constraint is deferrable, though, the first guess would be that you could get away with things like this so long as you fixed up the data in time for the commit.

I was wrong. Here’s a little example:


create table parent (
	id	number(4),
	name	varchar2(10),
	constraint par_pk primary key (id)
)
;

create table child(
	id_p	number(4)
		constraint chi_fk_par
		references parent
		deferrable initially deferred,
	id	number(4),
	name	varchar2(10),
	constraint chi_pk primary key (id_p, id)
)
;

insert into parent values (1,'Smith');
insert into parent values (2,'Jones');

insert into child values(1,1,'Simon');
insert into child values(1,2,'Sally');

insert into child values(2,1,'Jack');
insert into child values(2,2,'Jill');

commit;

begin
	dbms_stats.gather_table_stats(user,'parent');
	dbms_stats.gather_table_stats(user,'child');
end;
/

pause Press return

update child set id_p = 3 where id_p = 2 and id = 2;

If you don’t do anything after the pause and before the insert then the update will succeed – but fail on a subsequent commit unless you insert parent 3 before committing. But if you take advantage of the pause to use another session to insert parent 3 first, the update will then hang waiting for the parent insert to commit or rollback – and what happens next may surprise you. Basically the deferrability doesn’t protect you from the side effects of conflicting transactions.

The variations on what can happen next (insert the parent elsewhere, commit or rollback) are interesting and left as an exercise.

I was slightly surprised to find that I had had a conversation about this sort of thing some time ago, triggered by a comment to an earlier post. If you want to read a more thorough investigation of the things that can happen and how deferrable RI works then there’s a good article at this URL.

 

July 2, 2014

Comparisons

Filed under: Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 5:09 pm GMT Jul 2,2014

Catching up (still) from the Trivadis CBO days, here’s a little detail which had never crossed my mind before.


where   (col1, col2) < (const1, const2)

This isn’t a legal construct in Oracle SQL, even though it’s legal in other dialects of SQL. The logic is simple (allowing for the usual anomaly with NULL): the predicate should evaluate to true if (col1 < const1), or if (col1 = const1 and col2 < const2). The thought that popped into my mind when Markus Winand showed a slide with this predicate on it – and then pointed out that equality was the only option that Oracle allowed for multi-column operators – was that, despite not enabling the syntax, Oracle does implement the mechanism.

If you’re struggling to think where, it’s in multi-column range partitioning: (value1, value2) belongs in the partition with high value (k1, k2) if (value1 < k1) or if (value1 = k1 and value2 < k2).

Next Page »

The Rubric Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 4,746 other followers