Oracle Scratchpad

June 17, 2015

Reverse Key

Filed under: Indexing,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:11 pm BST Jun 17,2015

A question came up on the OTN database forum recently asking if you could have a partitioned index on a non-partitioned table.

(Aside: I’m not sure whether it would be quicker to read the manuals or try the experiment – either would probably be quicker than posing the question to the forum. As so often happens in these RTFM questions the OP didn’t bother to acknowledge any of the responses)

The answer to the question is yes – you can create a globally partitioned index, though if it uses range partitioning you have to specify a MAXVALUE partition. The interesting thing about the question, though is that several people tried to guess why it had been asked and then made suggestions based on the most likely guess (and wouldn’t it have been nice to see some response from the OP ). The common guess was that there was a performance problem with the high-value block of a sequence-based (or time-based) index – a frequent source of “buffer busy wait” events and other nasty side effects.

Unfortunately too many people suggesting reverse key as a solution to this “right-hand” problem. If you’re licensed for partitioning it’s almost certain that a better option would simple be to use global hash partitioning (with 2^N for some N) partitions. Using reverse keys can result in a bigger performance problem than the one you’re trying to avoid – you may end up turning a little time spent on buffer busy waits into a large amount of time spent on db file sequential reads. To demonstrate the issue I’ve created a sample script – and adjusted my buffer cache down to the appropriate scale:

create table t1(
	id	not null
)
nologging
as
with generator as (
	select	--+ materialize
		rownum id 
	from dual 
	connect by 
		rownum <= 1e4
)
select
	1e7 + rownum	id
from
	generator	v1,
	generator	v2
where
	rownum <= 1e7 
;

begin
	dbms_stats.gather_table_stats(
		ownname		 => user,
		tabname		 => 'T1'
	);
end;
/

alter table t1 add constraint t1_pk primary key(id) 
using index 
	reverse 
	nologging 
;

alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 8';

begin
	for i in 20000001..20010000 loop
		insert into t1 values(i);
	end loop;
end;
/

I’ve created a table with 10,000,000 rows using a sequential value as the primary key, then inserted “the next” 10,000 rows into the table in order. The index occupied about about 22,000 blocks, so to make my demonstration show you the type of effect you could get from a busy production system with more tables and many indexes I ran my test with the buffer cache limited to 6,000 blocks – a fair fraction of the total index size. Here’s a small section of the trace file from the test running 10.2.0.3 on an elderly machine:


WAIT #43: nam='db file sequential read' ela= 13238 file#=6 block#=12653 blocks=1 obj#=63623 tim=3271125590
WAIT #43: nam='db file sequential read' ela=  7360 file#=6 block#=12749 blocks=1 obj#=63623 tim=3271133150
WAIT #43: nam='db file sequential read' ela=  5793 file#=6 block#=12844 blocks=1 obj#=63623 tim=3271139110
WAIT #43: nam='db file sequential read' ela=  5672 file#=6 block#=12940 blocks=1 obj#=63623 tim=3271145028
WAIT #43: nam='db file sequential read' ela= 15748 file#=5 block#=13037 blocks=1 obj#=63623 tim=3271160998
WAIT #43: nam='db file sequential read' ela=  8080 file#=5 block#=13133 blocks=1 obj#=63623 tim=3271169314
WAIT #43: nam='db file sequential read' ela=  8706 file#=5 block#=13228 blocks=1 obj#=63623 tim=3271178240
WAIT #43: nam='db file sequential read' ela=  7919 file#=5 block#=13325 blocks=1 obj#=63623 tim=3271186372
WAIT #43: nam='db file sequential read' ela= 15553 file#=6 block#=13549 blocks=1 obj#=63623 tim=3271202115
WAIT #43: nam='db file sequential read' ela=  7044 file#=6 block#=13644 blocks=1 obj#=63623 tim=3271209420
WAIT #43: nam='db file sequential read' ela=  6062 file#=6 block#=13741 blocks=1 obj#=63623 tim=3271215648
WAIT #43: nam='db file sequential read' ela=  6067 file#=6 block#=13837 blocks=1 obj#=63623 tim=3271221887
WAIT #43: nam='db file sequential read' ela= 11516 file#=5 block#=13932 blocks=1 obj#=63623 tim=3271234852
WAIT #43: nam='db file sequential read' ela=  9295 file#=5 block#=14028 blocks=1 obj#=63623 tim=3271244368
WAIT #43: nam='db file sequential read' ela=  9466 file#=5 block#=14125 blocks=1 obj#=63623 tim=3271254002
WAIT #43: nam='db file sequential read' ela=  7704 file#=5 block#=14221 blocks=1 obj#=63623 tim=3271261991
WAIT #43: nam='db file sequential read' ela= 16319 file#=6 block#=14444 blocks=1 obj#=63623 tim=3271278492
WAIT #43: nam='db file sequential read' ela=  7416 file#=6 block#=14541 blocks=1 obj#=63623 tim=3271286129
WAIT #43: nam='db file sequential read' ela=  5748 file#=6 block#=14637 blocks=1 obj#=63623 tim=3271292163
WAIT #43: nam='db file sequential read' ela=  7131 file#=6 block#=14732 blocks=1 obj#=63623 tim=3271299489
WAIT #43: nam='db file sequential read' ela= 16126 file#=5 block#=14829 blocks=1 obj#=63623 tim=3271315883
WAIT #43: nam='db file sequential read' ela=  7746 file#=5 block#=14925 blocks=1 obj#=63623 tim=3271323845
WAIT #43: nam='db file sequential read' ela=  9208 file#=5 block#=15020 blocks=1 obj#=63623 tim=3271333239
WAIT #43: nam='db file sequential read' ela=  7708 file#=5 block#=15116 blocks=1 obj#=63623 tim=3271341141
WAIT #43: nam='db file sequential read' ela= 15484 file#=6 block#=15341 blocks=1 obj#=63623 tim=3271356807
WAIT #43: nam='db file sequential read' ela=  5488 file#=6 block#=15437 blocks=1 obj#=63623 tim=3271362623
WAIT #43: nam='db file sequential read' ela= 10447 file#=6 block#=15532 blocks=1 obj#=63623 tim=3271373342
WAIT #43: nam='db file sequential read' ela= 12565 file#=6 block#=15629 blocks=1 obj#=63623 tim=3271386741
WAIT #43: nam='db file sequential read' ela= 17168 file#=5 block#=15725 blocks=1 obj#=63623 tim=3271404135
WAIT #43: nam='db file sequential read' ela=  7542 file#=5 block#=15820 blocks=1 obj#=63623 tim=3271411882
WAIT #43: nam='db file sequential read' ela=  9400 file#=5 block#=15917 blocks=1 obj#=63623 tim=3271421514
WAIT #43: nam='db file sequential read' ela=  7804 file#=5 block#=16013 blocks=1 obj#=63623 tim=3271429519
WAIT #43: nam='db file sequential read' ela= 14470 file#=6 block#=16237 blocks=1 obj#=63623 tim=3271444168
WAIT #43: nam='db file sequential read' ela=  5788 file#=6 block#=16333 blocks=1 obj#=63623 tim=3271450154
WAIT #43: nam='db file sequential read' ela=  9630 file#=6 block#=16429 blocks=1 obj#=63623 tim=3271460008
WAIT #43: nam='db file sequential read' ela= 10910 file#=6 block#=16525 blocks=1 obj#=63623 tim=3271471174
WAIT #43: nam='db file sequential read' ela= 15683 file#=5 block#=16620 blocks=1 obj#=63623 tim=3271487065
WAIT #43: nam='db file sequential read' ela=  8094 file#=5 block#=16717 blocks=1 obj#=63623 tim=3271495454
WAIT #43: nam='db file sequential read' ela=  6670 file#=5 block#=16813 blocks=1 obj#=63623 tim=3271502293
WAIT #43: nam='db file sequential read' ela=  7852 file#=5 block#=16908 blocks=1 obj#=63623 tim=3271510360
WAIT #43: nam='db file sequential read' ela= 10500 file#=6 block#=17133 blocks=1 obj#=63623 tim=3271521039
WAIT #43: nam='db file sequential read' ela= 11038 file#=6 block#=17229 blocks=1 obj#=63623 tim=3271532275
WAIT #43: nam='db file sequential read' ela= 12432 file#=6 block#=17325 blocks=1 obj#=63623 tim=3271544974
WAIT #43: nam='db file sequential read' ela=  7784 file#=6 block#=17421 blocks=1 obj#=63623 tim=3271553331
WAIT #43: nam='db file sequential read' ela=  7774 file#=5 block#=17517 blocks=1 obj#=63623 tim=3271561346
WAIT #43: nam='db file sequential read' ela=  6583 file#=5 block#=17613 blocks=1 obj#=63623 tim=3271568146
WAIT #43: nam='db file sequential read' ela=  7901 file#=5 block#=17708 blocks=1 obj#=63623 tim=3271576231
WAIT #43: nam='db file sequential read' ela=  6667 file#=5 block#=17805 blocks=1 obj#=63623 tim=3271583259
WAIT #43: nam='db file sequential read' ela=  9427 file#=6 block#=18029 blocks=1 obj#=63623 tim=3271592988
WAIT #43: nam='db file sequential read' ela= 52334 file#=6 block#=18125 blocks=1 obj#=63623 tim=3271646055
WAIT #43: nam='db file sequential read' ela= 50512 file#=6 block#=18221 blocks=1 obj#=63623 tim=3271697284
WAIT #43: nam='db file sequential read' ela= 10095 file#=6 block#=18317 blocks=1 obj#=63623 tim=3271708095

Check the block number for this list of single block reads – we’re jumping through the index about 100 blocks at a time to read the next block where an index entry has to go. The jumps are the expected (and designed) effect of reverse key indexes: the fact that the jumps turn into physical disc reads is the (possibly unexpected) side effect. Reversing an index makes adjacent values look very different (by reversing the bytes) and go to different index leaf blocks: the purpose of the exercise is to scatter concurrent similar inserts across multiple blocks, but if you scatter the index entries you need to buffer a lot more of the index to keep the most recently used values in memory. Reversing the index may eliminate buffer busy waits, but it may increase time lost of db file sequential reads dramatically.

Here’s a short list of interesting statistics from this test – this time running on 11.2.0.4 on a machine with SSDs) comparing the effects of reversing the index with those of not reversing the index – normal index first:


Normal index
------------
CPU used by this session               83
DB time                                97
db block gets                      40,732
physical reads                         51
db block changes                   40,657
redo entries                       20,174
redo size                       5,091,436
undo change vector size         1,649,648

Repeat with reverse key index
-----------------------------
CPU used by this session              115
DB time                               121
db block gets                      40,504
physical reads                     10,006
db block changes                   40,295
redo entries                       19,973
redo size                       4,974,820
undo change vector size         1,639,232

Because of the SSDs there’s little difference in timing between the two sets of data and, in fact, all the other measures of work done are very similar except for the physical read, and the increase in reads is probably the cause of the extra CPU time thanks to both the LRU manipulation and the interaction with the operating system.

If you want to check the effect of index reversal you can take advantage of the sys_op_lbid() function to sample a little of your data – in my case I’ve queried the last 10,000 rows (values) in the table:


select 
	/*+ 
		cursor_sharing_exact 
		dynamic_sampling(0) 
		no_monitoring 
		no_expand 
		index_ffs(t1,t1_i1) 
		noparallel_index(t,t1_i1) 
	*/ 
	count (distinct sys_op_lbid( &m_ind_id ,'L',t1.rowid)) as leaf_blocks
from 
	t1
where 
	id between 2e7 + 1 and 2e7 + 1e4
;

The &m_ind_id substition variable is the object_id of the index t1_i1.

In my case, with an index of 22,300 leaf blocks, my 10,000 consecutive values were scattered over 9,923 leaf blocks. If I want access to “recent data” to be as efficient as possible I need to keep that many blocks of the index cached, compared to (absolute) worst case for my data 100 leaf blocks. When you reverse key an index you have to think about how much bigger you have to make your buffer cache to keep the performance constant.

12 Comments »

  1. So if I understand correctly, there are either fewer blocks with more contention per block or more blocks with less contention per block – but if there are more blocks we need room for more blocks.

    With hindsight, that makes sense…

    Comment by stewashton — June 17, 2015 @ 3:27 pm BST Jun 17,2015 | Reply

    • Stew,

      That’s correct – that’s why ASSM was introduced for tables, it avoids the problem of DBAs having to work out a suitable value for freelists with a relatively small physical change to the general pattern of data placement; but reverse key indexes take things to an extreme.

      Comment by Jonathan Lewis — June 19, 2015 @ 9:41 am BST Jun 19,2015 | Reply

  2. Hi Jonathan,
    In 12c there is a better solution to avoid sequence-based index contention – but unfortunately not documented – called partitioned sequences.
    It prefixes sequence values with a hash from the session id. It’s better than a hash from the sequence id because inserts from same session are going to the same block.
    I’ve an example at the end of: http://www.dbi-services.com/index.php/blog/entry/oracle-partitioned-sequences-a-future-new-feature-in-12c (search for ‘alter sequence’)
    Crossing fingers that we can use it in future versions.
    Regards,
    Franck.

    Comment by @FranckPachot — June 17, 2015 @ 4:48 pm BST Jun 17,2015 | Reply

    • Franck,

      Interesting features, thanks for the link.
      We’ll have to keep an eye on that to see if it ever gets official approval. Again, though, it’s an “easy” solution with potentially expensive side effects that should be considered carefully.

      Comment by Jonathan Lewis — June 19, 2015 @ 9:43 am BST Jun 19,2015 | Reply

      • And a quick test on a 3-node RAC – and instance 3 used the same range of partion values as instance 1.
        So not production-ready yet, it seems.

        Comment by Jonathan Lewis — June 20, 2015 @ 7:36 pm BST Jun 20,2015 | Reply

    • Franck,

      Just did a quick test on a 2-node RAC with partitions 16.

      It LOOKED LIKE the sequence became “nextval + N * 1e10”, where N was 0 – 7 for node 1 and 8 – 15 for node 2, with the actual value N independent of the SID and PID of the session. (Same SID and PID gave difference N over time if you reconnected – perhaps the serial#, logon time, or o/s process id affects the result). Looked like 7 and 15 weren’t used, though.

      Comment by Jonathan Lewis — June 19, 2015 @ 6:42 pm BST Jun 19,2015 | Reply

  3. Hello sir,
    This question is not much relevant to the topic under discussion but I have been looking for this for a long time

    When we say reverse key index makes insert faster, what is the actual reason ??
    Just Because it scatters index entries ,why?
    Consider a scenario
    We have a table with column ‘ABC’ as primary key in rac 4 node set up
    Suppose 10 different users are inserting rows in table at very high rate, because to insert new row we have to buffer it and in rac we have to transfer that block to all 4 node since subsequent rows will be inserted in the same block (we are inserting rows in increasing sequence)….the query will suffer greatly bcoz of transferring the same block across the 4 nodes…how reverse key index help here??
    It surely help in maintain index but for insert we need that block ..not index

    Comment by rajat sharma — June 18, 2015 @ 1:03 pm BST Jun 18,2015 | Reply

    • Rajat,

      I’m not sure I understood the question – are you asking about contention on the TABLE block, or are you asking about performance as we do an index range scan through the index ?

      If the former, remember that ASSM associates table blocks with instances and sessions for inserts, so that the table blocks shouldn’t suffer contention.
      If the latter, remember that (for a single column index) you cannot do an index range scan using a reverse key index so the question would be irrelevant.

      Comment by Jonathan Lewis — June 19, 2015 @ 9:48 am BST Jun 19,2015 | Reply

  4. Hello Jonathan , great topic, I just to add that it seems for range scans we need reverse key. I never tested it but it`s in the Oracle docs.

    Regards,
    Aurelian

    Comment by Aurelian — June 18, 2015 @ 10:26 pm BST Jun 18,2015 | Reply

    • Aurelian,

      If it’s in the Oracle docs and you think it’s worth mentioning and you haven’t tested it, then it would be useful to supply a URL to where in the Oracle Docs you found it.

      Since we can do index range scans without reversing indexes I believe you didn’t type what you meant to type. The problem with reverse key indexes and range-based predicates, and the scope of misunderstanding the threat, is one I’ve highlighted before, though: https://jonathanlewis.wordpress.com/2009/09/15/index-explosion-4/

      Comment by Jonathan Lewis — June 19, 2015 @ 9:57 am BST Jun 19,2015 | Reply

      • Thank you , another great article. This is the line in docs that confused me : “Note that you cannot use index range scans on an index with hash partitioning.”

        Comment by Aurelian — June 19, 2015 @ 1:56 pm BST Jun 19,2015 | Reply

        • Again, no URL, but I did a search for the phrase and found that the text does actually appear in (at least) the 11.2 manual: http://docs.oracle.com/cd/E11882_01/rac.112/e41960/design.htm#RACAD712

          It’s wrong, so I’m not surprised you were confused. I can imagine that it MIGHT have been true when hash partitioning was first introduced; or maybe the author means you won’t automatically see the data in the index order. Here’s a plan from 11.2 with a non-partitioned table and a hash partitioned index:

          
          --------------------------------------------------------------------------------------
          | Id  | Operation                    | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
          --------------------------------------------------------------------------------------
          |   0 | SELECT STATEMENT             |       |    12 |  1260 |     3 |       |       |
          |   1 |  PARTITION HASH ALL          |       |    12 |  1260 |     3 |     1 |     4 |
          |   2 |   TABLE ACCESS BY INDEX ROWID| T1    |    12 |  1260 |     3 |       |       |
          |*  3 |    INDEX RANGE SCAN          | T1_I1 |    12 |       |     2 |     1 |     4 |
          --------------------------------------------------------------------------------------
          
          Predicate Information (identified by operation id):
          ---------------------------------------------------
             3 - access("OBJECT_ID">=1000 AND "OBJECT_ID"<=1010)
          
          

          It could be fairly easy to miss the difference after recreating the table as a hash partitioned table with local index.

          
          --------------------------------------------------------------------------------------------
          | Id  | Operation                          | Name  | Rows  | Bytes | Cost  | Pstart| Pstop |
          --------------------------------------------------------------------------------------------
          |   0 | SELECT STATEMENT                   |       |    12 |  1260 |     3 |       |       |
          |   1 |  PARTITION HASH ALL                |       |    12 |  1260 |     3 |     1 |     4 |
          |   2 |   TABLE ACCESS BY LOCAL INDEX ROWID| T1    |    12 |  1260 |     3 |     1 |     4 |
          |*  3 |    INDEX RANGE SCAN                | T1_I1 |    12 |       |     2 |     1 |     4 |
          --------------------------------------------------------------------------------------------
          
          Predicate Information (identified by operation id):
          ---------------------------------------------------
             3 - access("OBJECT_ID">=1000 AND "OBJECT_ID"<=1010)
          
          
          

          Comment by Jonathan Lewis — June 19, 2015 @ 4:44 pm BST Jun 19,2015


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog at WordPress.com.