Oracle Scratchpad

August 5, 2022

drop partition

Filed under: Indexing,Infrastructure,Oracle — Jonathan Lewis @ 8:24 pm BST Aug 5,2022

This note is about some testing I did on the consequences of the (new in 12c) “deferred global index maintenance” feature that Oracle introduced as a strategy to reduce the impact of dropping partitions from a partitioned table.

Looking at my notes I see that I created my first test in August 2013 on Oracle 12.1.0.1 – probably after reading Richard Foote’s series on the topic.

At the time I didn’t turn my notes into a blog post but a recent request on the MOS Community Forum (needs a MOS account) prompted me to revisit and extend the tests using 19c.

  1. The Request
  2. The Background
  3. The Model
  4. Tests and Results
  5. Deep Dive
    1. dbms_space
    2. Tree dumps
    3. Dumping Redo
    4. Transactions and locking
  6. Summary
  7. Footnote

The Request

The database is 19.3, two-node RAC with a standby (type and function not specified). There is a table range-partitioned by month holding nearly three years of data. The table size is about 250GB with indexes totalling a further 250GB, and the OP wants to drop the partitions older than one year.

There was an issue doing this on some other environment when running the daily maintenance windows described as: “it consumes a lot of CPU but I could not find a link between both activities”.

Is there anything that could be done to avoid any db impact especially as this is a “production 24” environment?

There was no comment about how many indexes there were (and that’s an important detail), nor how many were global, globally partitioned, or local (also important details) [Ed: information later supplied – 14 indexes, all global], but there was a comment that the pmo_deferred_gidx_maint_job was run immediately after the drop, generated a lot of redo, and was still running after 10 hours – so it’s reasonably safe to assume that global indexes had a big impact since that’s the “partition maintenance operations deferred global index maintenance” job.

From the comment about the system being “production 24” I assume that the target is to come up with a strategy that doesn’t deny access to the users for a few hours, has the least possible impact on what they normally do, and doesn’t require the standby database to be (partially) rebuilt / unavailable.

Since this is Oracle 19c and the OP wants to drop nearly two-thirds of the data (i.e. significantly more than he’s keeping) the “obvious” strategy to investigate is dropping the partitions (online, with update indexes) then rebuilding the global (or globally partitioned) indexes (online).

At a minimum it would be sensible to do some modelling to get some idea of why the other system spent so much time in pmo_deferred_gidx_maint_job as this might allow you to work out either that it wouldn’t be a problem in this system, or that there was a variation on the method that would be better, or that you just don’t want to use the job because you’ve got a good idea of just how nasty it would be.

The Background

Deferred index maintenance means that global index maintenance does not take place when you drop a partition. Historically Oracle would, as part of the drop, delete every single index entry for the dropped partition from every single global index – doing a lot of work and taking a lot of time at the moment of the drop. Deferred maintenance means Oracle simply notes which object_ids no longer exist and then, when reading through a global index, ignores index entries where the rowid includes the object_id of a dropped object.

Note: the rowid stored in a B-tree index for a global index is made up of 4 components that require a total of 10 bytes of storage. In order of appearance these are: (object_id of table partition, tablespace-relative file_number, block_id, row number within block). For a local index or index on a simple heap table the object_id of the table can be inferred from the identity of the index so it is not stored and the rowid takes only 6 bytes of storage.

So the benefit of deferred maintenance is that dropping a partition takes virtually no time at all, but (a) Oracle has to clean up the garbage at some point and (b) until the garbage has been cleaned up it has to be read before it can be ignored.

A thought about the second point – if Oracle can check for dropped object_ids very efficiently then it doesn’t necessarily matter that you haven’t cleaned out the garbage. The continued presence of the “dropped” index entries won’t make your application run more slowly , it’s just that you won’t have achieved the (possible) benefit of a smaller index that might allow the application to run a little faster.

[Ed: see this comment from Mikhail Veilikikh, though, and my replies – there is an optimizer anomaly that means a specific optimizer feature may “disappear” from an index with orphans]

So here’s a hypothesis to explain why the OP’s previous experience of deferred maintenance was very slow : if you update global indexes in real time Oracle does that job as efficiently as possible because it can use key values and rowids from the table segment that it’s dropping to create a “delete array” for the index, which you used to detect in the sorts (rows) session statistics and a strange “insert” statement if you traced the operation:

insert 
        /*+ RELATIONAL("T1") NO_PARALLEL APPEND NESTED_TABLE_SET_SETID NO_REF_CASCADE */   
        into "TEST_USER"."T1"  partition ("P09000") 
select 
        /*+ RELATIONAL("T1") NO_PARALLEL  */ a
        *  
from    NO_CROSS_CONTAINER ( "TEST_USER"."T1" ) partition ("P09000")  
        delete global indexes

If you defer the maintenance Oracle has to walk through the index in order, one entry at a time, and work out whether or not to delete that entry – and we all know that single-row processing is more expensive than array-processing.

It’s worth noting that there are notes on MOS to support this hypothesis, e.g. Bug 27468233 : ALTER INDEX COALESCE CLEANUP IS GENERATING HUGE AMOUNT OF REDO reports an example of generating 23GB of redo while cleaning up an index of only 1.8GB. (Version 12.2.0.1)

So let’s build a model and do some simple tests.

The Model

I’m going to build a table with 6 (range) partitions and two global indexes. I’ll set up two very different patterns of data for the two indexes to see how much impact the data pattern might have.

I’ll drop the bottom three partitions, then clean up the mess in a variety of ways. There’s the call to the pmo_deferred_gidx_maint_job, which normally runs at 2:00 am daily but can be initiated by a call to dbms_scheduler.run_job; then there’s the dbms_part.cleanup_gidx() procedure that has a couple of options; then there’s a simple call to “alter index coalesce cleanup [only][parallel N]” (which needs improved documentation) and finally, of course, “alter index rebuild [online]”.

For at least some runs of the tests it will be worth enabling SQL_trace to see what happens behind the scenes; and it’s always worth checking the Session Activity Stats – and maybe some activity from some other dynamic performance views as well.

So here’s some code to create the test data set:

rem
rem     Script:         12c_global_index_maintenance_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested:
rem             19.11.0.0
rem

create table t1 
partition by range(id) (
        partition p09000 values less than ( 9000),
        partition p18000 values less than (18000),
        partition p27000 values less than (27000),
        partition p36000 values less than (36000),
        partition p45000 values less than (45000),
        partition p54000 values less than (54000)
)
as
select
        rownum - 1                      id,
        trunc(dbms_random.value(0,600)) n1,
        rpad('x',100)                   padding
from
        all_objects ao
where
        rownum <= 54000
;

create index t1_n1 on t1(n1) pctfree 0;
create index t1_id on t1(id) pctfree 0;

select 
        index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
from 
        user_indexes i, user_segments s 
where 
        i.index_name = s.segment_name 
and     i.table_name='T1' 
and     partitioned = 'NO'
;

alter table t1 drop partition p09000, p18000, p27000 update global indexes; 

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select 
        index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
from 
        user_indexes i, user_segments s 
where 
        i.index_name = s.segment_name 
and     i.table_name='T1' 
and     partitioned = 'NO'
;


For testing purposes I’ve set the index pctfree to 0; and I’ve reported some of the index stats before and after dropping the three partitions so that we can see what the optimizer thinks the indexes look like:

Index size information before drop
==================================
INDEX_NAME             NUM_ROWS     BLOCKS LEAF_BLOCKS STATUS   ORP
-------------------- ---------- ---------- ----------- -------- ---
T1_ID                     54000        256         134 VALID    NO
T1_N1                     54000        256         128 VALID    NO


Index size information after drop
=================================
INDEX_NAME             NUM_ROWS     BLOCKS LEAF_BLOCKS STATUS   ORP
-------------------- ---------- ---------- ----------- -------- ---
T1_ID                     27000        256          68 VALID    YES
T1_N1                     27000        256         128 VALID    YES

Both indexes are valid (which is good for the application) and their segment sizing has not changed. The number of rows has halved in both indexes but the number of (populated) leaf blocks has remained unchanged in one index even though it has halved for the other.

If you dumped a few index leaf blocks the explanation for the changes (and the difference in the changes) would become clear. The number of (non-deleted) index entries in the two indexes is the same, but Oracle is (almost literally) ignoring half of them – the ones that include the object_ids for the original first three table partitions.

The t1_id index is on the (sequential) id and the table is partitioned by id, and we have dropped the partitions that hold (nothing but) the ids less than 27,000 (in earlier versions of Oracle this would have immediately deleted all the index entries from the first half of the index, leaving all the leaf blocks in the 2nd half of the index full) and although the index entries are still in those blocks Oracle is behaving as if they don’t exist, which means it treats the blocks as empty when calculating the leaf_blocks statistic. The t1_n1 index is on integer values from 0 to 599 randomly distributed across the full range of ids, so by dropping the partitions for ids less than 27,000 we (ought to) have deleted the first half of the index entries for n1 = 0, the first half for n1 = 1 and so on – leaving every index leaf block approximately half empty and still available for inclusion in the leaf_blocks count.

How, then, does Oracle manage to “ignore” the rows that would have been deleted in older versions. We can always enable SQL tracing when gathering stats, run tkprof against the trace file, and look for the SQL that Oracle used – and if that doesn’t reveal all, use the sql_id of the relevant statements to pull their plans from memory. Here’s the query (reformatted) and plan for one of the index stats gathering queries that I pulled from memory after finding it and its sql_id in the tkprof output:

SQL> select * from table(dbms_xplan.display_cursor('gtnd3aphdkp3k'));

SQL_ID  gtnd3aphdkp3k, child number 0
-------------------------------------
select /*+ 
                opt_param('_optimizer_use_auto_indexes' 'on')
                no_parallel_index(t, "T1_ID")  
                dbms_stats  cursor_sharing_exact  use_weak_name_resl 
                dynamic_sampling(0)  no_monitoring  xmlindex_sel_idx_tbl 
                opt_param('optimizer_inmemory_aware' 'false')
                no_substrb_pad  no_expand index(t, "T1_ID") 
        */ 
        count(*)                                          as nrw,
        count(distinct sys_op_lbid(130418, 'L', t.rowid)) as nlb,
        null                                              as ndk,
        sys_op_countchg(substrb(t.rowid, 1, 15), 1)       as clf 
from
        "TEST_USER"."T1" t where "ID" is not null

Plan hash value: 4265068335

--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |       |       |   136 (100)|          |
|   1 |  SORT GROUP BY   |       |     1 |    17 |            |          |
|*  2 |   INDEX FULL SCAN| T1_ID | 27000 |   448K|   136   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T".ROWID)=1 AND "ID" IS NOT NULL))

If you’ve read Cost Based Oracle – Fundamentals you’ll recognise the SQL is typical of the pattern Oracle uses to gather stats on an index, with a couple of sys_op() function calls that dissect rowids to allow Oracle to calculate the number of leaf_blocks in, and clustering_factor of, the index. What’s new, though is the filter() in the Predicate Information that (presumably) is checking that the rowid belongs to a table partition that still exists. (In other circumstances the “<?>” would be the table-name. The value for 8 as the third parameter also appears in queries involving table-expansion with partial indexing.)

Unsurprisingly, if you execute a simple query driven through one of the indexes after dropping partitions you’ll see exactly the same filter() predicate generated for the execution plan for the range scan operation e.g:

SQL_ID  822pfkz83jzhz, child number 0
-------------------------------------
select  /*+ index(t1(n1)) */  id from t1 where n1 = 300

Plan hash value: 2152633691

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    44 (100)|     40 |00:00:00.01 |      41 |
|   1 |  TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| T1    |      1 |     45 |    44   (0)|     40 |00:00:00.01 |      41 |
|*  2 |   INDEX RANGE SCAN                         | T1_N1 |      1 |     45 |     1   (0)|     40 |00:00:00.01 |       3 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N1"=300)
       filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)

The tbl$or$idx$part$num() function is an important item to consider at this point. How much impact will it have on your processing – it’s hard to give a generic answer since it may depend on exactly what your data looks like and whether or not the function can cache its result effectively. It’s also possible that the performance of the function is related to either the number of partitions dropped or the number of partitions still in existence – so that’s a detail that probably has to be tested at the correct scale before you go into production

More significantly, perhaps, is how long that impact is going to be relevant and what savings it has to be balanced against. My thought at this point is that if you drop a partition but don’t clean up the index you reduce your workload by not visiting (possibly cached) table blocks, but you pay for the benefit by calling the function for every index entry you do visit (whether or not it would have required you to visit the table prior to dropping the partition). Maybe it’s okay to leave the index uncleaned for a few days or even a few weeks before you take any steps to clean up the mess; if that’s the case then maybe you can spread a relatively large number of clean-up jobs over a long enough period of time that their impact doesn’t become visible to the users.

Tests and Results

Test number 1:

What does pmo_deferred_gidx_maint_job do?

I had to connect as a dba and grant alter on this job to my normal test user to be able to do the following test. The default role for my test user had also been granted create job, so that might have been needed as well; and you’ll see that I’ve included a number of my typical “v$ snapshot” procedures to measure different aspects of the workload, and I’ve enabled the 10046 trace to see what Oracle does behind the scenes.

alter system flush buffer_cache;

execute snap_enqueues.start_snap
execute snap_events.start_snap
execute snap_my_stats.start_snap
execute snap_redo.start_snap
execute snap_rollstats.start_snap

alter session set events '10046 trace name context forever, level 8';

execute dbms_scheduler.run_job('SYS.PMO_DEFERRED_GIDX_MAINT_JOB',true)

alter session set events '10046 trace name context off';

execute snap_rollstats.end_snap
execute snap_redo.end_snap
execute snap_my_stats.end_snap
execute snap_events.end_snap
execute snap_enqueues.end_snap

The most significant discovery in this test was that the package (ultimately) executed two SQL statements:

ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP PARALLEL 1
ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP PARALLEL 1

So really all I needed to do from that point onwards was to worry about investigating variations of the “alter index coalesce” command.

Test number 2:

What does the procedure dbms_part.cleanup_gidx() do?

This procedure takes a schema name and table name as its first inputs with defaults null and has two other parameters called parallel (default 0) and options (default ‘CLEANUP_ORPHANS’); the only other value you can supply for options is ‘COALESCE’. Again the 10046 trace of my wrapper was very useful, as this showed the following SQL when I specified just the schema and table name for the call:

        ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP ONLY
        ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP ONLY

The 19c manual does mention the “ONLY” keyword, but doesn’t explain its significance (but I will in a moment). If I re-ran the test with options set to ‘COALESCE’ the SQL statements changed to:

        ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP
        ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP

This did more work than the “cleanup only” run (figures in a moment). When I re-ran the tests setting the parallel option to 2 the same SQL statement appeared with “PARALLEL 2”.

So here are some of the most important numbers for the calls to dbms_part.cleanup_gidx(). They are the headline undo and redo figures from the session statistics:

Default: (cleanup_orphans)
==========================
Name                                       Value
----                                       -----
redo entries                              54,283
redo size                             12,412,092
undo change vector size                4,762,432

options=>"Coalesce"
====================
Name                                       Value
----                                       -----
redo entries                              57,427
redo size                             15,518,660
undo change vector size                6,466,264

As you consider those figures, let me remind you that the indexes started out at roughly 1MB each, and we dropped 27,000 rows in three partitions. A quick check of the “cleanup orphans” arithmetic and you can see (with rounding) 54,000 redo entries = (two indexes * 27,000 rows each), and 12.4M redo size / 54K redo entries = 230 bytes per index entry. I’ve highlighted that last result because that’s a number you can use as a baseline to estimate the redo that will be generated by cleaning up global indexes. How many rows will be dropped from the table, how many global indexes have you got on the table – multiply the two together and multiply by 230.

Of course, that’s just for “cleanup only” and it assumes that every row appears in every index (which, in a well engineered system, probably won’t be true). Where does the extra 3MB of redo come from? Let’s drop down one more level in the processing and run explicit “alter index” statements through the test harness.

Test number 3:

What does alter index xxx coalesce cleanup [only] do?

Here are the redo and undo summaries fron two sets of tests – “coalesce cleanup only”, and “coalesce cleanup”

Index t1_n1                         Cleanup only            Cleanup
-----------                         -------------------------------
redo entries                              27,139             28,176
redo size                              6,209,140          8,930,900
undo change vector size                2,382,564          3,999,900

Index t1_id                         Cleanup only            Cleanup
-----------                         -------------------------------
redo entries                              27,144             28,966
redo size                              6,202,436          6,547,460
undo change vector size                2,379,868          2,461,908

For “coalesce cleanup only” the workload for the two indexes is (effectively) identical – it’s basically the undo and redo from marking 27,000 index entries as deleted and doing nothing else. The blocks have not been cleaned up in any way; that task will be left to future sessions that need to insert entries into a leaf block and find that it is full but has lots of space that can be reclaimed from deleted entries.

When we use the “coalesce cleanup” (i.e. without “only”) Oracle does some extra work, but the amount of work varies significantly depending on the nature of the index: t1_n1 generates an extra 2.7MB of redo, index t1_id generates only another 345KB. That may be a little surprising, but we’ve already had a clue that something like this might happen, and since every other strategy for “cleaning” the indexes comes down to running these variations of the coalesce command we should look a little further into what they do and how they work.

To get a complete picture we’ll have to do some work with the dbms_space package, the index treedump command, dumping redo, and we also ought to take a look at v$rollstat and v$enqueue_stat, but we’ll pursue those tasks in the Deep Dive section.

Test number 4:

What does alter index rebuild online do?

There’s a very important point to check in this test – if your database is in noarchivelog mode the rebuild will be nologging. and you’ll be fooled by the apparent efficiency of the mechanism right up to the point where you go to production and find that you’re generating a huge amount of redo. For the record, my indexes were roughly 64 blocks (512KB) each when rebuilt and produced the following redo figures (and virtually no undo):

Index t1_n1
-----------
redo entries                                 343
redo size                                594,984
redo size for direct writes              527,148
undo change vector size                   18,528
sorts (rows)                              27,058

Index t1_id
-----------
redo entries                                 345
redo size                                625,956
redo size for direct writes              560,092
undo change vector size                   18,096
sorts (rows)                              27,022

I’ve included the sorts statistic as a reminder that there are other (potentially nasty) overheads to consider. And when you do an online rebuild Oracle will have to lock the table briefly create a journal table (effectively a materialized view log) to capture the changes that go on while the rebuild is running then apply them when the rebuild is nearly complete, and the rebuild has to be based on a tablescan and sort.

Depending on what fraction of the partitions you are dropping, though, this does look like a very promising option – especially when you have to cater for the problem of shifting the redo to a remote site.

Deep Dive

We have seen how much redo was generated for both “coalesce cleanup” and “coalesce cleanup only” and have an idea that we know what’s happening, so we will be taking a look at some redo dumps to see if that confirms our suspicion. Before we get to that extreme, though, it’s worth taking a couple of simpler steps.

The dbms_space package

We can use the dbms_space.space_usage() procedure to see the state of blocks in the two index segments before and after the attempts to cleanup. It’s important to remember that for index segments the procedure uses the “FS2” state to report blocks that are “free”, i.e. formatted but contain no data, aren’t in the index structure and can be linked in to the index when an existing block needs to be split.

Here are three sets of results from calls to the procedure showing the state immediately after (and before) the partitions were dropped, then after a “coalesce cleanup only test, and then after a “coalesce cleanup” test.

Index T1_N1                   Blocks       Bytes    |  After "Cleanup Only" |   After "Cleanup" 
----------------------------------------------------|-----------------------|-------------------
Unformatted                   :    0 /         0    |      0 /         0    |      0 /         0
Freespace 1 (  0 -  25% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 2 ( 25 -  50% free) :    1 /     8,192    |      1 /     8,192    |     67 /   548,864
Freespace 3 ( 50 -  75% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 4 ( 75 - 100% free) :    0 /         0    |      0 /         0    |      0 /         0
Full                          :  134 / 1,097,728    |    134 / 1,097,728    |     68 /   557,056


Index T1_ID                   Blocks       Bytes    |  After "Cleanup Only" |  After "Cleanup"
----------------------------------------------------|-----------------------|-------------------
Unformatted                   :    0 /         0    |      0 /         0    |      0 /         0
Freespace 1 (  0 -  25% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 2 ( 25 -  50% free) :    1 /     8,192    |      1 /     8,192    |     65 /   532,480
Freespace 3 ( 50 -  75% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 4 ( 75 - 100% free) :    0 /         0    |      0 /         0    |      0 /         0
Full                          :  128 / 1,048,576    |    128 / 1,048,576    |     64 /   524,288


There are two key points in these figures:

  • cleanup only doesn’t change the state of any space usage information, and any leaf blocks that are “empty” are still linke into the index structure.
  • cleanup doesn’t do anything to release space back to the tablespace; it simply compacts the data into a smaller number of blocks and tags empty blocks as “free” (and takes them out of the index structure – though that’s not “intuitively” obvious from the figures unless you know what FS2 means for indexes).

Index treedumps

If we want to find out how many of the FS2 blocks are in the index structure and how many have been taken out of the tree then we’ll have to do a tree dump – get the object_id of the index and issue:

alter session set events 'immediate trace name treedump level {object id}';

Here are the first few lines of the tree dump for the t1_n1 index in the same three states: after the drop, after cleanup only, and after cleanup:

Immediately after drop
branch: 0x9000684 150996612 (0: nrow: 128, level: 1)
   leaf: 0x9000685 150996613 (-1: row:449.449 avs:8)
   leaf: 0x9000686 150996614 (0: row:444.444 avs:4)
   leaf: 0x9000687 150996615 (1: row:444.444 avs:4)
   leaf: 0x9000688 150996616 (2: row:444.444 avs:4)
   leaf: 0x9000689 150996617 (3: row:444.444 avs:4)
   leaf: 0x900068a 150996618 (4: row:444.444 avs:4)
   leaf: 0x900068b 150996619 (5: row:444.444 avs:4)

After "Cleanup only"
branch: 0x9000684 150996612 (0: nrow: 128, level: 1)
   leaf: 0x9000685 150996613 (-1: row:449.214 avs:8)
   leaf: 0x9000686 150996614 (0: row:444.216 avs:4)
   leaf: 0x9000687 150996615 (1: row:444.216 avs:4)
   leaf: 0x9000688 150996616 (2: row:444.226 avs:4)
   leaf: 0x9000689 150996617 (3: row:444.216 avs:4)
   leaf: 0x900068a 150996618 (4: row:444.206 avs:4)
   leaf: 0x900068b 150996619 (5: row:444.231 avs:4)
   leaf: 0x900068c 150996620 (6: row:444.201 avs:4)
 
After "Cleanup"
branch: 0x9000684 150996612 (0: nrow: 64, level: 1)
   leaf: 0x9000685 150996613 (-1: row:446.446 avs:16)
   leaf: 0x9000688 150996616 (0: row:444.444 avs:4)
   leaf: 0x900068a 150996618 (1: row:444.444 avs:4)
   leaf: 0x900068c 150996620 (2: row:444.444 avs:4)
   leaf: 0x900068e 150996622 (3: row:444.444 avs:4)
   leaf: 0x9000690 150996624 (4: row:444.444 avs:4)
   leaf: 0x9000692 150996626 (5: row:444.444 avs:4)
   leaf: 0x9000694 150996628 (6: row:444.444 avs:4)
   leaf: 0x9000696 150996630 (7: row:444.444 avs:4)
   leaf: 0x9000698 150996632 (8: row:444.444 avs:4)
   leaf: 0x900069a 150996634 (9: row:421.421 avs:12)

This is the index where the values 0 to 599 have been spread randomly across the 54,000 different id values, so each n1 value appears in roughly 90 rows – of which we’ve deleted about 45 by dropping the bottom 3 partitions of 6.

For leaf blocks “row:X,Y” means there are X rows in the block directory of which Y would be visible if you ignored the ones marked as committed deletes, “avs:N” shows N bytes of available space in the block.

The t1_n1 index can fit 444 index entries per leaf block and we can see that immediately after the “drop partition” none of them has been marked as deleted. After a “cleanup only” however we can see that (as expected) roughly half the rows in every leaf block have been marked as deleted with half remaining. After a “cleanup” we can see that we’re back to 444 rows per leaf block with no deletions and virtually no freespace.,

Notice, however, the way that the leaf block addresses have changed during the cleanup. If we examine just the last 3 digits of the decimal version of the leaf block addresses we start with:

613, 614, 615, 616, 617, 618, 618, 619, 620

but we end with:

613, 616, 618, 620 ..

Effectively, Oracle has “copied back” all the index entries from block 614 and some from block 615 to block 613, detaching the now-empty block 614 from the index structure, then it has copied the remaining row from block 615 to block 616 and copied back some rows from 617 to block 616, detaching the now-empty block 615. (It’s not likely that Oracle thinks in terms of “copying forward/back”, it’s more likely that Oracle simply reads through the index in order constructing new leaf blocks in private memory and has a simple algorithm for deciding which block to replace – and that algorithm might be something we can infer by looking at the redo dump.)

If we now examine the three tree dumps from index t1_id we can see why the volume of redo generated by the two indexes differs on the final “cleanup” phase of the code.

Immediately after drop
branch: 0x9000784 150996868 (0: nrow: 134, level: 1)
   leaf: 0x9000785 150996869 (-1: row:426.426 avs:7)
   leaf: 0x9000786 150996870 (0: row:421.421 avs:1)
   leaf: 0x9000787 150996871 (1: row:421.421 avs:1)
   leaf: 0x9000788 150996872 (2: row:421.421 avs:1)
   leaf: 0x9000789 150996873 (3: row:421.421 avs:2)
   leaf: 0x900078a 150996874 (4: row:421.421 avs:1)
   leaf: 0x900078b 150996875 (5: row:421.421 avs:1)
...
   leaf: 0x90007c6 150996934 (64: row:400.400 avs:0)
   leaf: 0x90007c7 150996935 (65: row:400.400 avs:0)
   leaf: 0x90007c8 150996936 (66: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (67: row:400.400 avs:0)
   leaf: 0x90007ca 150996938 (68: row:400.400 avs:0)
   leaf: 0x90007cb 150996939 (69: row:400.400 avs:0)

After "Cleanup only"
branch: 0x9000784 150996868 (0: nrow: 134, level: 1)
   leaf: 0x9000785 150996869 (-1: row:426.0 avs:7)
   leaf: 0x9000786 150996870 (0: row:421.0 avs:1)
   leaf: 0x9000787 150996871 (1: row:421.0 avs:1)
   leaf: 0x9000788 150996872 (2: row:421.0 avs:1)
   leaf: 0x9000789 150996873 (3: row:421.0 avs:2)
   leaf: 0x900078a 150996874 (4: row:421.0 avs:1)
   leaf: 0x900078b 150996875 (5: row:421.0 avs:1)
...
   leaf: 0x90007c6 150996934 (64: row:400.0 avs:0)
   leaf: 0x90007c7 150996935 (65: row:400.303 avs:0)
   leaf: 0x90007c8 150996936 (66: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (67: row:400.400 avs:0)

After "Cleanup"
branch: 0x9000784 150996868 (0: nrow: 68, level: 1)
   leaf: 0x90007c7 150996935 (-1: row:303.303 avs:1940)
   leaf: 0x90007c8 150996936 (0: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (1: row:400.400 avs:0)
   leaf: 0x90007ca 150996938 (2: row:400.400 avs:0)
   leaf: 0x90007cb 150996939 (3: row:400.400 avs:0)
   leaf: 0x90007cc 150996940 (4: row:400.400 avs:0)
   leaf: 0x90007cd 150996941 (5: row:400.400 avs:0)

I’ve shown two sections of the treedump for the first two extracts, the start of the index and the start of the “2nd half” of the index where the id values are in the partitions that we kept. You’ll notice that the number of index entries per leaf block drops from 426 to 400 as we move through the index, that’s just the effect of a sequential id generally getting bigger (42 takes 2 bytes, 42,000 takes 3 bytes, 42,042 takes 4 bytes).

After “cleanup only” all the leaf blocks in the first half the the index show “no rows remaining of 4xx”, while all the leaf blocks in the 2nd half the index show “400 rows remaining of 400”. There is one special case – the leaf block numbered 65 at address 0x90007c7 – which shows “303 rows remaing of 400”. That must be the block that held the highest few rows from partition p27000.

After the final cleanup we can see that this “mid-point” leaf block has become the “low value” leaf block, and the rest of the index leaf blocks look as if they are completely unchanged. (I think we can assume that the “copy forward/back” code caters for a few boundary conditions that (e.g.) stop Oracle from doing something silly with leaf blocks that are already completely full.)

In this case, then, we can guess that Oracle has simply removed 66 leaf blocks (blocks “-1” to 64) from the index structure and reconnected block number 65 as the starting block. In the previous case the “cleanup” redo disconnected a simlar number of leaf blocks, but also rewrote all the blocks that were kept or emptied.

Dumping Redo

The information we have examined so far prompts us to ask two significant questions:

  • Moving from the “cleanup only” to the “cleanup” increased the redo by 345KB in the case of the sequential t1_id index, but by 2.7MB in the case of the “random arrival” t1_n1 index. In both cases the number of leaf blocks to be relinked is very similar, so the difference seems to be due to the work done in compacting roughly 128 leaf blocks down to 64 leaf blocks – how does that generate (2700KB – 345KB)/64 = 36KB per “new” block? We might also wonder why “just” relinking 66 blocks appears to take 345K/66 = 5K per block anyway – that seems a little surprising.
  • For the full cleanup operation does Oracle delete the entries from a few consecutive blocks, compact them and write them and then carry on with the next few blocks; or (less likely) does it walk the entire index deleting entries and then walk the index again to tidy up and compact (logically) adjacent blocks. If the latter then for a very large index (rather than my tiny test) that means we could see two consecutive full scans physically reading the whole index, and possibly generating more redo thanks to delayed block cleanout.

If we dump the redo log we can answer these questions by extracting some fairly simple details – although the resulting trace file is going to be rather large. For example we could code:

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter index t1_n1 coalesce cleanup;

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set tracefile_identifier='n1_index';
alter system dump redo scn min &start_scn scn max &end_scn /* layer 10 */;
alter session set tracefile_identiier='';


I’ve included a commented out “layer 10” in the dump command to show how you can be selective in what redo you dump. Layer 10 is the set of redo op codes for index-related change vectors. You will find other op codes (in particular 5.1) being dumped as well because Oracle dumps the whole of any redo record containing a change vector of the reqeusted type.

When I dumped the redo for a single index cleanup the trace file was about 45MB – so not something you would want to read in detail – but you could start with a few simple searches, for example:

grep -n "OP:10"       or19_ora_13388_n1_index.trc >temp_n1_op10.txt
grep -n "REDO RECORD" or19_ora_13388_n1_index.trc >temp_n1_record.txt

grep "OP:"  or19_ora_13388_n1_index.trc | sed "s/^.*OP:/OP:/" | sed "s/ .*$//" | sort | uniq -c | sort -n

The first grep example simply extracts all the index-related op-codes (with line numbers from the trace file to make it easier to spot patterns. The second grep does the same for the start of each redo record because those lines also report the length of each record, which may make it possible to find out more about the surprising amount of redo generated by the compacting and relinking.

The third example is just a bit of showing off: it extracts all the op-code lines, cuts the actual OP:nn.nn bit out, then sorts and counts the number of appearances of each op-code. Here’s the result of that last command from one of my tests:

      1 OP:13.24
      1 OP:17.28
      1 OP:24.1
      2 OP:11.5
      3 OP:10.11
      3 OP:11.3
     15 OP:10.14
     18 OP:14.4
     18 OP:22.5
     28 OP:10.7
     29 OP:5.11
     36 OP:22.2
     64 OP:13.22
     68 OP:10.34
     82 OP:10.39
    138 OP:10.8
    205 OP:5.4
    219 OP:10.6
    259 OP:4.1
    324 OP:10.5
    347 OP:5.6
    689 OP:5.2
  27327 OP:10.4
  27833 OP:5.1

It’s an obvious guess that the 27,000 OP:10.4 at the bottom are “delete index entry” and most of the OP:5.1 are their corresponding undo change vectors. The OP:10.4 count is a little high, but checking the session activity stats I found that they showed: rollback changes – undo records applied = 376, so some internal transaction rollback took place, and the session stats are a reasonable match for the “excess” OP:10.4, suggesting that at some point we saw a batch of deletes rollback and restart. (NOTE: to be investigated further; this suggests that the entire operation is executing as a number of relatively small transactions and could safely be interrupted – a hypothesis supported by the 205 OP:5.4 (commit/rollback change vectors)). The “excess” OP:10.4 are also closely matched by the OP:10.5 / OP:5.6 figures (restore leaf during rollback / mark user undo record applied)

Before chasing any other details let’s answer the question about whether the compacting takes place during the delete phase or starts after the deletes are all done. If we jump to the end of the “OP:10” file we can check to see if there are OP:10.4 all the way through the file with small patches of other layer 10 op codes stuff scattered throughout, or if the file is continuously doing OP:10.4 and nothing else until a load of other layer 10 op codes appear in the last couple of thousand lines. The answer is that we get regular repeats of a pattern like the following:

OP:10.4 x ~400
OP:10.6 
OP:10.6 
OP:10.6 
OP:10.39 
OP:10.8 
OP:10.34 
OP:10.8 

For example (after lots of 10.4 on blocks 0x9000695 and 0x9000696. we see:

119184:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119210:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000695 OBJ:131045 SCN:0x0000000002165407 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119236:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000694 OBJ:131045 SCN:0x0000000002165405 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119272:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000
119625:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000694 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
120499:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000695 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.34 ENC:0 RBL:0 FLG:0x0000
120862:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000

This translates to:

10.6:  lock block 0x09000696
10.6:  lock block 0x09000695
10.6:  lock block 0x09000694
10.39: update branch block 0x09000684
10.8:  new block 0x09000694
10.34: empty block 0x09000695
10.8:  new block 0x09000696

The pattern then repeats starting with deletes from block 0x09000697 and 0x09000698, and then producing an interesting detail:

DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.6 
DBA:0x09000697 OBJ:131045 SCN:0x000000000216540d SEQ:1 OP:10.6 
DBA:0x09000696 OBJ:131045 SCN:0x000000000216540b SEQ:1 OP:10.6 
DBA:0x09000684 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.39 
DBA:0x09000696 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.8 
DBA:0x09000697 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.34 
DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:2 OP:10.8 

Check line 5 above – it’s another OP:10.8 creating another version of block 0x09000696. And that’s one of the reasons why the compaction creates far more redo than expected. Oracle may recreate a block several times before the block gets to its final compacted state.

Looking at the detail of the trace file – in particular the pattern of deletes followed by “new block” – it looks as if Oracle deletes all the rows from just two adjacent blocks (perhaps to minimise block-level locking) and then does the best it can with the rows that are left and this may mean (as it does in our fragment) writing one full block and one partial block. For a sparsely populated index it might mean writing just a single partial block, possibly repeating the process for several cycles.

To show the total effect on redo generation I’ve extracted the redo records for a complete cycle of the pattern (excluding several hundred deletes, which generate 228 bytes each). I’ve used egrep to pick out 3 patterns “OP:”, “REDO RECORD” and (using hindsight) “new block has”:

132650:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e90.0134 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
132652:CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540e SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

132656:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e90.018c LEN: 0x0164 VLD: 0x01 CON_UID: 3792595
132658:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x00000000021653f8 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
132661:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132682:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132691:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e91.0100 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
132693:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:2 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132708:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000697 OBJ:131045 SCN:0x000000000216540d SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132717:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e91.01e4 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
132719:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:3 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132734:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x000000000216540b SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132743:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e92.00d8 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
132745:CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x000000000216540a SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

132749:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e92.0130 LEN: 0x0128 VLD: 0x01 CON_UID: 3792595
132751:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:4 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132770:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000

132779:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e93.0068 LEN: 0x4038 VLD: 0x01 CON_UID: 3792595
132781:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
132784:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe0 OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
133123:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
133139:new block has 444 rows

133653:REDO RECORD - Thread:1 RBA: 0x0002c8.00046eb5.0010 LEN: 0x20e8 VLD: 0x05 CON_UID: 3792595
133656:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x000000000216540f SEQ:2 OP:5.2 ENC:0 RBL:0 FLG:0x0000
133659:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe1 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
133998:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000697 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.34 ENC:0 RBL:0 FLG:0x0000

134007:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ec6.0010 LEN: 0x0064 VLD: 0x01 CON_UID: 3792595
134009:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000680 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000

134017:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ec6.0074 LEN: 0x3a74 VLD: 0x01 CON_UID: 3792595
134019:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
134022:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe2 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
134361:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000
134377:new block has 362 rows

134799:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ee4.00c8 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
134801:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x0000000002165410 SEQ:2 OP:5.4 ENC:0 RBL:0 FLG:0x0000

This shows a complete transaction (the previous redo record is the commit (OP:5.4) for the few hundred deletes. I’ve highlighted five lines in the output – the three redo record headers that report a large value for their LEN and two lines that show you how many rows are in the “new” blocks generated by the op codes OP:10.8

  • Line 24 – LEN = 0x4038 = 16,440 bytes. That’s basically the size of two blocks: the OP:5.1 is the old image (8KB) of block 0x9000696, the OP:10.8 is the new image (also 8KB – since its full) of the block
  • Line 30 – LEN = 0x208 = 8,424. That’s basically one block image. The OP:10.34 is “clear block 0x09000698” which requires very little redo, but the OP:5.1 is the old image (8KB) of the block.
  • Line 38 – LEN = 0x3a74 = 14,964 bytes. Again that’s basically the size of two blocks: the OP:5.1 is the old image (8KB) of block 0x9000698, the OP:10.8 is the new image of the block, but it has only 362 rows (of a final 444 rows) in it, so it’s image dump can be restricted to about two pieces totalling 6.5KB

All the other bits add up to about 1,200 bytes of redo, so in each cycle the compacting activity generates about 40KB of redo in total. Since we end up with 68 filled blocks the total “extra” redo we got as we switched from “coalesce compress only” to “coalesce compress” should be around 68 * 40KB = 2.65 MB, which is pretty close to what we atually saw for the t1_n1 index.

Transactions and Locking

I commented on the presence of the session statistics reporting “rollback changes – undo records applied”, and mentioned the presence of the OP:5.4 records in the redo. These are all pointers to the coalesce command being operated as a series of smaller transactions rather than one large transaction, and even before I had started to dump and examine the redo I had added v$rollstat and v$enqueuestat to my usual snapshot of dynamic performance views. Here are the results from a typical “coalesce cleanup” test for the t1_n1 index.

---------------------------------
Rollback stats
---------------------------------
USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        2      0   0    0     0      0
   1   0      0      0      0      382184       85      0   0    0     0      0
   2   0      0      0      0      412930       91      0   0    0     0     61
   3   0      0      0      0      404260       90      0   0    0     0      0
   4   0      0      0      0      413412       91      0   0    0     0   -106
   5 -13  -1792      0      0      341294       99      0   2    0    24    -71
   6   0      0      0      0      355986       81      0   0    0     0      0
   7   0      0      0      0      444054       97      0   0    0     0      0
   8 -12  -1728      0      0      418474      114      1   2    0    24    -45
   9   0      0      0      0      422776       92      0   0    0     0     38
  10   0      0      0      0      426976       94      0   0    0     0      0


----------------------------------
System enqueues
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
KI             3           0           3           0           0 contention
CR           250          24         250           0          12 block range reuse ckpt
IS            25           0          25           0           0 contention
TM             8           0           8           0           0 contention
TA             2           0           2           0           0 contention
TX           198           0         198           0           0 contention
US             8           0           8           0           0 contention
HW             8           0           8           0           0 contention
TT             4           0           4           0           0 contention
SJ             2           0           2           0           0 Slave Task Cancel
CU             2           0           2           0           0 contention
OD             1           0           1           0           0 Serializing DDLs

-------------------------------------
System REDO stats
-------------------------------------
Name                                       Value
----                                       -----
redo entries                              27,977
redo size                              8,853,604
undo change vector size                3,967,100

I have 10 undo segments online, and each has received about 400KB of undo – which comes to a total of 4MB – which is a very good match to the “undo change vector size” reported for the session. (I didn’t get any system generated rollbacks in the run.)

We can also see 198 TX enqueues – which is a very good match for the 196 OP:5.4 that I found in the sessions redo dump on this test.

As I commented in the section on redo generation, to handle the “coalesce cleanup” the session walks the index in leaf block order and (with variations dependent on situations like finding leaf blocks with nothing to delete, or finding leaf blocks that become empty after all the deletes have been done)

  • locks two logically adjacent leaf blocks
  • deletes the dropped rows
  • commits
  • locks the leaf blocks agin – sometimes with a third
  • packs the outstanding rows downwards
  • relinks leaf blocks and adjusts branch blocks as necessary
  • commits
  • repeats until end of index

(Note – “cleanup only” seems to lock, delete rows and commit each block separately, not in pairs)

This strategy has two side-effects. First, though I don’t think it’s documented, it looks as if you could simply kill the process if it started putting too much stress on your system. All that could happen is that the current (small) transaction would be rolled back, but the rest of the work that had been done up to that moment would persist. I have actually tested this, managing to kill sessions while they were in the middle of delete a batch of rows (though I’ve never managed to catch an example where a session was in the middle of relinking). After I repeated the coalesce command Oracle simply picked up where it had left off after the previous (small) rollback.

The second side effect is another possible overhead. Partitioned tables tend to be big, and each index clean-up is likely to be a fairly big job generating a lot of redo and undo. How much impact could the undo have on your system? This depends in part on what your undo retention looks like, how long the clean-up takes, and the risk of other sessions running into ORA-01555 errors. In particular there’s an odd problem of undo segment extension causing updates to restart – if an update statement (or delete, or merge with update clause) results in an undo segment extending then the statement will rollback and restart using a different locking strategy.

I don’t know if something of this sort would happen with the deletes from an index coalesce, and the deletes are very small anyway so it might not matter anyway, but the coalesce executes a large number of transactions in a relatively short time period rotating around and gradually filling every available undo segment – which means other large updates are more likely to need to extend an undo segment. And if segment extension does start to become a problem it might happen many times to the coalesce because each individual delete phase is a new transaction that could (in theory) rollback and repeat. So you should be monitoring the session / system activity stats for unusually large number for “rollback changes – undo records applied”.

It’s also worth noting that overlapping jobs will sometimes need to do a lot of work to check read (and write) consistency, and to find “upper bound commit” SCNs: if you have a process executing a large number of transactions in a short period of time it can be very expensive for another process to find an “upper bound commit” and you may see it doing a lot of reads of undo segments, reporting a number of “transaction tables consistent read rollbacks” and a large number of “transaction tables consistent reads – undo records applied” in the session stats. (Worst case scenario – the number of transactions could be similar to the number of (used) leaf blocks in the index).

One final point to consider – the report from the (system level) Enqueue Stats showed a handful of TM locks, so I enabled lock tracing for the session to see if any of them came from my session and whether they were likely to be a concurrency threat.

alter session set events 'trace[ksq] disk=medium';

Most of the TM enqueues were from my session, but they appeared only after the coalesce was complete, and they were all related to sys-owned (dictionary) tables:

2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000140-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000061-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000049-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000004-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.903*:ksq.c@9175:ksqgtlctx(): *** TM-0000004B-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.903*:ksq.c@9175:ksqgtlctx(): *** TM-00000013-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***

In case you’re wondering, the first TM lock (object 0x140) is the index_orphaned_entry$ table, and Oracle had to lock it at the end of the “coalesce clean up” to delete the rows that associated the three dropped table partitions with the one global index that I had cleaned.

Summary

Deferred global index maintence means you can drop partitions very quickly with virtually no interruption to service and then have Oracle clean up the related index entries at a later point in time. The drawback to deferring the cleanup is that Oracle will (as a minimum) use a row-by-row mechanism to mark all the index entries for the dropped data as deleted – at a cost of a full index scan and about 230 bytes of redo per dropped row per index – compared to a bulk-processing mechanism that can be applied if the index entries are dropped as part of the partition drop processing.

There are basically two mechanisms you can use to clean up the entries that are waiting cleaning:

  • rebuild the index (online, probably)
  • use one of the special coalesce calls on the index

You can best spread the workload and minimise interference with your normal work by micro-managing the clean up, explicitly issuing whichever “alter index coalesce” or “alter index rebuild” command you prefer for each index in turn that needs to be cleaned up.

None of the coalesce command variants returns space to the tablespace; nor do they even drop the highwater mark on the index segment. If you want to return space to the tablespace you will need to excute a “shrink space” on the index after the coalesce is complete. All the coalesce options generate a very large amount of redo (230 bytes per index entry for each delete plus – to compact the remaining rows into the smallest number of blocks – a volume that may be several times larger than the actual volume of data in the index).

The rebuild (online) will generate a much smaller volume of redo – but the penalties include the problem of journalling and applying the changes that took place as the rebuild was running; plus the cost of scanning the table and sorting the data to produce the index.

Footnote

My tests basically cover the worst case scenario (every leaf block has some entries to be deleted) and best case scenario (leaf blocks will either become completely empty on deletion, or will have no rows deleted).

If you do want to enable “instant” index maintenance (i.e. disable deferred maintanance) for a session you could execute:

alter session set "_fast_index_maintenance" = false;

The usual warning about not messing with hidden parameters until you’ve confirmed with Oracle support applies, of course.

7 Comments »

  1. […] Drop Partition (Aug 2022): the mechanics and cost of deferred global index maintenance […]

    Pingback by Indexing Catalogue | Oracle Scratchpad — August 6, 2022 @ 5:14 pm BST Aug 6,2022 | Reply

  2. […] Drop Partition (Aug 2022): the mechanics and cost of deferred global index maintenance […]

    Pingback by Infrastructure Catalogue | Oracle Scratchpad — August 6, 2022 @ 5:15 pm BST Aug 6,2022 | Reply

  3. […] Drop Partition (Aug 2022): the mechanics and cost of deferred global index maintenance […]

    Pingback by Partitioning Catalogue | Oracle Scratchpad — August 6, 2022 @ 5:16 pm BST Aug 6,2022 | Reply

  4. Hi Jonathan,

    The continued presence of the “dropped” index entries won’t make your application run more slowly, it’s just that you won’t have achieved the (possible) benefit of a smaller index that might allow the application to run a little faster.

    There are some scenarios when the “dropped” index entries do cause changes in execution plans, resulting in adverse performance effects:
    1. my post from 2016: https://mvelikikh.blogspot.com/2016/05/asynchronous-global-index-maintenance.html
    2. Christian Antognini’s post from 2019: https://antognini.ch/2019/06/min-max-optimization-and-asynchronous-global-index-maintenance/

    Basically, it is the same case when Oracle cannot use the MIN/MAX optimization in the presence of orphaned index entries.

    Thanks,
    Mikhail

    Comment by Mikhail Velikikh — August 9, 2022 @ 2:00 pm BST Aug 9,2022 | Reply

    • Mikhail,
      Thanks for the comment – excellent observation. There are always special cases and boundary conditions that produce unexpected effects and I have to say that I have come across other cases with partitioned tables where Oracle behaves as if the only way to get at a rowid is by visiting the table, so an index-only path disappears when you partition the table and give it local indexes.

      I haven’t looked at Christian’s example yet, but I note that you’ve tagged the article with 12.1.0.2 – have you tested with a more recent version of Oracle. With my 19.11.0.0 I do get an index full scan (min/max) select select min() or max().

      After dropping the three table partitions:

      
      SQL> select max(n1) from t1 where n1 is not null;
      
         MAX(N1)
      ----------
             599
      
      SQL> select * from table(dbms_xplan.display_cursor);
      
      PLAN_TABLE_OUTPUT
      --------------------------------------------------------------------------------------
      SQL_ID  0bnygtfhkagr5, child number 0
      -------------------------------------
      select max(n1) from t1 where n1 is not null
      
      Plan hash value: 3922172223
      
      -------------------------------------------------------------------------------------
      | Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
      -------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT            |       |       |       |     2 (100)|          |
      |   1 |  SORT AGGREGATE             |       |     1 |     4 |            |          |
      |   2 |   FIRST ROW                 |       |     1 |     4 |     2   (0)| 00:00:01 |
      |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_N1 |     1 |     4 |     2   (0)| 00:00:01 |
      -------------------------------------------------------------------------------------
      
      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1 AND "N1" IS NOT NULL))
      
      
      SQL> select max(id) from t1 where id is not null;
      
         MAX(ID)
      ----------
           53999
      
      SQL> select * from table(dbms_xplan.display_cursor);
      
      PLAN_TABLE_OUTPUT
      --------------------------------------------------------------------------------------
      SQL_ID  0myp2d10ygh2b, child number 0
      -------------------------------------
      select max(id) from t1 where id is not null
      
      Plan hash value: 151196066
      
      -------------------------------------------------------------------------------------
      | Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
      -------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT            |       |       |       |     2 (100)|          |
      |   1 |  SORT AGGREGATE             |       |     1 |     5 |            |          |
      |   2 |   FIRST ROW                 |       |     1 |     5 |     2   (0)| 00:00:01 |
      |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_ID |     1 |     5 |     2   (0)| 00:00:01 |
      -------------------------------------------------------------------------------------
      
      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1 AND "ID" IS NOT NULL))
      
      
      SQL> select min(n1) from t1 where n1 is not null;
      
         MIN(N1)
      ----------
               0
      
      SQL> select * from table(dbms_xplan.display_cursor);
      
      PLAN_TABLE_OUTPUT
      --------------------------------------------------------------------------------------
      SQL_ID  6yh0280d050jr, child number 0
      -------------------------------------
      select min(n1) from t1 where n1 is not null
      
      Plan hash value: 3922172223
      
      -------------------------------------------------------------------------------------
      | Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
      -------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT            |       |       |       |     2 (100)|          |
      |   1 |  SORT AGGREGATE             |       |     1 |     4 |            |          |
      |   2 |   FIRST ROW                 |       |     1 |     4 |     2   (0)| 00:00:01 |
      |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_N1 |     1 |     4 |     2   (0)| 00:00:01 |
      -------------------------------------------------------------------------------------
      
      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1 AND "N1" IS NOT NULL))
      
      
      SQL> select min(id) from t1 where id is not null;
      
         MIN(ID)
      ----------
           27000
      
      SQL> select * from table(dbms_xplan.display_cursor);
      
      PLAN_TABLE_OUTPUT
      --------------------------------------------------------------------------------------
      SQL_ID  0cv7vq5smh39u, child number 0
      -------------------------------------
      select min(id) from t1 where id is not null
      
      Plan hash value: 151196066
      
      -------------------------------------------------------------------------------------
      | Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
      -------------------------------------------------------------------------------------
      |   0 | SELECT STATEMENT            |       |       |       |     2 (100)|          |
      |   1 |  SORT AGGREGATE             |       |     1 |     5 |            |          |
      |   2 |   FIRST ROW                 |       |     1 |     5 |     2   (0)| 00:00:01 |
      |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_ID |     1 |     5 |     2   (0)| 00:00:01 |
      -------------------------------------------------------------------------------------
      
      Predicate Information (identified by operation id):
      ---------------------------------------------------
         3 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1 AND "ID" IS NOT NULL))
      
      
      

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — August 9, 2022 @ 2:41 pm BST Aug 9,2022 | Reply

    • Mikhail,

      I’ve now had a look at Chris’ example, and since I could do a simple cut-n-paste ran it on my 19.11.0.0 and got the same result, i.e. index fast full scan. The biggest difference between your example and his, of course, was that his index was globally hash partitioned.

      However, when I did my first test I simply stopped my original test after dropping three partitions and then tested a min/max query – and through habit included the predicate “column is not null” since that what I always expect to do when there’s no “not null” declaration. When I added the predicate to Chris’ query it went back to the index full scan (min/max) path.

      When I repeated my test without the “column is not null” the optimizer used the index fast full scan path; worse, though, was that when I declared the columns not null the plan was still an index fast full scan, and when I had a NOT NULL column and added the column is not null predicate the plan was STILL the index fast full scan.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — August 9, 2022 @ 3:06 pm BST Aug 9,2022 | Reply

      • Hi Jonathan,

        I have run some additional tests with 19.11. The script that I used is this:

        set echo on lines 110 serverout off pages 100
        
        select banner_full from v$version;
        
        alter session set statistics_level=all;
        
        drop table t1;
        
        create table t1 
        partition by range(id) (
                partition p09000 values less than ( 9000),
                partition p18000 values less than (18000),
                partition p27000 values less than (27000),
                partition p36000 values less than (36000),
                partition p45000 values less than (45000),
                partition p54000 values less than (54000)
        )
        as
        select
                rownum - 1                      id,
                trunc(dbms_random.value(0,600)) n1,
                rpad('x',100)                   padding
        from
                --all_objects ao
                xmltable('1 to 54000')
        where
                rownum <= 54000
        ;
         
        create index t1_n1 on t1(n1) pctfree 0;
        create index t1_id on t1(id) pctfree 0;
        
        create or replace function f_rowid(row_id rowid)
          return number
        is
        begin
          return 1;
        end;
        /
        
        create or replace function f_number(num number)
          return number
        is
        begin
          return 1;
        end;
        /
         
        select 
                index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
        from 
                user_indexes i, user_segments s 
        where 
                i.index_name = s.segment_name 
        and     i.table_name='T1' 
        and     partitioned = 'NO'
        ;
        
        begin
                dbms_stats.gather_table_stats(
                        ownname          => user,
                        tabname          =>'T1',
                        method_opt       => 'for all columns size 1'
                );
        end;
        /
        
        spo before
        
        pro ################################################################################ 
        pro # BEFORE                                                                       #
        pro ################################################################################ 
        
        alter session set events 'trace[sql_optimizer.*]' tracefile_identifier=before;
        select max(id) from t1;
        alter session set events 'trace[sql_optimizer.*] off';
        
        select max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where f_rowid(rowid)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where f_number(id)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where id>=40000;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select /*+ index_ffs(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1 where f_rowid(rowid)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1 where f_number(id)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1) opt_param('_fix_control' '20321661:0')*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ no_index_ffs(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        spo off
         
        alter table t1 drop partition p09000, p18000, p27000 update global indexes; 
         
        begin
                dbms_stats.gather_table_stats(
                        ownname          => user,
                        tabname          =>'T1',
                        method_opt       => 'for all columns size 1'
                );
        end;
        /
         
        select 
                index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
        from 
                user_indexes i, user_segments s 
        where 
                i.index_name = s.segment_name 
        and     i.table_name='T1' 
        and     partitioned = 'NO'
        ;
        
        spo after
        
        pro ################################################################################ 
        pro # AFTER                                                                        #
        pro ################################################################################ 
        
        alter session set events 'trace[sql_optimizer.*]' tracefile_identifier=after;
        select max(id) from t1;
        alter session set events 'trace[sql_optimizer.*] off';
        
        select max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where f_rowid(rowid)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where f_number(id)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select max(id) from t1 where id>=40000;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        select /*+ index_ffs(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1 where f_rowid(rowid)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1)*/max(id) from t1 where f_number(id)=1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ index(t1) opt_param('_fix_control' '20321661:0')*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        select /*+ no_index_ffs(t1)*/max(id) from t1;
        select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        spo off
        

        In a nutshell, it is your initial script in which I changed the following:
        1. used xmltable instead of all_objects (I do not have 54K rows there).
        2. added some SQL statements that I ran before and after dropping the partitions. I also spool the output to analyze it later on.
        3. The first query is run with SQL_Optimizer trace enabled.

        First, here is an anomaly that I discovered before dropping the partitions (it’s not related to the issue in question – just a result which I did not expect):

        SQL> select max(id) from t1;
        
           MAX(ID)                                                                                                    
        ----------                                                                                                    
             53999                                                                                                    
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        PLAN_TABLE_OUTPUT                                                                                             
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  9s014c27wfym3, child number 0                                                                         
        -------------------------------------                                                                         
        select max(id) from t1                                                                                        
                                                                                                                      
        Plan hash value: 2419761730                                                                                   
                                                                                                                      
        -----------------------------------------------------------------------------------------------------------   
        | Id  | Operation                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |   
        -----------------------------------------------------------------------------------------------------------   
        |   0 | SELECT STATEMENT           |       |      1 |        |     2 (100)|      1 |00:00:00.01 |       2 |   
        |   1 |  SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |   
        |   2 |   INDEX FULL SCAN (MIN/MAX)| T1_ID |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       2 |   
        -----------------------------------------------------------------------------------------------------------   
                                                                                                                      
        
        14 rows selected.
        SQL> select /*+ index_ffs(t1)*/max(id) from t1;
        
           MAX(ID)                                                                                                    
        ----------                                                                                                    
             53999                                                                                                    
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        PLAN_TABLE_OUTPUT                                                                                             
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  34rj7auqf4864, child number 0                                                                         
        -------------------------------------                                                                         
        select /*+ index_ffs(t1)*/max(id) from t1                                                                     
                                                                                                                      
        Plan hash value: 2419761730                                                                                   
                                                                                                                      
        -----------------------------------------------------------------------------------------------------------   
        | Id  | Operation                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |   
        -----------------------------------------------------------------------------------------------------------   
        |   0 | SELECT STATEMENT           |       |      1 |        |    38 (100)|      1 |00:00:00.01 |       2 |   
        |   1 |  SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |   
        |   2 |   INDEX FULL SCAN (MIN/MAX)| T1_ID |      1 |  54000 |    38   (0)|      1 |00:00:00.01 |       2 |   
        -----------------------------------------------------------------------------------------------------------   
                                                                                                                      
        Hint Report (identified by operation id / Query Block Name / Object Alias):                                   
        Total hints for statement: 1 (U - Unused (1))                                                                 
        ---------------------------------------------------------------------------                                   
                                                                                                                      
           2 -  SEL$1 / T1@SEL$1                                                                                      
                 U -  index_ffs(t1)                                                                                   
                                                                                                                      
        
        21 rows selected.
        

        As you can see, the plan_hash_value is the same for both queries, there are no additional predicates, and the hint is not used.
        However, the cost of the second query is 38 as if it was actually performing INDEX FAST FULL SCAN.
        Looks like an optimizer bug.

        Second, having compared CBO traces, I can see that Oracle does not consider access path Min/Max after the partitions are dropped:

        select max(id) from t1;
        -- before dropping the partitions we have this
         ****** Costing Index T1_ID
          Access Path: index (Min/Max)
            Index: T1_ID
            resc_io: 2.000000  resc_cpu: 14443
            ix_sel: 0.000000  ix_sel_with_filters: 0.000000 
            Cost: 2.000276  Resp: 2.000276  Degree: 1
          Best:: AccessPath: IndexRange
        

        The full CBO traces are available on GitHub: https://gist.github.com/mvelikikh/ead40e5d7746faf831779a815e8b8e8c

        Third, some queries resulted in the INDEX FULL SCAN (MIN/MAX) access path after the partitions were dropped:

        select max(id) from t1 where f_rowid(rowid)=1;
        select max(id) from t1 where f_number(id)=1;
        select /*+ index(t1)*/max(id) from t1;
        select /*+ no_index_ffs(t1)*/max(id) from t1;
        

        I believe your example with the NOT NULL predicate is another case when we have the index full scan (min/max) access path.
        This proves that we can force the desired access path, which Oracle does not consider by default, by using hints and adding additional predicates.

        Comparing cost of INDEX FAST FULL SCAN and INDEX FULL SCAN (MIN/MAX) after dropping the partitions:

        SQL> select max(id) from t1;
        
           MAX(ID)                                                                                                    
        ----------                                                                                                    
             53999                                                                                                    
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost');
        
        PLAN_TABLE_OUTPUT                                                                                             
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  9s014c27wfym3, child number 0                                                                         
        -------------------------------------                                                                         
        select max(id) from t1                                                                                        
                                                                                                                      
        Plan hash value: 2457029738                                                                                   
                                                                                                                      
        ------------------------------------------------------------------------------------------------------        
        | Id  | Operation             | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |        
        ------------------------------------------------------------------------------------------------------        
        |   0 | SELECT STATEMENT      |       |      1 |        |    20 (100)|      1 |00:00:00.01 |     142 |        
        |   1 |  SORT AGGREGATE       |       |      1 |      1 |            |      1 |00:00:00.01 |     142 |        
        |*  2 |   INDEX FAST FULL SCAN| T1_ID |      1 |  27000 |    20   (0)|  27000 |00:00:00.01 |     142 |        
        ------------------------------------------------------------------------------------------------------        
                                                                                                                      
        Predicate Information (identified by operation id):                                                           
        ---------------------------------------------------                                                           
                                                                                                                      
           2 - filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)                                                    
        SQL> select /*+ index(t1)*/max(id) from t1;
        
           MAX(ID)                                                                                                    
        ----------                                                                                                    
             53999                                                                                                    
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        PLAN_TABLE_OUTPUT                                                                                             
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  4dmk5gjsm3c8h, child number 0                                                                         
        -------------------------------------                                                                         
        select /*+ index(t1)*/max(id) from t1                                                                         
                                                                                                                      
        Plan hash value: 151196066                                                                                    
                                                                                                                      
        ------------------------------------------------------------------------------------------------------------  
        | Id  | Operation                   | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  
        ------------------------------------------------------------------------------------------------------------  
        |   0 | SELECT STATEMENT            |       |      1 |        |    69 (100)|      1 |00:00:00.01 |       2 |  
        |   1 |  SORT AGGREGATE             |       |      1 |      1 |            |      1 |00:00:00.01 |       2 |  
        |   2 |   FIRST ROW                 |       |      1 |  27000 |    69   (0)|      1 |00:00:00.01 |       2 |  
        |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_ID |      1 |  27000 |    69   (0)|      1 |00:00:00.01 |       2 |  
        ------------------------------------------------------------------------------------------------------------  
                                                                                                                      
        Predicate Information (identified by operation id):                                                           
        ---------------------------------------------------                                                           
                                                                                                                      
           3 - filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)                                                    
        

        Due to the fact that the cost of the second query is higher, I initially thought that the optimizer considers both access paths.
        Yet it is not the case as the optimizer trace files demonstrate: https://gist.github.com/mvelikikh/ead40e5d7746faf831779a815e8b8e8c

        I also tested the scenario when the highest partitions are dropped to see how the query would behave:

        -- dropping the highest partition
        alter table t1 drop partition p09000, p18000, p54000 update global indexes;
        
        ...
        SQL> select /*+ no_index_ffs(t1)*/max(id) from t1;
        
           MAX(ID)
        ----------
             44999
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        PLAN_TABLE_OUTPUT
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  3fk5m75271219, child number 0
        -------------------------------------
        select /*+ no_index_ffs(t1)*/max(id) from t1
        
        Plan hash value: 151196066
        
        ------------------------------------------------------------------------------------------------------------
        | Id  | Operation                   | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
        ------------------------------------------------------------------------------------------------------------
        |   0 | SELECT STATEMENT            |       |      1 |        |    70 (100)|      1 |00:00:00.01 |      24 |
        |   1 |  SORT AGGREGATE             |       |      1 |      1 |            |      1 |00:00:00.01 |      24 |
        |   2 |   FIRST ROW                 |       |      1 |  27000 |    70   (0)|      1 |00:00:00.01 |      24 |
        |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_ID |      1 |  27000 |    70   (0)|      1 |00:00:00.01 |      24 |
        ------------------------------------------------------------------------------------------------------------
        
        Predicate Information (identified by operation id):
        ---------------------------------------------------
        
           3 - filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)
        
        Hint Report (identified by operation id / Query Block Name / Object Alias):
        Total hints for statement: 1
        ---------------------------------------------------------------------------
        
           3 -  SEL$1 / T1@SEL$1
                   -  no_index_ffs(t1)
        
        ...
        
        -- dropping two highest partitions
        alter table t1 drop partition p09000, p45000, p54000 update global indexes;
        
        ...
        
        SQL> select * from dbms_xplan.display_cursor(format=>'allstats last cost hint_report');
        
        PLAN_TABLE_OUTPUT
        --------------------------------------------------------------------------------------------------------------
        SQL_ID  3fk5m75271219, child number 0
        -------------------------------------
        select /*+ no_index_ffs(t1)*/max(id) from t1
        
        Plan hash value: 151196066
        
        ------------------------------------------------------------------------------------------------------------
        | Id  | Operation                   | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
        ------------------------------------------------------------------------------------------------------------
        |   0 | SELECT STATEMENT            |       |      1 |        |    69 (100)|      1 |00:00:00.01 |      47 |
        |   1 |  SORT AGGREGATE             |       |      1 |      1 |            |      1 |00:00:00.01 |      47 |
        |   2 |   FIRST ROW                 |       |      1 |  27000 |    69   (0)|      1 |00:00:00.01 |      47 |
        |*  3 |    INDEX FULL SCAN (MIN/MAX)| T1_ID |      1 |  27000 |    69   (0)|      1 |00:00:00.01 |      47 |
        ------------------------------------------------------------------------------------------------------------
        
        Predicate Information (identified by operation id):
        ---------------------------------------------------
        
           3 - filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)
        
        Hint Report (identified by operation id / Query Block Name / Object Alias):
        Total hints for statement: 1
        ---------------------------------------------------------------------------
        
           3 -  SEL$1 / T1@SEL$1
                   -  no_index_ffs(t1)
        

        I got correct results but the amount of buffers was linearly increasing: I got 2 buffers in the first test after dropping three lowest partitions vs 24 and 47 buffers when the highest and two highest partitions are dropped.
        Probably it is some decision made by optimizer developers – do not use index full scan (min/max) when there are orphaned entries. Otherwise, consistent performance cannot be guaranteed.
        It does not explain, though, why Oracle resorts to index full scan (min/max) when there are additional predicates present (your NOT NULL condition and some queries that I ran).
        Maybe the heuristic rule does not apply when predicates can be resolved without visiting the table (ID IS NOT NULL / ID > 0 / MOD(ID,2)>0 / F_ROWID(ROWID)=1 etc.) which can also be justified in my book.

        Thanks,
        Mikhail

        Comment by Mikhail Velikikh — August 11, 2022 @ 6:33 pm BST Aug 11,2022 | Reply


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Website Powered by WordPress.com.

%d bloggers like this: