Oracle Scratchpad

November 28, 2022

Hakan Factor

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 3:14 pm GMT Nov 28,2022

There’s a question on the MOSC forum (needs an account) at present that started with the performance of the datapump API over a database link but moved on to the topic of how to handle a scenario that I’ve described in the past involving a table where rows are initially short and eventually become much longer and a requirement comes up to rebuild the table.

In this case the OP has to use datapump (selecting truncate as the “action on existence”) to copy the table data from one place to another rather than doing the more common ‘alter table move’ variant of rebuilding the table.

The underlying problem in this case is that:

  • the table has 84 columns made up of (pk_col1, pk_col2, flag, change_date) plus 20 groups of 4 “value” columns.
  • rows are inserted with just the four main columns and the first group of four values.
  • over time each subsequent group of 4 values in a row is updated in a separate statement

We haven’t been given numbers but a row probably ends up taking about 10 times the space it started with – and if that’s the case you would normally need to set the table’s pctfree to something like 90 to avoid getting a lot of migrated rows in the table. But that’s not the whole of the story.

Things to go wrong

If you don’t set pctfree to 90 you get lots of migrated rows. If you then do an export (expdp) in direct_path mode expdp will do a large number of single block reads following the migrated rows, and Oracle won’t cache the follow-on blocks, so you may re-read them several times in the course of reading one block in the direct path tablescan. (For cached reads the codepath for a tablescan will simply ignore the “head pointer” to a migrated row because it “knows” that it will find the whole row in some other block eventually.)

If you do set pctfree to 90 then when you rebuild the table (or recreate it with pctfree set to 90) you end up with a much larger table with lots of blocks that are only 10% used because most of the rows are now big and aren’t going to grow any more.
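
Before choosing between these two evils it may be worth measuring how many rows have already migrated. The following is a minimal sketch, assuming the OP’s table is called test_tab (as in the examples further down) and that the standard CHAINED_ROWS table has been created in the current schema by running utlchain.sql. The analyze command reports migrated rows and truly chained rows together, and it has to read the whole table, so it is something to try on a copy first.

```sql
rem     One-off: create the CHAINED_ROWS table in the current schema
@?/rdbms/admin/utlchain.sql

rem     Record the head rowid of every migrated or chained row.
rem     This does not touch the optimizer statistics on the table.
analyze table test_tab list chained rows into chained_rows;

select
        count(*)        migrated_or_chained
from
        chained_rows
where
        table_name = 'TEST_TAB'
/
```

Comparing this count with the number of rows in the table gives a rough idea of how many extra single block reads a direct path scan may have to do.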

Best strategy – the Hakan factor.

Work out how many rows in their final state will fit into a block and recreate the table telling Oracle that that’s the maximum number of rows it’s allowed to put in a block. (You could also set pctfree to zero at the same time to minimise the chance of Oracle inserting fewer rows than your target.)

The devil, of course, is in the detail. Part of the devilry comes from a bug that was fixed some time as far back as 10.2.0.1. Part comes from the fact that Oracle doesn’t give us a documented API to set the magic number – we have to find a way to teach Oracle about the number or hack the data dictionary. Part, inevitably, comes from the fact that when dealing with undocumented (or barely documented) mechanisms you ought to set up some test cases to check that the latest version of Oracle behaves the same way as your previous versions of Oracle when you’re playing dirty tricks.

Part 1 – Teaching Oracle.

You may know your data so well that you can immediately say how many “full-length” rows should fit in a block. If you can’t do this you could simply create a copy of the original table structure with a pctfree of zero then copy into it a few hundred rows from the original table using a predicate to limit the selected rows to ones that would not be updated any further. For example (using the table definition supplied by the OP) you might say:

create table test_tab_clone 
pctfree 0 
as 
select  * 
from    test_tab 
where   rownum = 0
/

insert into test_tab_clone 
select  * 
from    test_tab 
where   rownum <= 400 
and     fourthvalue19 is not null
/

commit
/

I’m assuming in this case that column “fourthvalue19” will be non-null only if the whole of the 19th set of values is populated and all the other sets of values are populated. From the OP’s perspective there may be a more sensible way of identifying fully populated rows. You do need to ensure that the table has at least one full block otherwise some odd things can happen when you try to set the Hakan factor.

Once you’ve got a small table of full size rows a simple analysis of rows per block is the next step:

select
        rows_starting_in_block,
        count(*)        blocks
from
        (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_starting_in_block
        from
                test_tab_clone
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_starting_in_block
order by
        rows_starting_in_block
/

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                     3          1
                    18         22
                    19          1
                       ----------
sum                            24

Looking at these results I can see that there’s a slight variation in the number of rows that could be crammed into a block – and one block which holds the last few rows of my insert statement which I can ignore. In a more realistic case you might need to tweak the selection predicate to make sure that you’ve picked only full-size rows; or you might simply need to decide that you’ve got a choice of two or three possible values for the Hakan factor and see what the results are from using them.

With the figures above I’d be strongly inclined to set a Hakan factor of 18. That does mean I might be “wasting” roughly 1/19th of every block (for the relatively few cases where a 19th row would have fitted) but it’s better than setting the Hakan factor to 19 and finding I get roughly 1 row in every 19 migrating for 22 blocks out of 23 where I should have restricted the number of rows per block to 18; the choice is not always that straightforward.

So here’s how we now “train” Oracle, then test that it learned the lesson:

truncate table test_tab_clone;
insert into test_tab_clone select * from test_tab where rownum <= 18;
alter table test_tab_clone minimize records_per_block;

truncate table test_tab_clone;
insert into test_tab_clone select * from test_tab where rownum <= 10000;

start rowid_count test_tab_clone

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                    10          1
                    18        555
                       ----------
sum                           556

In the first three statements I’ve emptied the table, inserted 18 rows (I ought to check they all went into the same block, really) and set the Hakan factor.

Once the Hakan factor is set I’ve emptied the table again then populated it with the “full” data set. In fact for demo purposes I’ve copied exactly 10,000 rows so that we can see that every block (except, we can safely assume, the last one written to) has acquired exactly 18 rows.
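
The check that the 18 “training” rows all went into the same block is easy to script. The first query below is a minimal sketch intended to be run after the first insert and before the “minimize” call; it should report a single block. The second query is an assumption rather than anything documented: it relies on the widely reported observation that the Hakan factor is held in the low-order 15 bits of sys.tab$.spare1, and the owner name TEST_USER is purely hypothetical. Treat it as a sandbox diagnostic (it needs access to SYS-owned objects) and compare the value before and after the “minimize” call rather than trusting any particular number.

```sql
rem     How many distinct blocks hold the training rows?
select
        count(distinct
                dbms_rowid.rowid_relative_fno(rowid) || '/' ||
                dbms_rowid.rowid_block_number(rowid)
        )               blocks_used,
        count(*)        rows_inserted
from
        test_tab_clone
/

rem     ASSUMPTION: undocumented dictionary detail, may vary by version.
rem     TEST_USER is a hypothetical schema name.
select
        bitand(t.spare1, 32767)         hakan_bits
from
        sys.tab$        t,
        dba_objects     o
where
        o.owner       = 'TEST_USER'
and     o.object_name = 'TEST_TAB_CLONE'
and     o.object_type = 'TABLE'
and     t.obj#        = o.object_id
/
```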

Part 2 – applying the strategy

It’s often easy to sketch out something that looks as if it’s exactly what you need, but there are always peripheral considerations that might cause problems and an important part of examining a problem is to consider the overheads and penalties. How, for example, is our OP going to apply the method in production?

There are two problems:

  • It’s a large table, and we’re cloning it because we can’t hack directly into the data dictionary to modify the table directly. What are the side effects?
  • We want the imported export to acquire the same Hakan factor. Do we have to take any special action?

The import is the simpler problem to consider since it’s not open-ended. As far as impdp is concerned we could import “data_only” or “include_metadata”, and the “table_exists_action” could be either replace or truncate, so there are only 4 combinations to investigate.

The bad news is that none of the options behaves nicely – impdp (tested on 19.11.0.0) seems to import the data then execute the “minimize records_per_block” command when really it should transfer the Hakan factor before importing the data. So it seems to be necessary to go through the same convoluted steps at least once to precreate a target table with the desired Hakan factor and thereafter use only the truncate option for the import if you want to make the target behave in every way like the source. (Even then you will need to watch out for extreme cases if the export holds fewer rows than the value you’ve set for the Hakan factor – with the special case that if the exported table is empty the attempt by the import to set the Hakan factor raises error “ORA-28603: statement not permitted on empty tables”.)

Let’s get back to the side effects of our cloning exercise on the source table. We’ve created a copy of the original data with a suitable Hakan factor so that blocks holding “completed” rows are full and blocks holding “in-transit” rows have enough space to grow to their “completed” size and there are no migrated rows – and we don’t expect to see migrated rows in the future. But it’s not the right table, and to ensure we had a complete copy we would have stopped all processing of the source table.

Could we have avoided the stoppage? Maybe we could use the dbms_redefinition package – the OP is running Standard Edition so can’t do online redefinition any other way – and use the Hakan hack mechanism on the “interim” table immediately after creating it.
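
To make the idea concrete, here is an untested sketch of how the Hakan hack might be combined with dbms_redefinition. It assumes the source table is test_tab with a primary key on (pk_col1, pk_col2), that 18 is the chosen rows-per-block figure from Part 1, and that the code runs as the table owner; the interim table name test_tab_interim is hypothetical. Whether the copy phase honours the Hakan factor is exactly the sort of thing that needs checking with the rows-per-block query before trusting it.

```sql
rem     Interim table with the target Hakan factor set while it holds 18 rows
create table test_tab_interim pctfree 0
as
select * from test_tab where rownum = 0;

insert into test_tab_interim select * from test_tab where rownum <= 18;
commit;

alter table test_tab_interim minimize records_per_block;
truncate table test_tab_interim;

declare
        l_errors        pls_integer;
begin
        dbms_redefinition.can_redef_table(
                uname        => user,
                tname        => 'TEST_TAB',
                options_flag => dbms_redefinition.cons_use_pk
        );

        dbms_redefinition.start_redef_table(
                uname        => user,
                orig_table   => 'TEST_TAB',
                int_table    => 'TEST_TAB_INTERIM',
                options_flag => dbms_redefinition.cons_use_pk
        );

        -- The CTAS has already copied the NOT NULL constraints, so cloning
        -- them again may raise errors; ignore them here and report the count.
        dbms_redefinition.copy_table_dependents(
                uname         => user,
                orig_table    => 'TEST_TAB',
                int_table     => 'TEST_TAB_INTERIM',
                copy_indexes  => dbms_redefinition.cons_orig_params,
                ignore_errors => true,
                num_errors    => l_errors
        );
        dbms_output.put_line('Dependent object errors: ' || l_errors);

        dbms_redefinition.sync_interim_table(user, 'TEST_TAB', 'TEST_TAB_INTERIM');
        dbms_redefinition.finish_redef_table(user, 'TEST_TAB', 'TEST_TAB_INTERIM');
end;
/
```

In a real exercise the steps would be run one at a time, with the rows-per-block query executed against the interim table after start_redef_table() completes and before committing to finish_redef_table().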

If we find that the online redefinition mechanism generates too much undo and redo we’ll have to use the blocking method – but then we have to do some table renaming and worry about PL/SQL packages becoming invalid, and foreign key constraints, synonyms, views etc. being associated with the wrong table.

So even though we can sketch out an outline strategy there are still plenty of details to worry about around the edges. To a large degree this is because Oracle has not yet made the Hakan factor a “proper” property of a table that you can explicitly set in a “move” or “create table” operation. There is a function embedded in the executable (kkdxFixTableHAKAN) that looks as if it should set the Hakan factor, and there is presumably some piece of code that sets the Hakan factor when you execute a call to “create table for exchange”, but it would be nice if there was an API that was visible to DBAs.

Summary

If you have a table where rows grow significantly over their lifetime, you ought to ensure that you’ve set a suitable pctfree for the table. But if you anticipate copying, or moving the table at any time then there’s no way to pick a pctfree that is good for all stages of the data’s lifetime.

There is a feature that you can impose on the data to avoid the problems of extreme change in row-lengths and it’s fairly straightforward to impose on a single table but there is no API available to manipulate the feature directly and if you don’t anticipate the need during the initial design stage then applying the feature after the event can be an irritating and resource-intensive operation.

Footnote

For those not familiar with it, the Hakan Factor was introduced by Oracle to allow a little extra efficiency in the compression and use of bitmap indexes. If Oracle has information about the largest number of rows that can appear in any block in a table it can minimise the number of bits needed per block (space saving) and avoid having to expand and compare unnecessarily long sequences of zero bits when comparing entries across bitmap indexes. Given their intended use it should come as no surprise that you can’t call “minimize records_per_block” for a table that has an existing bitmap index.

November 21, 2022

Row_number() sorts

Filed under: Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 5:47 pm GMT Nov 21,2022

An email on the Oracle-L list server a few days ago described a performance problem that had appeared after an upgrade from 11.2.0.4 to 19c (19.15). A long running statement (insert as select, running parallel 16) that had run to completion in 11g using about 20GB of temporary space (with 50GB read and written) had failed after running for a couple of hours in 19c and consuming 2.5 TB of temporary space, even when the 11g execution plan was recreated through an SQL Profile.

When I took a look at the SQL Monitor report for 19c it turned out that a large fraction of the work done was in an operation called WINDOW CHILD PUSHED RANK which was there to deal with a predicate:

row_number() over(partition by t.ds_no, t.c_nbr order by c.cpcl_nbr desc) = 1

Checking the successful 11g execution, this operation had taken an input rowsource of 7 billion rows and produced an output rowsource of 70 million rows.

Checking the SQL Monitor report for the failed executions in 19c, the “pure” 19c plan had reported 7 billion input rows, 6GB memory and 1TB temp space at the same point; the plan with the 11g profile had reported 10 billion rows, but the operation had not yet reported any output rows despite reporting 9GB as the maximum memory allocation and 1TB as the maximum temp space usage. (Differences in row counts were probably due to the report being run for different dates.)

So, the question to the list server was: “is this a bug in 19c?”

Modelling

It’s a little unfortunate that I couldn’t model the problem in 19c at the time because my 19c VM kept crashing; but I built a very simple model to allow me to emulate the window sort and rank() predicate in an 11g instance, then re-played the model in an instance of 21c.

For the model data I took 50 copies of the first 50,000 rows from view all_objects to produce a table of 2,500,000 rows covering 35,700 blocks and 279 MB, (55,000 / 430 in 21c); then I ran the query below and reported its execution plan with a basic call to dbms_xplan.display_cursor():

select
        /*+ dynamic_sampling(0) */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

-------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       | 29491 (100)|          |
|   1 |  SORT GROUP BY             |      |     8 |   184 |       | 29491   (9)| 00:02:28 |
|   2 |   VIEW                     |      |  2500K|    54M|       | 28532   (6)| 00:02:23 |
|*  3 |    VIEW                    |      |  2500K|   112M|       | 28532   (6)| 00:02:23 |
|*  4 |     WINDOW SORT PUSHED RANK|      |  2500K|    95M|   124M| 28532   (6)| 00:02:23 |
|   5 |      TABLE ACCESS FULL     | T1   |  2500K|    95M|       |  4821   (8)| 00:00:25 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY
              INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Oracle 21c produced the same execution plan – though the row estimate for the VIEW operations (numbers 2 and 3) was a more realistic 46,236 (num_distinct recorded for object_name) compared to the unchanged 2,500,000 from 11g. (Of course it should have been operation 4 that showed the first drop in cardinality.)

With my first build, the timings weren’t what I expected: under 21c the query completed in 3.3 seconds, under 11g it took 11.7 seconds. Most of the difference was due to a large (55MB) spill to temp space that appeared in 11g but not in 21c. This would have been because 11g wasn’t allowed a large enough PGA, so I set the workarea_size_policy to manual and the sort_area_size to 100M, which looks as if it should have been enough to cover the 11g requirement – it wasn’t and I had to grow the sort_area_size to 190 MB before the 11g operation completed in memory, allocating roughly 155MB. By comparison 21c reported an increase of only 19MB of PGA to run the query, claiming that it needed only 4.7MB to handle the critical operation.
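
The post doesn’t say how the PGA figures were captured, so the following is just one plausible way of doing it rather than the method actually used: connect afresh, take a snapshot of the session’s PGA statistics, run the test query, then take the snapshot again and compare. The statistic names are the standard ones exposed through v$statname.

```sql
rem     Run once before and once after the test query, in the same session
select
        sn.name,
        ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in ('session pga memory', 'session pga memory max')
order by
        sn.name
/
```

The change in “session pga memory max” between the two snapshots is the figure to compare across versions; reconnecting before each test stops memory left over from earlier statements from hiding the growth.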

For comparison purposes here are the two run-time execution plans, with rowsource execution stats (which messed the timing up a little) and the column projection information; 11g first:

-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:03.96 |   35513 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:03.96 |   35513 |  3072 |  3072 | 2048  (0)|
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:04.07 |   35513 |       |       |          |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:03.93 |   35513 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|   1454K|00:00:08.82 |   35513 |   189M|  4615K|  168M (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:10.85 |   35513 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) "OWNER"[VARCHAR2,30], MAX("OBJECT_NAME")[30]
   2 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30]
   3 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "ORANK"[NUMBER,22]
   4 - (#keys=2) "OBJECT_NAME"[VARCHAR2,30], INTERNAL_FUNCTION("OBJECT_TYPE")[19], "OWNER"[VARCHAR2,30], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "OBJECT_TYPE"[VARCHAR2,19]

It’s an interesting oddity, and possibly a clue about the excess memory and temp space, that the A-Rows column for the Window Sort operation reports 1,454K rows output when it surely ought to be the final 28,575 at that point. It’s possible to imagine a couple of strategies that Oracle might be following to do the window sort that would result in the excess volume appearing, and I’ll leave it to the readers to use their imagination on that one.

And now 21c

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:02.98 |   54755 |  54750 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:02.98 |   54755 |  54750 |  5120 |  5120 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |  5297K|   950K| 4708K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.28 |   54755 |  54750 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1; rowset=256) "OWNER"[VARCHAR2,128], MAX("OBJECT_NAME")[128]
   2 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128]
   3 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "ORANK"[NUMBER,22]
   4 - (#keys=2; rowset=256) "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23], "OWNER"[VARCHAR2,128], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23]

In this case we see the A-rows from the Window Sort meeting our expectations – but that may be a beneficial side effect of the operation completing in memory.

Optimisation (?)

Given the dramatically different demands for memory for a query that ought to do the same thing in both versions it looks as if 21c may be doing something clever that 11g doesn’t do, or maybe doesn’t do very well, or maybe tries to do but has a bug that isn’t dramatic enough to be obvious unless you’re looking closely.

Here’s a script that I used to build the test data, with scope for a few variations in testing. You’ll notice that the “create table” includes an “order by” clause that is close to the sorting requirement of the over() clause that appears in the query. The results I’ve shown so far were for data that didn’t have this clause in place.

rem
rem     Script:         analytic_sort_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested
rem             21.3.0.0
rem             11.2.0.4
rem

create table t1 nologging 
as
select 
        ao.*
from
        (select * from all_objects where rownum <= 50000) ao,
        (select rownum from dual connect by rownum <= 50)
order by
        object_name, object_type -- desc
/

--
--      Stats collection to get histograms
--

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 254'
        );
end;
/

--
-- reconnect here to maximise visibility of PGA allocation
--

connect xxxxxxxx/xxxxxxxx

set linesize 180
set trimspool on
set tab off

-- alter session set workarea_size_policy = manual;
-- alter session set sort_area_size = 199229440;

alter session set events '10046 trace name context forever, level 8';
-- alter session set statistics_level = all;
-- alter session set "_rowsource_execution_statistics"= true;

spool analytic_sort_2

select
        /*  monitoring */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

select * from table(dbms_xplan.display_cursor(format=>'cost bytes allstats last projection'));

alter session set events '10046 trace name context off';
alter session set "_rowsource_execution_statistics"= false;
alter session set statistics_level = typical;
alter session set workarea_size_policy = auto;

spool off

The results I’m going to comment on now are the ones I got after running the script as above, then reconnecting and flushing the shared pool before repeating the second half of the script (i.e. without recreating the table).

In 11g, going back to the automatic workarea sizing the session used 37MB of memory and then spilled (only) 3MB to temp. The run time was approximately 3 seconds – which is a good match for the “unsorted” 21c run time. As with the original tests, the value reported in A-rows is larger than we would expect (in this case suspiciously close to twice the correct value – but that’s more likely to be a coincidence than a clue). Interestingly, when I switched to the manual workarea_size_policy and set the sort_area_size to 190MB Oracle said “that’s the optimum memory” and used nearly all of it to complete in memory – for any value less than that (even down to 5MB) Oracle spilled just 3 MB to disk in a one-pass operation. So it looks as if Oracle “knows” it doesn’t need to sort the whole data set, but still uses as much memory as is available to do something before it starts to get clever.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:01.76 |   35523 |   2145 |    331 |       |       |          |         |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:01.76 |   35523 |   2145 |    331 |  2048 |  2048 | 2048  (0)|         |
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:02.00 |   35523 |   2145 |    331 |       |       |          |         |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:01.83 |   35523 |   2145 |    331 |       |       |          |         |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|  57171 |00:00:02.10 |   35523 |   2145 |    331 |  2979K|   768K|   37M (1)|    3072 |
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:11.84 |   35513 |   1814 |      0 |       |       |          |         |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In 21c there’s essentially no difference between the sorted and unsorted tests, which suggests that with my data the session had been able to apply its optimisation strategy at the earliest possible moment rather than waiting until it had no alternative but to spill to disc.

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:00.98 |   54753 |  54748 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:00.98 |   54753 |  54748 |  4096 |  4096 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |  5155K|   940K| 4582K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.42 |   54753 |  54748 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Assumption

Given the way that 11g reports a very small spill to disc, which stays fairly constant in size no matter how large or small the available PGA allocation is, when the input data is sorted to help the over() clause, and given how large the spill to disc can become when the data is not sorted, I feel that Oracle has an optimisation that discards input rows early in the analytic window sort. But we also have some evidence of a flaw in the code in versions prior to 21c that means Oracle fails to re-use memory that becomes available from rows that have been discarded.

Strategy

I’ve said in the past that if you’re using analytic functions you ought to minimise the size of the data you’re processing before you apply the analytic part. Another step that can help is to make sure you’ve got the data into a (fairly well) sorted order before you reach the analytic part.

In the case of versions of Oracle prior to 21c, it also seems to make sense (if you can arrange it) to minimise the amount of memory the session is allowed to use for a sort operation, as this will reduce the CPU used by the session and avoid grabbing excess redundant memory that could be used more effectively by other sessions.
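
As a minimal sketch of what “arranging it” might look like at the session level, the following uses the same manual workarea settings that appear (commented out) in the test script. The 5MB figure is simply the smallest value mentioned in the 11g test above, not a recommendation; a suitable value for a real workload would have to come from testing.

```sql
rem     Cap the sort workarea for this session only
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 5242880;     -- 5MB, taken from the test above

rem     ... run the statement containing the analytic function here ...

rem     Put the session back to automatic workarea sizing
alter session set workarea_size_policy = auto;
```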

Addendum

Just before publishing I found a way of keeping my 19.11.0.0 instance alive long enough to run the tests, then also ran them on an instance of 12.2.0.1. Both versions showed the same pattern of doing a large allocation of memory and large spill to disc when the data was not sorted, and a large allocation of memory but a small spill to disc when the data was sorted.

As a little sanity check I also exported the 19c data and imported it to 21c in case it was a simple variation in the data that allowed 21c to operate more efficiently than 19c. The change in data made no difference to the way in which 21c handled it; in both cases it called for a small allocation of memory with no spill to disc.

November 16, 2022

Lost Space

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 12:56 pm GMT Nov 16,2022

I’ve just discovered that the space management bitmaps for the tablespace I normally use in my 21c tests are broken. In a tablespace that’s supposed to be completely empty a query of dba_free_space shows 4 gaps totalling several thousand blocks:

SQL> select * from dba_free_space where tablespace_name = 'TEST_8K_ASSM';

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128     327680         40           13
TEST_8K_ASSM                           13        216    1376256        168           13
TEST_8K_ASSM                           13       1792     720896         88           13
TEST_8K_ASSM                           13       1896     196608         24           13
TEST_8K_ASSM                           13       9600  969932800     118400           13

Of course I ran my script to drop all segments and purge the recyclebin when I first saw this, but that didn’t help, and a query against dba_segments showed no segments, and a query against seg$ showed nothing in the file. So somehow the bits are bust.
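The three checks described above could be run with queries along the following lines. This is only a sketch under stated assumptions: the join to sys.ts$ is one assumed way of restricting seg$ to the tablespace, and querying seg$ directly requires a suitably privileged account (SYS, for example).

```sql
-- Segments that the dictionary views report in the tablespace
select  owner, segment_name, segment_type
from    dba_segments
where   tablespace_name = 'TEST_8K_ASSM'
;

-- Dropped objects still sitting in any user's recyclebin
select  owner, object_name, original_name
from    dba_recyclebin
where   ts_name = 'TEST_8K_ASSM'
;

-- Raw segment entries in the underlying dictionary table
select  sg.file#, sg.block#, sg.type#
from    sys.seg$ sg, sys.ts$ ts
where   ts.name = 'TEST_8K_ASSM'
and     sg.ts#  = ts.ts#
;
```

If all three queries return no rows while dba_free_space still shows gaps, the space management bitmaps are the remaining suspect.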

Fortunately there’s a dbms_space_admin package with a procedure tablespace_verify() that I’ve wanted to test for some time – the documentation is a little sparse about how it works. So here’s a cut-and-paste of my first (and second) call to the procedure, executing from the SYS schema and passing in the tablespace name:

SQL> execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')
BEGIN dbms_space_admin.tablespace_verify ('TEST_8K_ASSM'); END;

*
ERROR at line 1:
ORA-20000: BitMap entry partially used with no Extent Map entry
TSN 6: Range RelFno 13: ExtNo: 32702 BeginBlock: 0 EndBlock: 4194303
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 83
ORA-06512: at line 1

SQL> execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')
BEGIN dbms_space_admin.tablespace_verify ('TEST_8K_ASSM'); END;

*
ERROR at line 1:
ORA-20000: BitMap entry partially used with no Extent Map entry
TSN 6: Range RelFno 13: ExtNo: 32766 BeginBlock: 0 EndBlock: 4194303
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 83
ORA-06512: at line 1


The output isn’t promising – but we can, at least, see that it’s the right RelFno, and the ExtNo: seems to have moved on by 64 (which is a nice number in an abstract, computational way), but what might the ExtNo: be? And I know that I’ve only got 128,000 blocks in the file and it’s not set to auto-extend so that EndBlock: value is a little worrying.

Just to add a little more confusion – the next few calls reported the ExtNo: as 0, then stuck at 32,766. So it probably wasn’t walking the file’s bitmap blocks as I first guessed.

What to do next? In my case I could throw the tablespace away – there was nothing in it, and even if there were I could have recreated it very easily – so I was happy to try the next dbms_space_admin feature: tablespace_fix_bitmaps(). Here’s the declaration:

  procedure tablespace_fix_bitmaps(
        tablespace_name         in    varchar2 ,
        dbarange_relative_file  in    positive ,
        dbarange_begin_block    in    positive ,
        dbarange_end_block      in    positive ,
        fix_option              in    positive
  );
  --
  --  Marks the appropriate dba range (extent) as free/used in bitmap
  --  Input arguments:
  --   tablespace_name         - name of tablespace
  --   dbarange_relative_file  - relative fileno of dba range (extent)
  --   dbarange_begin_block    - block number of beginning of extent
  --   dbarange_end_block      - block number (inclusive) of end of extent
  --   fix_option              - TABLESPACE_EXTENT_MAKE_FREE or
  --                             TABLESPACE_EXTENT_MAKE_USED

Again the documentation is a little sparse, so I’m just going to cross my fingers and hope for the best – proceeding a little cautiously. Looking at the report of free space I can infer from the first two lines that the bits for blocks 168 (128 + 40) to 215 (216 – 1) are marked as used. So I’ll try to pass that information into the procedure call:

set serveroutput on
set linesize 132
set trimspool on
set tab off


begin
        dbms_space_admin.tablespace_fix_bitmaps(
                tablespace_name         => 'TEST_8K_ASSM',
                dbarange_relative_file  => 13,
                dbarange_begin_block    => 168,
                dbarange_end_block      => 215,
                fix_option              => dbms_space_admin.TABLESPACE_EXTENT_MAKE_FREE
        );
end;
/

PL/SQL procedure successfully completed.

SQL> select * from dba_free_space where tablespace_name = 'TEST_8K_ASSM';

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128    2097152        256           13
TEST_8K_ASSM                           13       1792     720896         88           13
TEST_8K_ASSM                           13       1896     196608         24           13
TEST_8K_ASSM                           13       9600  969932800     118400           13

Comparing the new results from dba_free_space we can see that we’ve eliminated the “used” chunk that was between the first two free chunks and now have a single free chunk stretching from block 128 to block 383. So now we rinse and repeat – and we could use dba_free_space to help by generating a list of begin and end blocks – we might even consider writing a query to drive a cursor loop (being very careful to allow for multi-file tablespaces, which I haven’t done):

select
        relative_fno, block_id, block_id + blocks begin_block,
        lead(block_id) over (order by relative_fno, block_id) - 1  end_block
from
        dba_free_space
where
        tablespace_name = 'TEST_8K_ASSM'
order by
        relative_fno, block_id
/

RELATIVE_FNO   BLOCK_ID BEGIN_BLOCK  END_BLOCK
------------ ---------- ----------- ----------
          13        128         384       1791
          13       1792        1880       1895
          13       1896        1920       9599
          13       9600      128000
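A multi-file variant of the query could drive a cursor loop along these lines. It is a sketch, not a tested repair script: it assumes the tablespace is known to hold no segments at all, it partitions the lead() by relative_fno so that the last chunk of one file is never paired with the first chunk of the next, and it only prints the candidate ranges. The call to tablespace_fix_bitmaps() is left commented out so that the ranges can be checked by eye first.

```sql
set serveroutput on

begin
        for r in (
                select  relative_fno, begin_block, end_block
                from    (
                        select
                                relative_fno,
                                block_id + blocks       begin_block,
                                lead(block_id) over (
                                        partition by relative_fno
                                        order by block_id
                                ) - 1                   end_block
                        from
                                dba_free_space
                        where
                                tablespace_name = 'TEST_8K_ASSM'
                        )
                where
                        -- drops the last chunk in each file (null end_block)
                        -- and adjacent chunks that have no gap between them
                        end_block >= begin_block
                order by
                        relative_fno, begin_block
        ) loop
                dbms_output.put_line(
                        'File ' || r.relative_fno ||
                        ': blocks ' || r.begin_block ||
                        ' to ' || r.end_block
                );

                -- Uncomment only after checking the reported ranges:
                -- dbms_space_admin.tablespace_fix_bitmaps(
                --         tablespace_name         => 'TEST_8K_ASSM',
                --         dbarange_relative_file  => r.relative_fno,
                --         dbarange_begin_block    => r.begin_block,
                --         dbarange_end_block      => r.end_block,
                --         fix_option              => dbms_space_admin.TABLESPACE_EXTENT_MAKE_FREE
                -- );
        end loop;
end;
/
```

Against the free space listed above this should report the three ranges 384 to 1791, 1880 to 1895 and 1920 to 9599.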

After three more calls to tablespace_fix_bitmaps() this is the result I got from my query against dba_free_space – followed by a call to tablespace_verify():

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128 1047527424     127872           13

SQL>  execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')

PL/SQL procedure successfully completed.

Summary

After finding a tablespace that should have shown nothing but free space along its whole length (and checking the recyclebin, and the underlying seg$ table) I called dbms_space_admin.tablespace_verify() to see what it thought was going on and it reported an inconsistency between the tablespace (file) bitmap and segment bitmaps (in this case because there were no segment bitmaps when the file bitmap said there ought to be).

Starting from a query against dba_free_space I worked out the ranges of blocks that were marked in the file bitmap as used when they shouldn’t have been, and called dbms_space_admin.tablespace_fix_bitmaps() for each range.

After fixing all the bad ranges I called tablespace_verify() again to see if it had any more complaints, and got an empty report.

Footnotes

The documentation is not user-friendly, and it would be nice to have some comments in the manual (or dbmsspc.sql script) describing possible outputs. On the other hand I managed to avoid reading the documentation carefully enough anyway, because it wasn’t until I started searching MOS for better documentation that I realised I should have used the ASSM version of verify:

execute dbms_space_admin.assm_tablespace_verify ('TEST_8K_ASSM', dbms_space_admin.ts_verify_bitmaps)

This procedure might have reported sensible information for the Extno, BeginBlock and EndBlock. But it was too late to find out – I’ll just wait for the next corruption to happen.

There is one circumstance where you might see multiple chunks in dba_free_space when there are no segments allocated, but with no gaps between chunks – if Oracle has to “grow” the bitmap for a file then the separate chunks of the bitmap report their freespace separately.

Another possibility for multiple free space chunks when there are no (ordinary) segments is if you’ve moved the tablespace bitmap or converted a dictionary managed tablespace to a locally managed tablespace – again a rare occurrence – in which case the tablespace bitmap will be in a “nearly-hidden” segment.

November 15, 2022

opt_estimate 4a

Filed under: CBO,Execution plans,Hints,Oracle,Tuning — Jonathan Lewis @ 11:21 am GMT Nov 15,2022

I wrote a batch of notes about the opt_estimate() hint a couple of years ago, including one where I explained the option for using the hint to specify the number of rows in a query block. I’ve just come across a particular special case for that strategy that others might find a use for. It’s something to do when using the “select from dual … connect by” trick for multiplying rows.

Here’s a little data to model the idea – I’ve used the all_objects view to generate some “well-known” data since I want to add a tiny bit of complexity to the query while still leaving it easy to understand the index. The results from this demonstration come from Oracle 21.3.0.0, and I’ve included the hint /*+ no_adaptive_plan */ to stop Oracle from getting too clever during optimisation.

rem
rem     Script:         opt_estimate_dual.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested 
rem             21.3.0.0
rem

create table tables_table as select * from all_objects where object_type = 'TABLE';
create table objects_table as select * from all_objects;

alter table objects_table add constraint ot_pk primary key(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'tables_table',
                method_opt => 'for columns size 60 owner'
        );
end;
/

set serveroutput off

with driver as (
        select  /*+ materialize */
                tt.owner, tt.object_id, v1.rn
        from    tables_table tt,
                (
                select
                        /*+  opt_estimate(query_block scale_rows=10) */
                        rownum rn
                from    dual
                connect by
                        level <= 10
                ) v1
        where
                tt.owner = 'OUTLN'
)
select  /*+ no_adaptive_plan */
        dr.rn, dr.owner, dr.object_id,
        ot.object_id, ot.owner, ot.object_type, ot.object_name
from
        driver dr,
        objects_table   ot
where
        ot.object_id = dr.object_id
/

select * from table(dbms_xplan.display_cursor(format => 'hint_report'));


In my system tables_table holds 727 rows and objects_table holds 58383 rows. Three rows in tables_table correspond to tables owned by user ‘OUTLN’ which means I expect the driver CTE (common table expression / “with” subquery) to generate 30 rows and, given the join on unique id, the query to return 30 rows.

I’ve used the /*+ materialize */ hint to force Oracle to create an in-memory temporary table for the driver CTE, the /*+ no_adaptive_plan */ hint to stop Oracle from getting too clever during optimisation, and the critical /*+ opt_estimate() */ hint to help the optimizer understand the effect of my “connect by” on dual. Here’s the execution plan I get if I omit that last hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    14 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6632_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |     3 |    78 |     9   (0)| 00:00:01 |
|   4 |     VIEW                                 |                            |     1 |    13 |     2   (0)| 00:00:01 |
|   5 |      COUNT                               |                            |       |       |            |          |
|   6 |       CONNECT BY WITHOUT FILTERING       |                            |       |       |            |          |
|   7 |        FAST DUAL                         |                            |     1 |       |     2   (0)| 00:00:01 |
|   8 |     BUFFER SORT                          |                            |     3 |    39 |     9   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS FULL                   | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |     3 |   276 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6632_31D19D4 |     3 |    78 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


I’ve highlighted operations 4 and 8 in the plan: operation 4 is the view of dual that has generated 10 rows – unfortunately the optimizer has only considered the stats of the dual table, and hasn’t factored in the effects of the “connect by with rownum”. Operation 8 (the buffer sort sitting above the tablescan at operation 9) shows us that the optimizer has (correctly, thanks to the histogram I requested) estimated 3 rows for the tablescan of tables_table. The result of these two estimates is that operation 3 reports an estimate of 3 ( = 3 * 1 ) rows to be used in probing objects_table.

This is the plan after enabling the /*+ opt_estimate() */ hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    45 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6633_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |    30 |   780 |    13   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |    10 |   130 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |    10 |   130 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |    30 |  2760 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6633_31D19D4 |    30 |   780 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


There are three things that stand out in this report.

  • I’ve highlighted operations 4 and 6: operation 4 is the tablescan of tables_table that correctly estimates 3 rows; operation 6 is the view operation that now correctly estimates 10 rows.
  • With the correct estimate for the view the estimate for the join to objects_table is now correct and the join order for the merge join cartesian at operation 3 has been reversed.
  • The Hint Report tells us that the opt_estimate() hint is not (always) an optimizer hint! This is a real pain because when the opt_estimate() hints you’ve tried to use don’t appear to work it’s not easy to work out what you’ve done wrong.

To make a point, I can take the demo a little further by changing the /*+ opt_estimate() */ hint to scale_rows=120. Here’s the body of the resulting plan:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |   369 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D663A_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |   360 |  9360 |    13   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |   120 |  1560 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |   120 |  1560 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   HASH JOIN                              |                            |   360 | 54360 |   356   (1)| 00:00:01 |
|  11 |    VIEW                                  |                            |   360 | 33120 |     2   (0)| 00:00:01 |
|  12 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D663A_31D19D4 |   360 |  9360 |     2   (0)| 00:00:01 |
|  13 |    TABLE ACCESS FULL                     | OBJECTS_TABLE              | 58383 |  3363K|   354   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

The earlier plans used a nested loop join into objects_table. In this plan we can see at operation 10 that the optimizer has selected a hash join because the larger row estimate for the CTE has increased the cost of the query beyond the inflection point between nested loop and hash joins.

Summary

If you need to use the “connect by” in an inline view then you may find that the optimizer gets a very bad estimate of the number of rows the view definition will generate and that an /*+ opt_estimate() */ hint in the view using the “scale_rows=nnn” option will produce better estimates of cardinality, hence a better plan.

Footnote

In this particular case where I’ve used the dual table by itself in an inline view I could have used the “rows=NNN” option to get the same effect.

In any case I could have added a /*+ qb_name() */ hint to the inline view, and included a qualifying “@qb” in the /*+ opt_estimate() */ hint.
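Combining those two footnotes, a variant of the inline view might look like the following. Treat it as an untested sketch: opt_estimate() is undocumented, the query block name ten_rows is invented for the illustration, and the exact form of the hint (the @ten_rows qualifier followed by query_block rows=10) is an assumption extrapolated from the two footnotes, so check the Rows column of the resulting plan before relying on it.

```sql
select
        v1.rn
from    (
        select
                /*+
                        qb_name(ten_rows)
                        opt_estimate(@ten_rows query_block rows=10)
                */
                rownum rn
        from    dual
        connect by
                level <= 10
        ) v1
/

select * from table(dbms_xplan.display_cursor(format => 'hint_report'));
```

If the hint has been accepted the VIEW operation should show an estimate of 10 rows; if it still shows 1, the hint has been silently ignored, and the Hint Report will not necessarily say why.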

Using hints is hard, especially when they’re not documented. There is a lot more to learn about this hint; for example, telling the optimizer about the size of a rowsource doesn’t help if it’s going to use its estimate of distinct values in the next steps of the plan – a correction you’ve managed to introduce at one point may disappear in the very next optimizer calculation.

This catalogue lists more articles on the opt_estimate() hint and its relatives.

October 24, 2022

PL/SQL Labels

Filed under: Infrastructure,Oracle — Jonathan Lewis @ 9:06 am BST Oct 24,2022

A tweet from Franck Pachot [July 2021] about fully qualified names in Postgres prompted me to highlight a note I wrote a few years ago about using the label mechanism in Oracle’s PL/SQL to avoid collisions between variable names and table names. This led to a brief twitter exchange about labels and goto, an associated feature that I had completely forgotten about until I read a very sketchy comment in the scripts I’d used to demonstrate the use of labels to fully qualify variable names.

The comment – which I hadn’t included in the published note – was as follows:

rem     Using labels as targets
rem             goto label_name
rem                     unconditional jump
rem             exit [label_name] when {condition}
rem                     exit to line after labelled loop
rem             continue [label_name] when {condition}
rem                     start next iteration of labelled loop

Having just rediscovered the original blog note and associated script I found that I’d written up a demo of using the three code control mechanisms at the time of the discussion but not got around to publishing it, so here it is now:

rem
rem     Script:         plsql_block_names_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Aug 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem

create or replace procedure demo_labels(p_control number)
as
begin
        if demo_labels.p_control = 0 then
                goto early_exit;
        end if;

        <<outer_loop>>
        for i in 1..10 loop
                dbms_output.put_line('Starting Inner');
                <<inner_loop>>
                for i in 1..4 loop
                        exit inner_loop when outer_loop.i > 3;
                        dbms_output.put_line(inner_loop.i);
                        continue inner_loop when demo_labels.p_control = 1;
                        dbms_output.put_line('Control != 1');
                        continue outer_loop when demo_labels.p_control = 2;
                end loop inner_loop;
                dbms_output.put_line('Ended inner');
        end loop outer_loop;

        <<normal_exit>>
        dbms_output.put_line('===========');
        dbms_output.put_line('Normal Exit');
        dbms_output.put_line('===========');
        goto terminate;

        <<early_exit>>
        dbms_output.put_line('==========');
        dbms_output.put_line('Early Exit');
        dbms_output.put_line('==========');

        <<Terminate>>
        null;

end;
/

I’ve created a procedure called demo_labels, and you can see at line 14 (counting from the top of the script, including the rem header) an example of qualifying a variable name with the name of the procedure. In fact I’ve used this example to show that you can (and should) qualify the names of the formal parameters to the procedure.

Inside this procedure I’ve created labels for the two loops, <<outer_loop>> and <<inner_loop>> and, again, you can see cases (lines 23 and 24) where I’ve used the loop names to qualify the names of variables declared for the loop. You’ll notice that I (deliberately) used the same index variable name for both loops – this type of thing is usually an error waiting to happen but, by qualifying the variables at every use, I’ve pre-empted the possible “capture” problem of one use of the variable name hiding another use of the same name.
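The “capture” effect is easy to see in isolation. The following anonymous block is a minimal sketch (it is not part of the original script) that uses the same loop labels and prints the qualified and unqualified versions of the index side by side:

```sql
set serveroutput on

begin
        <<outer_loop>>
        for i in 1..2 loop
                <<inner_loop>>
                for i in 1..2 loop
                        -- an unqualified i resolves to the innermost declaration,
                        -- so the outer index is visible only through its label
                        dbms_output.put_line(
                                'outer = '        || outer_loop.i ||
                                ', inner = '      || inner_loop.i ||
                                ', unqualified = ' || i
                        );
                end loop inner_loop;
        end loop outer_loop;
end;
/
```

The unqualified value always matches the inner index: the four lines of output show (outer, inner, unqualified) as (1, 1, 1), (1, 2, 2), (2, 1, 1) and (2, 2, 2).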

I’ve also shown the use of an exit inner_loop, and a continue with both inner_loop and outer_loop; and I’ve also used the loop names to identify clearly which loop is ending on an end loop.

Finally I’ve created three further labels as potential targets for transferring execution, of which I’ve only used early_exit and terminate.

Here’s a little script I’ve then run to show the effects – you might like to work out what’s going to happen before scrolling down to the comments and output:

set feedback off
set serveroutput on

prompt  ==================== 0 ==================== 
execute demo_labels(0)

prompt  ==================== 1 ==================== 
execute demo_labels(1)

prompt  ==================== 2 ==================== 
execute demo_labels(2)

With zero as the input the procedure immediately jumps to the early_exit label, runs through the next 4 commands and returns.

==================== 0 ====================
==========
Early Exit
==========

With 1 as the input we start the outer loop, print “Starting Inner”, then start the inner loop and print the inner loop counter for the first time. Because we meet the requirements of the continue inner_loop at line 25 we drop to the end of the inner loop and go round again (you may prefer to think of this as going back to the top of the loop, I just happen to find it more natural to think of ending the cycle and starting a new one) for a total of 4 times, then print “Ended inner” and go round the outer loop a second and third time doing exactly the same thing.

When we go round the outer loop for the 4th and subsequent cycles we will immediately exit inner_loop (line 23), which means we print “Ended inner” and go round the outer loop again. Finally we’ll complete 10 cycles of the outer loop, work through the normal_exit and goto terminate.

==================== 1 ====================
Starting Inner
1
2
3
4
Ended inner
Starting Inner
1
2
3
4
Ended inner
Starting Inner
1
2
3
4
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
===========
Normal Exit
===========

With 2 as the input we start the outer loop and print “Starting Inner”, then start the inner loop and print “1”, but at line 25 we don’t jump to the end of the inner loop; we fall through to line 26, print “Control != 1”, then continue to the end of the outer loop and cycle round again. We repeat these three lines of output 3 times, then (as with input 1) line 23 makes us exit the inner loop for the next 7 cycles of the outer loop before we pass through the normal exit.

==================== 2 ====================
Starting Inner
1
Control != 1
Starting Inner
1
Control != 1
Starting Inner
1
Control != 1
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
===========
Normal Exit
===========

Playing around with other input values is left as an exercise.

October 13, 2022

v$session_longops

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 10:30 am BST Oct 13,2022

There’s a question on the Oracle developer forum at the moment asking how a tablescan could be reported as taking 94,000 seconds so far when a count(*) shows that it holds only a couple of hundred thousand rows (and it’s not storing megabytes of LOB per row if that’s your first guess).

I can think of a few reasons so I don’t know if I’ve supplied the correct explanation for the OP’s example, but it prompted me to point out that Oracle can provide several different perspectives on performance that can seem to be counter-intuitive or contradictory if you don’t realise what they’re trying to tell you.

The data supplied by the OP was initially not very readable so I’ll just point out that v$session_longops can report the sql_id, plan_hash_value and line_id for each operation, so when it reports a message like:

Table Scan: XXXXXXXXX: 158204 out of 204924 Blocks done

you can pull the query and execution plan from memory very easily.
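As a sketch of how that might be done (the substitution variable m_sql_id is a made-up name for illustration), the first query below lists long operations that are still in progress, and the second pulls the plan for a chosen sql_id. Note that the view’s actual column names are sql_plan_hash_value and sql_plan_line_id.

```sql
select
        sid, serial#, opname, target,
        sofar, totalwork, elapsed_seconds,
        sql_id, sql_plan_hash_value, sql_plan_line_id,
        message
from
        v$session_longops
where
        sofar < totalwork
order by
        start_time
;

-- A null child number reports every child cursor for the sql_id
select  *
from    table(dbms_xplan.display_cursor('&m_sql_id', null))
;
```

The sql_plan_line_id reported by the first query is the operation number to look for in the plan produced by the second.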

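This is the execution plan – with some cosmetic cleaning – supplied by the OP, with the comment that the tablescan reported by v$session_longops was operation 22 (highlighted in the original; it is the TABLE ACCESS FULL of VLD_00006D22 if you count the first line shown as operation 1):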

SORT AGGREGATE
  VIEW 
    UNION-ALL 
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              INLIST ITERATOR 
                TABLE ACCESS BY INDEX ROWID BATCHED VAC_000087FD
                  INDEX RANGE SCAN I_220101_VAC_000087FD_1
              TABLE ACCESS FULL VLD_00008866
          TABLE ACCESS FULL CCH_00008868
      HASH JOIN ANTI SNA
        FILTER 
          NESTED LOOPS OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN RIGHT ANTI SNA
                    INDEX FULL SCAN SYS_C00364276
                    TABLE ACCESS FULL VAC_00006D1B
                TABLE ACCESS FULL VLD_00006D22
            VIEW PUSHED PREDICATE 
              FILTER 
                MERGE JOIN ANTI NA
                  SORT JOIN
                    TABLE ACCESS BY INDEX ROWID BATCHED CCH_00007BB0
                      INDEX FULL SCAN SYS_C00353046
                  SORT UNIQUE
                    TABLE ACCESS FULL LNK_00008614
        INDEX FAST FULL SCAN SYS_C00364277
      HASH JOIN ANTI SNA
        FILTER 
          NESTED LOOPS OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN ANTI SNA
                    TABLE ACCESS FULL VAC_00007C1D
                    INDEX FULL SCAN SYS_C00365638
                TABLE ACCESS FULL VLD_00007C24
            VIEW PUSHED PREDICATE 
              FILTER 
                MERGE JOIN ANTI NA
                  SORT JOIN
                    TABLE ACCESS BY INDEX ROWID BATCHED CCH_00007C26
                      INDEX FULL SCAN SYS_C00353233
                  SORT UNIQUE
                    TABLE ACCESS FULL LNK_00008787
        INDEX FAST FULL SCAN SYS_C00365640
      HASH JOIN ANTI SNA
        FILTER 
          HASH JOIN OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN ANTI SNA
                    INLIST ITERATOR 
                      TABLE ACCESS BY INDEX ROWID BATCHED VAC_00006CC2
                        INDEX RANGE SCAN I_200101_VAC_00006CC2_1
                    INDEX FULL SCAN SYS_C00364113
                TABLE ACCESS FULL VLD_00006CC9
            VIEW 
              HASH JOIN RIGHT ANTI NA
                TABLE ACCESS FULL LNK_000084B2
                TABLE ACCESS FULL CCH_00007BAB
        INDEX FAST FULL SCAN SYS_C00364114
      HASH JOIN ANTI SNA
        FILTER 
          HASH JOIN OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN RIGHT ANTI SNA
                    INDEX FULL SCAN SYS_C00364266
                    TABLE ACCESS FULL VAC_00006CE2
                TABLE ACCESS FULL VLD_00006CE9
            VIEW 
              HASH JOIN RIGHT ANTI NA
                TABLE ACCESS FULL LNK_000085F0
                TABLE ACCESS FULL CCH_00007BAC
        INDEX FAST FULL SCAN SYS_C00364267
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              TABLE ACCESS FULL VAC_00008613
              TABLE ACCESS FULL VLD_00008612
          TABLE ACCESS FULL CCH_00008617
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN OUTER
              TABLE ACCESS FULL VLD_00008785
              TABLE ACCESS FULL VAC_00008786
          TABLE ACCESS FULL CCH_0000878A
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              TABLE ACCESS FULL VAC_000084B1
              TABLE ACCESS FULL VLD_000084B0
          TABLE ACCESS FULL CCH_000084B5
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN OUTER
              TABLE ACCESS FULL VLD_000085EE
              TABLE ACCESS FULL VAC_000085EF
          TABLE ACCESS FULL CCH_000085F3

The quick answer to the OP’s question is that operation 22 is the second child to a hash join (operation 17) that passes its rowsource through a FILTER operation (16) to become the first child of a nested loop (operation 15).

This means operation 22 is passing its rows up the tree one row at a time (it’s the probe table, not the build table) and the time it takes to process each row is dictated by the second child of the nested loop join. In other words – it might take a tiny amount of work to do the tablescan, but the elapsed time for the tablescan to complete is dictated by the time it takes to call the view (operation 23) for every single row that survives the journey up to operation 15.

Demo

To demonstrate the principle that the “working time” for an operation and the elapsed time to completion can be dramatically different I’ll set up a two-table join and show that a “small tablescan” can (apparently) take a long time and get into v$session_longops because of “the other” table. As a quick and dirty trick I’ll create a function that calls dbms_session.sleep() – the function that should be used to replace calls to dbms_lock.sleep() – to sleep for 1/100 second.

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6   -- > comment to avoid WordPress format issue
;

create table t2 as select * from t1;
alter table t2 add constraint t2_pk primary key(id);


create or replace function waste_time(i_in number) return number
as
begin
        dbms_session.sleep(0.01);
        return i_in;
end;
/

With the data and function in place I’ll code (and hint) a nested loop join that starts with a full tablescan of t1 and probes t2 by primary key 1,000 times.

set timing on

select
        /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
        sum(t1.id)
from
        t1, t2
where
        mod(t1.id,1000) = 0
and     t2.id  = t1.id
and     t2.n1 != 0
/

select
        /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
        sum(t1.id)
from
        t1, t2
where
        mod(t1.id,1000) = 0
and     t2.id  = t1.id
and     waste_time(t2.n1) != 0
/

set timing off

Of course, thanks to the call to waste_time() passing in t2.n1 I expect the second version of the query to take at least 10 seconds longer than the first (given 1,000 waits of 0.01 seconds spent in the call).

SUM(T1.ID)
----------
 500500000

1 row selected.

Elapsed: 00:00:00.26

SUM(T1.ID)
----------
 500500000

1 row selected.

Elapsed: 00:00:14.39
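
A quick sanity check on the arithmetic, as a minimal sketch against the t1 table created above: the number of rows that survive the mod() predicate is the number of times t2 is probed and waste_time() is called. The second statement simply turns the two elapsed times reported above into a per-call figure (about 14 ms); the gap between that and the nominal 0.01 seconds is presumably overhead around the sleep and the PL/SQL call, but that is an assumption rather than something measured here.

```sql
-- Number of driving rows: ids run from 1 to 1e6, so 1,000 of them are multiples of 1,000
select  count(*)
from    t1
where   mod(id,1000) = 0
;

-- Extra elapsed time per call to waste_time(), using the two timings reported above
select  round((14.39 - 0.26) / 1000, 4)  seconds_per_call
from    dual
;
```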

So the question is – what does v$session_longops say about any “long operations” for my session? Query and result:

select 
        sql_id, 
        sql_plan_line_id,
        to_char(vsl.start_time,'dd hh24:mi:ss') start_time, 
        to_char(vsl.last_update_time,'dd hh24:mi:ss') last_time, 
        vsl.elapsed_seconds,
        vsl.message 
from 
        V$session_Longops vsl
where 
        vsl.sid = (select ms.sid from v$mystat ms where rownum = 1)
/

SQL_ID        SQL_PLAN_LINE_ID START_TIME                LAST_TIME   ELAPSED_SECONDS
------------- ---------------- ------------------------- ----------- ---------------
MESSAGE
------------------------------------------------------------------------------------------------------------------------------------
cqv88nkkvrwpv                4 13 09:56:21               13 09:56:35              14
Table Scan:  TEST_USER.T1: 18020 out of 18020 Blocks done

So that looks like 14 seconds to do a tablescan of just 18,020 blocks. The number is very similar to the elapsed time reported for the second of my two queries – but just to make sure let’s use the reported SQL ID to pull the query and plan from memory and check operation 4 for a tablescan of t1.

select * from table(dbms_xplan.display_cursor('cqv88nkkvrwpv', format=>'hint_report'));


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  cqv88nkkvrwpv, child number 0
-------------------------------------
select  /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
sum(t1.id) from  t1, t2 where  mod(t1.id,1000) = 0 and t2.id  = t1.id
and waste_time(t2.n1) != 0

Plan hash value: 1846150233

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |       |       | 24956 (100)|          |
|   1 |  SORT AGGREGATE               |       |     1 |    15 |            |          |
|   2 |   NESTED LOOPS                |       | 10000 |   146K| 24956   (1)| 00:00:01 |
|   3 |    NESTED LOOPS               |       | 10000 |   146K| 24956   (1)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL         | T1    | 10000 | 50000 |  4936   (2)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN         | T2_PK |     1 |       |     1   (0)| 00:00:01 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| T2    |     1 |    10 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(MOD("T1"."ID",1000)=0)
   5 - access("T2"."ID"="T1"."ID")
   6 - filter("WASTE_TIME"("T2"."N1")<>0)

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3
---------------------------------------------------------------------------

   1 -  SEL$1
           -  leading(t1 t2)

   4 -  SEL$1 / T1@SEL$1
           -  full(t1)

   5 -  SEL$1 / T2@SEL$1
           -  use_nl_with_index(t2)

Summary

When you see an entry in v$session_longops it is an indicator of an operation that took a “long” time to complete; but “completion” of the operation and “work done” by the operation are not the same thing. The operation may be the victim of a problem, not the cause. If the problem query is still in memory then v$session_longops gives you enough information to find the query (and check you’re looking at the right plan) so that you have a better chance of identifying the real offender.
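
As a minimal sketch of that last step (an illustration, not part of the original test): v$session_longops reports sql_id, sql_plan_hash_value and sql_plan_line_id, so it can be joined to v$sql_plan to show the reported operation alongside its parent, which is often the first place to look for the real offender. The column aliases are invented for the example, and the distinct is there only because several child cursors may share the same plan.

```sql
select  distinct
        vsl.sql_id,
        vsl.elapsed_seconds,
        op.id                                   op_id,
        op.operation || ' ' || op.options       operation,
        op.object_name,
        par.id                                  parent_id,
        par.operation || ' ' || par.options     parent_operation
from
        v$session_longops       vsl,
        v$sql_plan              op,     -- the plan line reported by longops
        v$sql_plan              par     -- the parent of that plan line
where
        vsl.sid = (select ms.sid from v$mystat ms where rownum = 1)
and     op.sql_id               = vsl.sql_id
and     op.plan_hash_value      = vsl.sql_plan_hash_value
and     op.id                   = vsl.sql_plan_line_id
and     par.sql_id              = op.sql_id
and     par.plan_hash_value     = op.plan_hash_value
and     par.child_number        = op.child_number
and     par.id                  = op.parent_id
order by
        1, 3    -- sql_id, then plan line id
;
```

For the demo above this should report the tablescan at operation 4 with the nested loop at operation 3 as its parent, which points straight at "the other" side of the join.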

October 11, 2022

Tracing Tip #JoelKallmanDay

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 10:27 am BST Oct 11,2022

I don’t really do tips because I often see simple tips that have a specific purpose being abused; but it’s a special day in the Oracle community so here’s a quick tip in tribute to a great sharer.

I’ve been doing some work with datapump recently and needed to get a better picture of how various processes were hanging together. But when you call expdp or impdp you don’t have direct control over all the processes that might start running – and that covers a DMxx (datapump master) process, multiple DWxx (datapump worker) processes and Pnnn (parallel execution) processes.

As a quick, dirty, and brutal starting point (on a dedicated test system) I just enabled tracing for all the processes I wanted to follow using a variant of the new(er) syntax, specifying process naming patterns:

Here’s what I did for 11g (see Footnote):

alter system set events 'sql_trace {process:pname = p00 | dw | dm } level=8';

This enabled tracing whenever a process with a name (v$process.pname) starting with ‘P00’, ‘DM’ or ‘DW’ started running. (I restricted myself to parallel processes p000 – p009 because I had set a very small value for parallel_max_servers.)
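
To see which processes such a pattern would pick up, and where their trace files end up, a query along the following lines against v$process may help. This is a sketch rather than something taken from the original test; the like patterns simply mirror the three prefixes used above, and the processes will only be visible while they are still running.

```sql
select
        pname, spid, tracefile
from
        v$process
where
        pname like 'P00%'
or      pname like 'DW%'
or      pname like 'DM%'
order by
        pname
;
```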

I was a little surprised to discover that this didn’t work when I tried to use it in a 19c PDB:

SQL> alter system set events 'sql_trace {process:pname = dw | dm | p0} level=8';
alter system set events 'sql_trace {process:pname = dw | dm | p0} level=8'
*
ERROR at line 1:
ORA-49100: Failed to process event statement [sql_trace {process:pname = dw | dm | p0} level=8]
ORA-49601: syntax error: found "|": expecting one of: ":" etc..

So I read the detail in the error message and changed the statement (after two wrong guesses) to:

alter system set events 'sql_trace {process:pname = dw | process:pname=dm | process:pname=p00} level=8';

This worked.
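
When the test is over the tracing has to be switched off again. A minimal sketch, assuming the "off" form accepts the same process filter as the statement that enabled the trace (an assumption based on the general shape of the event syntax, not something confirmed in this note, so check it on a test system first):

```sql
-- Assumed counterpart to the 19c statement above: same filter, "off" in place of "level=8"
alter system set events 'sql_trace {process:pname = dw | process:pname=dm | process:pname=p00} off';
```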

Footnote

I’ve updated the SQL I used for the 11g trace – the process parameter used to look like this: “process:pname = dw | dm | p00”, but this didn’t actually work and only appeared to work because (I assume) of an earlier trace call I had done.

Strangely – and I haven’t tested this idea to destruction – there seems to be some way in which the ordering of the lengths of the pname values affects which ones are used. When I did some repeat tests (by which time I’d restarted the instance) I found two options in 11g that would trace the processes I wanted traced:

  • “process:pname = p00 | dw | dm”
  • “process:pname = dw | dm | p”

You’ll notice that in both cases the process name “templates” are in order of descending length. The second option is highly undesirable, of course, since this also enables tracing of pmon.

There’s room for further investigation, but I’ve done what I wanted to do, got a note of the anomaly, and don’t expect to do something like this again on 11g.

I will point out, however, that I got the original basic form for the process parameter by descending through the oradebug doc tree in 11g and 12cR1; but the content reported from oradebug doc in 12cR2 shows the form that I reported for 19c. If you check through MOS, though, you’ll find a third form which works in 11g (and, possibly, later):

alter system set events 'sql_trace {process : pname = dw | pname = dm | pname = p00} level=8';

September 29, 2022

Case Study

Filed under: Execution plans,Oracle,Performance,Troubleshooting — Jonathan Lewis @ 6:27 pm BST Sep 29,2022

A recent question on the Oracle Developer Community forum asked for help with a statement that was taking a long time to run. The thread included the results from a trace file that had been passed through tkprof so we have the query and the actual execution plan with the Row Source Operation details.

Here’s the query – extracted from the tkprof output:

SELECT DISTINCT
       pll.po_line_id,
       ploc.line_location_id,
       (SELECT ptl.line_type
          FROM apps.po_line_types_tl ptl
         WHERE ptl.line_type_id = pll.line_type_id AND ptl.LANGUAGE = 'US')
           "Line_Type",
       ploc.quantity_accepted,
       NULL
           release_approved_date,
       NULL
           release_date,
       NULL
           release_hold_flag,
       NULL
           release_type,
       DECODE (ploc.po_release_id, NULL, NULL, ploc.quantity)
           released_quantity,
       (SELECT items.preprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PreProcessing_LT",
       (SELECT items.full_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "Processing_LT",
       (SELECT items.postprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PostProcessing_LT",
       ploc.firm_status_lookup_code,
       NVL (
           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id)),
           ploc.promised_date)
           "Original_Promise_Date",
       (SELECT items.long_description
          FROM apps.mtl_system_items_tl items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id IN
                       (SELECT fin.inventory_organization_id
                          FROM apps.financials_system_params_all fin
                         WHERE fin.org_id = pha.org_id)
               AND items.LANGUAGE = 'US')
           "Item_Long_Description",
       NVL (ploc.approved_flag, 'N')
           approved_code,
       pvs.country
           "Supplier_Site_Country",
       pll.note_to_vendor,
         NVL (ploc.quantity, 0)
       - NVL (ploc.quantity_cancelled, 0)
       - NVL (ploc.quantity_received, 0) * ploc.price_override
           "Shipment_Amount",
       ploc.attribute4
           "PO_Ship_Date",
       (SELECT meaning
          FROM apps.fnd_lookup_values
         WHERE     lookup_type = 'SHIP_METHOD'
               AND lookup_code = ploc.attribute9
               AND language = 'US')
           "Ship_Method",
       (SELECT prla.note_to_receiver
          FROM apps.po_req_distributions_all  prda
               INNER JOIN apps.po_requisition_lines_all prla
                   ON prda.requisition_line_id = prla.requisition_line_id
         WHERE prda.distribution_id = pdi.req_distribution_id)
           "Note_To_Receiver",
       DECODE (pha.USER_HOLD_FLAG, 'Y', 'Y', pll.USER_HOLD_FLAG)
           "Hold_Flag",
       (SELECT ABC_CLASS_NAME
          FROM APPS.MTL_ABC_ASSIGNMENT_GROUPS  ASG
               INNER JOIN APPS.MTL_ABC_ASSIGNMENTS ASSI
                   ON ASG.ASSIGNMENT_GROUP_ID = ASSI.ASSIGNMENT_GROUP_ID
               INNER JOIN APPS.MTL_ABC_CLASSES classes
                   ON ASSI.ABC_CLASS_ID = classes.ABC_CLASS_ID
         WHERE     ASG.organization_id = ploc.SHIP_TO_ORGANIZATION_ID
               AND ASG.ASSIGNMENT_GROUP_NAME = 'MIN ABC Assignment'
               AND ASSI.inventory_item_id = pll.item_id)
           ABCClass,
       (SELECT CONCATENATED_SEGMENTS AS charge_accountsfrom
          FROM apps.gl_code_combinations_kfv gcc
         WHERE gcc.code_combination_id = pdi.code_combination_id)
           AS charge_accounts
  FROM apps.po_headers_all         pha,
       apps.po_lines_all           pll,
       apps.po_line_locations_all  ploc,
       apps.po_distributions_all   pdi,
       apps.per_all_people_f       papf,
       apps.AP_SUPPLIERS           pv,
       apps.AP_SUPPLIER_SITES_ALL  pvs,
       apps.AP_SUPPLIER_CONTACTS   pvc,
       apps.ap_terms               apt,
       apps.po_lookup_codes        plc1,
       apps.po_lookup_codes        plc2,
       apps.hr_locations           hlv_line_ship_to,
       apps.hr_locations           hlv_ship_to,
       apps.hr_locations           hlv_bill_to,
       apps.hr_organization_units  hou,
       apps.hr_locations_no_join   loc,
       apps.hr_locations_all_tl    hrl1,
       apps.hr_locations_all_tl    hrl2
 WHERE     1 = 1
       AND pll.po_header_id(+) = pha.po_header_id
       AND ploc.po_line_id(+) = pll.po_line_id
       AND pdi.line_location_id(+) = ploc.line_location_id
       AND ploc.shipment_type IN ('STANDARD', 'PLANNED')
       AND papf.person_id(+) = pha.agent_id
       AND TRUNC (SYSDATE) BETWEEN papf.effective_start_date
                               AND papf.effective_end_date
       AND papf.employee_number IS NOT NULL
       AND pv.vendor_id(+) = pha.vendor_id
       AND pvs.vendor_site_id(+) = pha.vendor_site_id
       AND pvc.vendor_contact_id(+) = pha.vendor_contact_id
       AND apt.term_id(+) = pha.terms_id
       AND plc1.lookup_code(+) = pha.fob_lookup_code
       AND plc1.lookup_type(+) = 'FOB'
       AND plc2.lookup_code(+) = pha.freight_terms_lookup_code
       AND plc2.lookup_type(+) = 'FREIGHT TERMS'
       AND hlv_line_ship_to.location_id(+) = ploc.ship_to_location_id
       AND hlv_ship_to.location_id(+) = pha.ship_to_location_id
       AND hlv_bill_to.location_id(+) = pha.bill_to_location_id
       AND hou.organization_id = pha.org_id
       AND hou.location_id = loc.location_id(+)
       AND hrl1.location_id(+) = pha.ship_to_location_id
       AND hrl1.LANGUAGE(+) = 'US'
       AND hrl2.location_id(+) = pha.bill_to_location_id
       AND hrl2.LANGUAGE(+) = 'US'
       AND hou.organization_id IN (2763)
       AND NVL (pha.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (pll.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (ploc.cancel_flag, 'N') = 'N'
       AND pha.authorization_status IN
               ('APPROVED', 'REQUIRES REAPPROVAL', 'IN PROCESS')

As you can see there are 10 inline scalar subqueries (highlighted) in the query with a select distinct to finish off the processing of an 18 table join. That’s a lot of scalar subqueries so it’s worth asking whether the code should be rewritten to use joins (though in newer versions of Oracle some of the subqueries might be transformed to outer joins anyway – but the OP is using 11.2.0.4). We also know that a distinct is sometimes a hint that the code has a logic error that has been “fixed” by eliminating duplicates.
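
As a minimal sketch of what "rewritten to use joins" might look like: the three scalar subqueries against mtl_system_items_b all use the same correlation predicates, so they could be collapsed into a single outer join. The fragment below covers only two of the eighteen tables, simplifies the join between them to an inner join, and assumes that (inventory_item_id, organization_id) is unique in mtl_system_items_b (the index name MTL_SYSTEM_ITEMS_B_U1 and the unique scans reported for it suggest so); without that uniqueness the join could multiply rows. ANSI syntax is used because the traditional (+) notation in 11.2.0.4 does not allow one table to be outer joined to two others.

```sql
select
        pll.po_line_id,
        ploc.line_location_id,
        items.preprocessing_lead_time   "PreProcessing_LT",
        items.full_lead_time            "Processing_LT",
        items.postprocessing_lead_time  "PostProcessing_LT"
from
        apps.po_lines_all pll
inner join
        apps.po_line_locations_all ploc
on      ploc.po_line_id = pll.po_line_id
left outer join
        apps.mtl_system_items_b items   -- one visit per row instead of three scalar subqueries
on      items.inventory_item_id = pll.item_id
and     items.organization_id   = ploc.ship_to_organization_id
;
```

Whether this is actually faster depends on the data and on how much benefit the original query gets from scalar subquery caching, so it needs testing rather than assuming.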

Ignoring those points, let’s consider the execution plan from the tkprof output which (with a tiny bit of extra formatting) is as follows:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.46       1.75          0          3          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    50346    279.02    1059.39     179103   30146895          0      755164
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50348    279.49    1061.14     179103   30146898          0      755164

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 678  
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         9          9          9  TABLE ACCESS BY INDEX ROWID PO_LINE_TYPES_TL (cr=20 pr=0 pw=0 time=680 us cost=2 size=32 card=1)
         9          9          9   INDEX UNIQUE SCAN PO_LINE_TYPES_TL_U1 (cr=11 pr=0 pw=0 time=323 us cost=1 size=0 card=1)(object id 63682480)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267756 pr=28 pw=0 time=22598079 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=4644552 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267747 pr=0 pw=0 time=2442479 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=1238342 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267743 pr=0 pw=0 time=2029190 us cost=4 size=14 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720932 pr=0 pw=0 time=967729 us cost=3 size=0 card=1)(object id 42812859)

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

    587486     587486     587486  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_TL (cr=3436629 pr=3564 pw=0 time=64125044 us cost=5 size=34 card=1)
    587486     587486     587486   INDEX RANGE SCAN MTL_SYSTEM_ITEMS_TL_U1 (cr=2852930 pr=869 pw=0 time=45628505 us cost=4 size=0 card=1)(object id 136492495)
         1          1          1    TABLE ACCESS BY INDEX ROWID FINANCIALS_SYSTEM_PARAMS_ALL (cr=645351 pr=0 pw=0 time=5743158 us cost=2 size=10 card=1)
    322268     322268     322268     INDEX SKIP SCAN FINANCIALS_SYSTEM_PARAMS_U1 (cr=323083 pr=0 pw=0 time=5104895 us cost=1 size=0 card=1)(object id 42770563)

        10         10         10  TABLE ACCESS BY INDEX ROWID FND_LOOKUP_VALUES (cr=51 pr=1 pw=0 time=3620 us cost=5 size=60 card=1)
        20         20         20   INDEX RANGE SCAN FND_LOOKUP_VALUES_X99 (cr=31 pr=1 pw=0 time=2133 us cost=4 size=0 card=1)(object id 42759866)

    634276     634276     634276  NESTED LOOPS  (cr=3540930 pr=5535 pw=0 time=181518759 us cost=5 size=28 card=1)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQ_DISTRIBUTIONS_ALL (cr=1631471 pr=5253 pw=0 time=65405333 us cost=3 size=12 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQ_DISTRIBUTIONS_U1 (cr=994522 pr=5252 pw=0 time=31023194 us cost=2 size=0 card=1)(object id 42788583)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQUISITION_LINES_ALL (cr=1909459 pr=282 pw=0 time=115275921 us cost=2 size=16 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQUISITION_LINES_U1 (cr=944449 pr=268 pw=0 time=12285440 us cost=1 size=0 card=1)(object id 42789681)

    511989     511989     511989  NESTED LOOPS  (cr=3533763 pr=6 pw=0 time=8999321 us cost=5 size=55 card=1)
    511989     511989     511989   NESTED LOOPS  (cr=2850293 pr=6 pw=0 time=7086027 us cost=4 size=45 card=1)
    576055     576055     576055    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENT_GROUPS (cr=612378 pr=0 pw=0 time=2002832 us cost=2 size=29 card=1)
    576055     576055     576055     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENT_GROUPS_U2 (cr=36323 pr=0 pw=0 time=951307 us cost=1 size=0 card=1)(object id 42783622)
    511989     511989     511989    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENTS (cr=2237915 pr=6 pw=0 time=4672006 us cost=3 size=16 card=1)
    511989     511989     511989     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENTS_U1 (cr=1551490 pr=4 pw=0 time=2533524 us cost=2 size=0 card=1)(object id 42757737)
    511989     511989     511989   TABLE ACCESS BY INDEX ROWID MTL_ABC_CLASSES (cr=683470 pr=0 pw=0 time=1488045 us cost=1 size=10 card=1)
    511989     511989     511989    INDEX UNIQUE SCAN MTL_ABC_CLASSES_U1 (cr=171481 pr=0 pw=0 time=693745 us cost=0 size=0 card=1)(object id 42789694)

     13320      13320      13320  TABLE ACCESS BY INDEX ROWID GL_CODE_COMBINATIONS (cr=34801 pr=0 pw=0 time=802675 us cost=3 size=49 card=1)
     13320      13320      13320   INDEX UNIQUE SCAN GL_CODE_COMBINATIONS_U1 (cr=21481 pr=0 pw=0 time=397344 us cost=2 size=0 card=1)(object id 42775044)


    755164     755164     755164  HASH UNIQUE (cr=30147018 pr=179103 pw=0 time=1058922684 us cost=749257 size=197349453 card=482517)
    768890     768890     768890   HASH JOIN  (cr=7289842 pr=6926 pw=0 time=244582512 us cost=696202 size=197349453 card=482517)
    140451     140451     140451    TABLE ACCESS FULL PER_ALL_PEOPLE_F (cr=38207 pr=0 pw=0 time=313692 us cost=18484 size=13278261 card=428331)
    768890     768890     768890    NESTED LOOPS OUTER (cr=7251635 pr=6926 pw=0 time=242897348 us cost=672652 size=30016980 card=79410)
    755121     755121     755121     NESTED LOOPS OUTER (cr=5538283 pr=6031 pw=0 time=154841427 us cost=443987 size=28382903 card=78623)
    755121     755121     755121      NESTED LOOPS OUTER (cr=5508916 pr=6031 pw=0 time=153523676 us cost=443982 size=18184959 card=51809)
    755121     755121     755121       NESTED LOOPS OUTER (cr=5386279 pr=6031 pw=0 time=151985656 us cost=443978 size=11642422 card=34142)
    755121     755121     755121        NESTED LOOPS  (cr=5090949 pr=6031 pw=0 time=139220421 us cost=375644 size=11574138 card=34142)
    792959     792959     792959         NESTED LOOPS  (cr=1747964 pr=134 pw=0 time=64597738 us cost=109035 size=19934760 card=73560)
    254919     254919     254919          HASH JOIN OUTER (cr=315780 pr=6 pw=0 time=14811187 us cost=29121 size=5413350 card=22650)
    254919     254919     254919           NESTED LOOPS OUTER (cr=286919 pr=0 pw=0 time=12395919 us cost=13792 size=5209500 card=22650)
    254919     254919     254919            HASH JOIN RIGHT OUTER (cr=107134 pr=0 pw=0 time=12153146 us cost=13790 size=3868572 card=17426)
      3834       3834       3834             VIEW  HR_LOCATIONS (cr=3913 pr=0 pw=0 time=15826 us cost=125 size=360 card=60)
      3834       3834       3834              NESTED LOOPS  (cr=3913 pr=0 pw=0 time=15055 us cost=125 size=1080 card=60)
      3834       3834       3834               TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=262 pr=0 pw=0 time=11211 us cost=125 size=304 card=38)
      3834       3834       3834               INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=6183 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919             HASH JOIN RIGHT OUTER (cr=103221 pr=0 pw=0 time=11917174 us cost=13666 size=3764016 card=17426)
      3834       3834       3834              VIEW  HR_LOCATIONS (cr=3898 pr=0 pw=0 time=14651 us cost=125 size=360 card=60)
      3834       3834       3834               NESTED LOOPS  (cr=3898 pr=0 pw=0 time=14267 us cost=125 size=1080 card=60)
      3834       3834       3834                TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=247 pr=0 pw=0 time=9532 us cost=125 size=304 card=38)
      3834       3834       3834                INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=9539 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919              HASH JOIN RIGHT OUTER (cr=99323 pr=0 pw=0 time=11817243 us cost=13541 size=3659460 card=17426)
        45         45         45               INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=21 pr=0 pw=0 time=614 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919               HASH JOIN RIGHT OUTER (cr=99302 pr=0 pw=0 time=11729251 us cost=13537 size=2805586 card=17426)
        59         59         59                INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=20 pr=0 pw=0 time=445 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919                NESTED LOOPS  (cr=99282 pr=0 pw=0 time=11653162 us cost=13533 size=1951712 card=17426)
         1          1          1                 NESTED LOOPS OUTER (cr=116 pr=0 pw=0 time=113273 us cost=3 size=40 card=1)
         1          1          1                  NESTED LOOPS  (cr=113 pr=0 pw=0 time=113227 us cost=2 size=32 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_ALL_ORGANIZATION_UNTS_TL_PK (cr=110 pr=0 pw=0 time=113164 us cost=1 size=17 card=1)(object id 63680720)
         1          1          1                   TABLE ACCESS BY INDEX ROWID HR_ALL_ORGANIZATION_UNITS (cr=3 pr=0 pw=0 time=59 us cost=1 size=15 card=1)
         1          1          1                    INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42789144)
         1          1          1                  TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=3 pr=0 pw=0 time=42 us cost=1 size=8 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42797079)
    254919     254919     254919                 TABLE ACCESS BY INDEX ROWID PO_HEADERS_ALL (cr=99166 pr=0 pw=0 time=11505632 us cost=13530 size=1254672 card=17426)
    255397     255397     255397                  INDEX SKIP SCAN PO_HEADERS_ALL_X3 (cr=1753 pr=0 pw=0 time=725236 us cost=352 size=0 card=37674)(object id 42773719)
    254883     254883     254883            INDEX UNIQUE SCAN AP_TERMS_TL_U1 (cr=179785 pr=0 pw=0 time=183291 us cost=0 size=8 card=1)(object id 42798416)
    482528     482528     482528           TABLE ACCESS FULL AP_SUPPLIER_SITES_ALL (cr=28861 pr=6 pw=0 time=227983 us cost=13727 size=4323123 card=480347)
    792959     792959     792959          TABLE ACCESS BY INDEX ROWID PO_LINES_ALL (cr=1432184 pr=128 pw=0 time=53002963 us cost=5 size=96 card=3)
    793375     793375     793375           INDEX RANGE SCAN PO_LINES_U2 (cr=504726 pr=20 pw=0 time=17603112 us cost=2 size=0 card=5)(object id 42755253)
    755121     755121     755121         TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ALL (cr=3342985 pr=5897 pw=0 time=71357938 us cost=4 size=68 card=1)
   1138558    1138558    1138558          INDEX RANGE SCAN PO_LINE_LOCATIONS_N15 (cr=1707311 pr=5830 pw=0 time=37903421 us cost=3 size=0 card=2)(object id 63697005)
    723002     723002     723002        VIEW PUSHED PREDICATE  HR_LOCATIONS (cr=295330 pr=0 pw=0 time=11391536 us cost=2 size=2 card=1)
    723002     723002     723002         NESTED LOOPS  (cr=295330 pr=0 pw=0 time=11004720 us cost=2 size=18 card=1)
    723002     723002     723002          INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=146911 pr=0 pw=0 time=1391389 us cost=1 size=10 card=1)(object id 42783719)
    723002     723002     723002          TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=148419 pr=0 pw=0 time=9233363 us cost=1 size=8 card=1)
    723002     723002     723002           INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=117800 pr=0 pw=0 time=836734 us cost=0 size=0 card=1)(object id 42797079)
    755119     755119     755119       INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=122637 pr=0 pw=0 time=829404 us cost=0 size=20 card=2)(object id 42783719)
    755121     755121     755121      INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=29367 pr=0 pw=0 time=716408 us cost=0 size=20 card=2)(object id 42783719)
    768883     768883     768883     TABLE ACCESS BY INDEX ROWID PO_DISTRIBUTIONS_ALL (cr=1713352 pr=895 pw=0 time=75314769 us cost=3 size=17 card=1)
    768883     768883     768883      INDEX RANGE SCAN PO_DISTRIBUTIONS_N1 (cr=1096671 pr=874 pw=0 time=24392643 us cost=2 size=0 card=1)(object id 42782429)

The plan is a bit long, but you may recall that a query with scalar subqueries in the select list reports the plans for each of the separate scalar subqueries before reporting the main query block – and I’ve inserted blank lines in the output above to improve the visibility of the individual blocks / scalar subqueries.

An odd little detail of this tkprof output was that there was no report of the wait information recorded against the query, though the following information appeared as the summary for the trace file, giving us a very good idea of the wait events for the individual query:

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        6      0.85       2.14          0          6          0           0
Execute      6      0.00       0.00          0          7        104          85
Fetch    50358    279.03    1059.39     179103   30146895          0      755329
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50370    279.88    1061.54     179103   30146908        104      755414

Misses in library cache during parse: 3

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   50363        0.00          0.00
  SQL*Net message from client                 50362      157.17        227.70
  row cache lock                                141        0.03          0.67
  library cache lock                             77        0.01          0.21
  library cache pin                              75        0.01          0.27
  Disk file operations I/O                      791        0.00          0.01
  gc current block 3-way                     835881        0.15        305.35
  gc current block 2-way                     471360        0.24        144.04
  KJC: Wait for msg sends to complete            40        0.00          0.00
  gc cr multi block request                       8        0.00          0.00
  gc current block congested                  10014        0.03          4.23
  gc cr block 3-way                           20215        0.06          4.69
  gc current grant busy                          20        0.00          0.00
  gc cr grant 2-way                          165010        0.07         25.13
  db file sequential read                    179103        0.05        196.31
  gc cr grant congested                         729        0.19          0.36
  gc current block busy                       71431        0.05        118.15
  gc cr block 2-way                            1800        0.01          0.31
  latch free                                      3        0.00          0.00
  gc cr block congested                         197        0.01          0.06
  latch: cache buffers chains                    45        0.00          0.00
  latch: gc element                              15        0.00          0.00
  gc cr block busy                               15        0.02          0.07
  latch: object queue header operation            1        0.00          0.00
  KSV master wait                                 2        0.00          0.00
  ASM file metadata operation                     1        0.00          0.00
  SQL*Net more data to client                     1        0.00          0.00
  gc current grant 2-way                          6        0.00          0.00

An important initial observation is that the query returned 750,000 rows in 50,000 fetches (all figures rounded for convenience) and that’s consistent with the SQL*Plus default arraysize of 15. So there might be a little time saved by setting the arraysize to a larger value, but only a few tens of seconds: the “SQL*Net message from client” total of 227 seconds includes a single maximum wait of 157 seconds, which leaves about 70 seconds spread across the fetches. There may also be some benefit in increasing the SQL*Net SDU_SIZE at the same time. Critically, though, we should ask “why do you want a query to return 750,000 rows?”, and “how fast do you think is ‘reasonable’?” You’ll also note from the “gc” waits that the system is based on RAC with at least 3 nodes – and RAC is always a suspect when you see unexpected time spent in a query.
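As a minimal sketch of the arraysize change, assuming the report is run from SQL*Plus (the value 200 is an arbitrary illustration, not a figure from the forum thread):

```sql
-- SQL*Plus commands (not SQL statements): report the current fetch
-- array size, then raise it for the rest of the session.
show arraysize
set arraysize 200

-- Re-run the report query here with tracing enabled, then compare the
-- Fetch count in the new tkprof output with the 50,358 reported above.
```

With roughly 755,000 rows an arraysize of 200 should bring the number of fetch calls down to something in the region of 3,800, though the elapsed time saved depends on the network round-trip time rather than on the count itself.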

Where does most of the time go between the last hash join (line 62) and the hash unique (line 61) of the driving query block? It’s in the query block whose plan starts at line 28, where we see 163,000 physical blocks read (pr=) and 535 seconds (time= is reported in microseconds). About 6,400 of those blocks come from the index range scan operation at line 32, but most come from line 31 fetching 1 million rows (by index rowid) from table po_line_locations_archive_all.

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

This part of the workload comes from 672,743 executions of the subquery starting at line 36 of the original query text:

           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id))

If we want to improve the performance of this query with a minimum of re-engineering, recoding and risk then a good point to start would be to examine this query block in isolation and see if there is a simple, low-cost way of improving its efficiency. (Note: this may not be a route to optimising the whole query “properly”, but it may give a quick win that is “good enough”.)
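One possible shape for such a localised rewrite, offered as a sketch and not as the fix adopted in the thread, is to let a single pass over the archive table pick the promised_date for the lowest revision_num. It relies on two assumptions that are not confirmed by the source: that line_location_id on its own identifies the po_header_id and po_line_id in the archive table, and that (line_location_id, revision_num) is unique (which the unique scan of po_line_locations_archive_u1 hints at). The aliases pha, pll and ploc are those of the enclosing query, so this is a select-list fragment and not a standalone statement.

```sql
-- Sketch only: equivalent to the original subquery ONLY IF
--   (a) line_location_id determines po_header_id and po_line_id, and
--   (b) (line_location_id, revision_num) is unique in the archive table.
-- Unlike the original it can never raise ORA-01427, because an aggregate
-- without GROUP BY always returns exactly one row.
           (SELECT MIN (pla.promised_date)
                       KEEP (DENSE_RANK FIRST ORDER BY pla.revision_num)
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.promised_date IS NOT NULL)
```

If the assumptions hold, each execution would visit the archive rows for a line_location_id once instead of twice; whether that removes a worthwhile share of the 536 seconds would have to be tested against the real data.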

We could go a little further down this route of optimising the scalar subqueries by looking at the time spent in each of them in turn. Taking out the top line of each of the separate sections of the plan and extracting just the pr, pw and time values (which I’ll scale back from microseconds to seconds) we get the following:

pr=      0      pw=0    time=   0
pr=     28      pw=0    time=  23
pr=      0      pw=0    time=   2
pr=      0      pw=0    time=   2
pr= 163043      pw=0    time= 536
pr=   3564      pw=0    time=  64
pr=      1      pw=0    time=   0
pr=   5535      pw=0    time= 182
pr=      6      pw=0    time=   9
pr=      0      pw=0    time=   1

The 8th scalar subquery (line 42 in the plan, line 75 in the query) gives us an opportunity to reduce the run time by 182 seconds, so might be worth a little investment in programmer time.

The 6th subquery (line 34 in the plan, line 49 in the query) adds only 64 seconds to the run time, so we might be less inclined to do anything about it.

You might note that the 2nd, 3rd and 4th subqueries are against the same table with the same predicate to get three different columns – this group is the “obvious” choice for recoding as a single join rather than three separate subqueries, but if you look at the total times of the three subqueries the “extra” two executions add only two seconds each to the total time – so although this scalar subquery coding pattern is undesirable, it’s not necessarily going to be worth expending the effort to rewrite it in this case.

If you’re wondering, by the way, why different subqueries are reporting different numbers of rows returned (and each one should return at most one row on each execution), there are two reasons for any subquery to be reporting fewer than the 768,890 rows reported by the basic driving hash join:

  1. an execution may simply return no rows,
  2. there may be some benefits from scalar subquery caching.

One of the nice details about newer versions of Oracle is that the “starts” statistic is also reported in the trace/tkprof output so you would be able to see how much your query had benefited from scalar subquery caching.

If we add together the time reported by each of the scalar subquery sections of the plan the total time reported is approximately 819 seconds. Cross-checking with the difference in the times reported for operations 61 and 62 (hash unique of hash join) we see: 1,059 seconds – 245 seconds = 814 seconds. This is a good match (allowing for the accumulation of a large number of small errors) for the 819 seconds reported in the subqueries – so the hash unique isn’t a significant part of the query even though it has virtually no effect on the volume of data. You’ll note that it didn’t spill to disc (pw = 0) but completed in memory.
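The arithmetic behind that cross-check is simple enough to do in your head, but as a quick sketch it can be confirmed from any SQL prompt, using the ten rounded per-subquery times listed earlier:

```sql
-- Sum of the rounded time= values for the ten scalar subquery blocks,
-- compared with the elapsed time between the hash join and hash unique.
select
        0 + 23 + 2 + 2 + 536 + 64 + 0 + 182 + 9 + 1    subquery_seconds,
        1059 - 245                                      unique_minus_join
from    dual;
```

The two columns come out as 819 and 814, which is the 5 second discrepancy mentioned above.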

Summary

I’ve written a quick note on this query because the coding style was undesirable and the execution plan quite lengthy. I’ve reviewed how the style of the SQL is echoed in the shape of the plan. I’ve then pursued the idea of optimising the code “piece-wise” to see if there were any opportunities for improving the performance “enough” without going through the effort of a complete redesign of the query. [Update: One of the participants in the thread is currently walking through the mechanics of manually unnesting the most expensive scalar subquery into an outer join.]

Given the information in the Row Source Operation section of the tkprof output it proved easy to identify where the largest amounts of time appeared that might be reduced by localised optimisation.

In passing I pointed out the possibility of reducing the time spent on network traffic by increasing the array fetch size, and increasing the SDU_SIZE (at both ends of the connection) for the SQL*Net messages to client.

Footnote (addendum)

I made a passing reference to the waits that told us that the user was running RAC. These waits merit a few follow-up comments.

The numbers for “gc” waits are high. Of particular note are the 71,000 waits and 118 seconds waited on “gc current block busy” which wave a big red flag telling us that there’s too much DML modifying the same object(s) from multiple nodes at the same time. (The even larger numbers for the “gc current block 2/3-way” say the same, but “busy” really emphasises the “hot-spot” aspect of the problem.)

Ideally we would like to see exactly where in the execution plan the bulk of those waits is occurring and, since the OP has been able to supply a trace file for the query, it’s possible that the query can be re-run to produce the SQL Monitor report (if the OP is suitably licensed) that summarises the Active Session History (ASH) for each line of the plan.

If the ASH data were available for a run of the report we could then do some analysis of parameter values recorded in v$active_session_history to see if that supplied further information. Unfortunately the view v$event_name doesn’t tell us what the parameter values mean for most of the “gc current%” waits, but a couple of the ones that do have descriptions report parameter1 as the file# and parameter2 as the block#, so maybe that’s true for many of them. (For some of the waits parameter1 is listed as the “le” (lock element), which doesn’t really help very much.)
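A sketch of the sort of analysis I have in mind follows. It assumes the Diagnostics Pack licence is in place, and m_sql_id is a hypothetical substitution variable holding the sql_id of the report query. On RAC you would probably want the gv$ version of the view so that samples from every instance are included.

```sql
-- What does v$event_name say the parameters mean for these events?
select  name, parameter1, parameter2, parameter3
from    v$event_name
where   name like 'gc current%'
order by
        name
;

-- Which plan lines, and which (p1, p2) pairs, account for most of the
-- sampled "gc current" waits? Treating p1/p2 as file#/block# is an
-- assumption for any event where v$event_name does not say so.
select
        sql_plan_line_id, event,
        p1text, p1, p2text, p2,
        count(*)        samples
from
        v$active_session_history
where
        sql_id = '&m_sql_id'
and     event like 'gc current%'
group by
        sql_plan_line_id, event, p1text, p1, p2text, p2
order by
        samples desc
;
```

A small number of (p1, p2) pairs accounting for a large fraction of the samples would support the “hot-spot” interpretation of the busy waits.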

Another diagnostic that could be very helpful is to take a snapshot of the session activity stats (v$sesstat) for the session as this might tell us that part of the load comes from “unexpected” work going on. In particular if we do an analysis of the “db file sequential read” waits we may find that many of the waits are for blocks in the undo tablespace, which would prompt us to examine the session stats to see what they tell us through the “% – undo records applied” statistics.
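As a minimal sketch of that check, run from a second session before and after the report so that the values can be differenced (m_sid is a hypothetical substitution variable for the SID of the session running the report):

```sql
-- Session statistics whose names end in "undo records applied", e.g.
-- "data blocks consistent reads - undo records applied".
select
        sn.name, ss.value
from
        v$sesstat       ss,
        v$statname      sn
where
        sn.statistic# = ss.statistic#
and     ss.sid        = &m_sid
and     sn.name like '%undo records applied'
and     ss.value     != 0
order by
        sn.name
;
```

Large values here would suggest that a significant part of the single block read and “gc” activity is due to read-consistency work and not to the query’s own data.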

As a final comment – linking the “gc” activity back to my comment “That’s a lot of scalar subqueries so it’s worth asking whether the code should be rewritten to use joins” – if you add an extra table to a query with a simple join condition to add columns from that table to the query select list then Oracle can pin various index blocks; if you replace the join with a scalar subquery (which means you’re after just one column from one row each time) then Oracle has to pin and release all the index blocks on each call to the subquery. The benefit of the subquery approach is that scalar subquery caching may mean the subquery is rarely executed (check the highly suggestive stats in the plan for the first and seventh scalar subquery blocks – lines 16 and 39); the downside to the subquery approach is that you may end up spending a lot more time in buffer cache handling which, for RAC, includes the global cache (gc) management.

September 12, 2022

Dumping redo

Filed under: Infrastructure,Oracle,redo,Troubleshooting — Jonathan Lewis @ 10:05 am BST Sep 12,2022

In the past I’ve sometimes had to dump the contents of the redo log to a trace file when I needed to find out what work Oracle was doing behind the scenes. To minimise the volume dumped by the “alter system dump logfile” command and make it easier to find the bit I wanted to see I used to “switch logfile” just before (and sometimes just after) the statement I was investigating.

With the advent of pluggable databases the “switch logfile” command now raises Oracle error: “ORA-65040: operation not allowed from within a pluggable database”, so I had to change the strategy. This is just a brief note (echoing a footnote to an older note) of the approach I now use:

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

-- do something interesting here

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set tracefile_identifier='sometextyoulike';

alter system dump redo scn min &start_scn scn max &end_scn ;
alter session set tracefile_identifier='';

The list of options for the dump has been extended since I published the note on dumping the log file, and now (19.11.0.0) allows the following options (using C notation for the type of the variables you supply to each parameter):

 rdba min  %d rdba max  %d tablespace_no  %d
 dba min  %u  %u dba max  %u  %u
 securefile_dba  %u  %u
 length  %d
 time min  %d
 time max  %d
 layer  %d
 opcode  %d
 scn min  %llu
 scn max  %llu
 xid  %d  %d  %d
 objno  %u
 con_id  %d
 skip corruption


If you try to restrict the dump on objno (object id) or xid (transaction id) then the trace file will skip any redo records generated by private threads / in-memory undo and report the text: “Skipping IMU Redo Record: cannot be filtered by XID/OBJNO”

The tablespace_no option can only be used when both rdba min and rdba max (relative data block address range) have been specified.

The con_id option may only be legal when used to specify a PDB from the CDB.

Remember – when you dump redo you don’t get just the redo for your session; there is some scope for being selective, but the starting point would be all the redo for the PDB you’re working from.
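Once the dump has completed you need to find the trace file. A minimal sketch, run from the same session that issued the dump, is to ask v$diag_info for the name of the session’s trace file:

```sql
-- Full path of the current session's trace file; the file name will
-- include whatever tracefile_identifier was set when the file was
-- first written to.
select  value
from    v$diag_info
where   name = 'Default Trace File'
;
```

If you run this after resetting the tracefile_identifier the name reported may no longer include the identifier, so it is easier to run it between the dump command and the reset.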

September 9, 2022

Parallel Default

Filed under: Oracle,Parallel Execution,Troubleshooting — Jonathan Lewis @ 10:25 am BST Sep 9,2022

“Why did my query go parallel?”

It’s a question that crops up from time to time, usually followed by a list of reasons why it shouldn’t have gone parallel – no hints in the query, table is not declared parallel, parallel_degree_policy is set to manual etc.

When the question appeared recently on the Oracle developer forum it turned out that the table in question was declared as “parallel (degree default)”, which prompted the OP to ask the question: “is parallel = default not equivalent to parallel = 1”.

The answer to the question is that the two options are not equivalent – but that’s not the point of this note. Here’s a little script to test the claim:

drop table t1 purge;

create table t1 pctfree 90 as select * from all_objects where rownum <= 50000;

select degree, instances from user_tables where table_name = 'T1';

explain plan for select sum(object_id) from t1;
select * from table(dbms_xplan.display);

alter table t1 parallel (degree default);
select degree, instances from user_tables where table_name = 'T1';

explain plan for select sum(object_id) from t1;
select * from table(dbms_xplan.display);


I’ve created a table in the simplest possible way, but picked a fixed number of rows (to help reproducibility) and – because parallel is usually about “big” objects – I’ve left a lot of empty space (90%) in each block.

Then I’ve checked the execution plan for a very simple query that can only do a full tablescan, with the two declarations of parallelism set.

Here are the outputs of the 4 queries I’ve run:

DEGREE                                   INSTANCES
---------------------------------------- ----------------------------------------
         1                                        1

1 row selected.


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  1275   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50000 |   244K|  1275   (2)| 00:00:01 |
---------------------------------------------------------------------------

9 rows selected.


DEGREE                                   INSTANCES
---------------------------------------- ----------------------------------------
   DEFAULT                                        1

1 row selected.



PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3110199320

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   350   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 50000 |   244K|   350   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 50000 |   244K|   350   (0)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 4 because of degree limit

17 rows selected.


Clearly “parallel default” does not have the same effect as “parallel 1”. Any time you’ve got a query unexpectedly running parallel it’s possible that some table (or index on the table) has been created with a parallel degree of default. (More commonly, someone may have rebuilt an index “parallel N” to get the job done more quickly then forgotten to alter the index back to parallel 1 – or noparallel – afterwards.)
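If you suspect this is what has happened, a quick check of the data dictionary will show any candidates. This is a sketch for the current schema only (use the dba_ views with an owner predicate to look further afield), and it filters out a degree of 0 on the assumption that such entries are not of interest here:

```sql
-- The degree and instances columns are space-padded character strings,
-- so trim() them before comparing.
select  table_name, trim(degree) degree, trim(instances) instances
from    user_tables
where   trim(degree)    not in ('0','1')
or      trim(instances) not in ('0','1')
;

select  index_name, table_name, trim(degree) degree, trim(instances) instances
from    user_indexes
where   trim(degree)    not in ('0','1')
or      trim(instances) not in ('0','1')
;

-- Putting the demo table back to serial execution
alter table t1 noparallel;
```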

The point of this note, though, is that there are some questions you should not ask until you’ve spent a few minutes thinking about how you might create a model that gives you the answer. There are several reasons for this:

  • The more you do it, the better and faster you get at modelling and understanding – and sometimes you really need to model a complex problem because you’re not allowed to show anything that looks like production in public.
  • If the simple model seems to disagree with the behaviour you see in production it may give you some clues about where to look in the production system for the source of the difference.
  • If the answer isn’t what you thought it would be you can change the question you put publicly to: “I thought Oracle would do X but it did Y; here’s how I tested, is there a flaw in the test?”

It took about 5 minutes for me to run up this demo – that might seem a bit quick but I’ve had a lot of practice (and it took a lot longer to write the note) – and it was, in this case, a waste of my time because I knew the answer; but I often run up little models before responding to questions on the forums or listservers because while I often think I know what the answer “ought” to be I do like to check before I say something that might be incorrect.

September 2, 2022

Shrinking indexes

Filed under: fragmentation,Index Rebuilds,Indexing,Infrastructure,Oracle — Jonathan Lewis @ 7:21 pm BST Sep 2,2022

If you want to do something about “wasted” space in an index what are the differences that you need to consider between the following three options (for the purposes of the article I’m ignoring “rebuild” and “rebuild online”):

alter index xxx coalesce;

alter index xxx shrink space compact;

alter index xxx shrink space;

Looking at the notes in a script I wrote a “few” years ago it seems that I haven’t looked at a comparison between the coalesce option and the shrink space options since 10.2.0.3 and I suspect things may have changed since then, so I’ve discarded the results that I had recorded (in 2012) and started again with 19.11.0.0

Background

I’ve been looking at the “deferred global index maintenance” in the last couple of weeks which is why I was toying with the idea of writing something about shrinking indexes and how it differs from coalescing them when an Oracle Forum question (needs MOS account) produced the (slightly surprising) suggestion to use coalesce – so I decided it was time to (re-)test, write and publish.

RTFM

First a few bullet points from the 19c SQL reference manual under “alter index”, or following the links from there to the “shrink clause”, or the database administration reference:

  • Specify COALESCE to instruct Oracle Database to merge the contents of index blocks where possible to free blocks for reuse.
  • Use this [shrink] clause to compact the index segments. Specifying ALTER INDEX ... SHRINK SPACE COMPACT is equivalent to specifying ALTER INDEX ... COALESCE.
    • If you specify COMPACT, then Oracle Database only defragments the segment space … The database does not readjust the high water mark and does not release the space immediately.
  • Can’t shrink space for bitmap join indexes or function-based indexes.
  • Segment shrink is an online, in-place operation. DML operations and queries can be issued during the data movement phase of segment shrink. Concurrent DML operations are blocked for a short time at the end of the shrink operation when the space is deallocated.
  • Shrink operations can be performed only on segments in locally managed tablespaces with automatic segment space management (ASSM).
  • As with other DDL operations, segment shrink causes subsequent SQL statements to be reparsed because of invalidation of cursors unless you specify the COMPACT clause.

As with many little features of Oracle it’s quite hard to pick up a complete and cohesive statement of what something does and what impact it might have. Some of the bullet points above are generic about shrinking segments, and may not be totally accurate for shrinking only an index – will it invalidate cursors, or does that happen only when you shrink a table used by the cursor, or only when you shrink an index that’s used by the cursor?

If you do read through the links you also notice that I’ve omitted several points from the generic shrink details that are not relevant for indexes (for example the requirement to enable row movement), and have only mentioned the restrictions which are explicitly referenced in the “shrink clause” for indexes.

What do we need to know?

Some of the fairly typical bits of information we might need to know about a “house-keeping” task like coalesce/shrink are:

  • How much work does it do, and of what type?
  • What exactly is the benefit we might get for the work done?
  • What side effects do we have to consider (locking, cursor invalidation etc.)?
  • What side effects might show up if the process fails in mid-stream?

In the case of coalesce/shrink for indexes, a few specific questions would be:

  • Is “shrink space compact” really equivalent to “coalesce”?
  • Are the operations “online” or only “nearly online”?
  • If shrink/coalesce is moving index entries around and moving index blocks around what happens if a session wants to insert an index entry into a leaf block that’s currently being “transferred” into another leaf block?
  • If it’s a big index that needs several minutes (or more) to shrink/coalesce, could ongoing transactions cause index leaf block splits that produce unexpected effects when Oracle tries to drop the highwater mark?
  • How big an index, and how long would the test have to take, and what degree of concurrency, and how (un)lucky would you have to be to hit a moment when something “strange” happened?

Finally – what tools would be helpful. Initially we might just look at:

  • session stats – to see what work we do
  • the dbms_space package – to check segment space usage pre and post.
  • the treedump event – to get a detailed picture of the index

Based on what we see we might feel the need to dig a little deeper with:

  • v$enqueue_stats
  • v$rollstat (rollback (undo) segment usage)
  • SQL tracing with wait states
  • Enqueue (lock) tracing
  • redo dumps
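Of the tools listed above the dbms_space check is the easiest to sketch. My own script is a wrapper that I haven’t reproduced here, so the following is only a minimal stand-in that calls dbms_space.space_usage directly for an index called T1_I1 in the current schema (the name used in the model below); it works only for segments in ASSM tablespaces.

```sql
set serveroutput on

declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;
        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;
begin
        -- Ask the ASSM bitmaps how the formatted blocks are classified
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1_I1',
                segment_type            => 'INDEX',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.put_line('Unformatted   : ' || m_unformatted_blocks);
        dbms_output.put_line('Free   0-25%  : ' || m_fs1_blocks);
        dbms_output.put_line('Free  25-50%  : ' || m_fs2_blocks);
        dbms_output.put_line('Free  50-75%  : ' || m_fs3_blocks);
        dbms_output.put_line('Free  75-100% : ' || m_fs4_blocks);
        dbms_output.put_line('Full          : ' || m_full_blocks);
end;
/
```

Running this before and after each coalesce or shrink gives a simple picture of how many blocks have moved between the “full” and “free” classifications.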

The basic model

Here’s a little script to create a model that we can use for testing. Because of the stated requirement of the shrink space command I’ll just point out that the default tablespace should be using automatic segment space management (ASSM). My tablespace is also defined to use 1MB uniform extents:

rem
rem     Script:         shrink_coalesce.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2012
rem
rem     Last tested:
rem             19.11.0.0
rem 

execute dbms_random.seed(0)

create table t1 (
        v1      varchar2(7)
);

create index t1_i1 on t1(v1);

begin
        for i in 1..1e6 loop
                insert into t1(v1) values(
                        to_char(1e6 + trunc(dbms_random.value(0,100000)))
                );
        end loop;
end;
/

commit;

column ind_id new_value m_ind_id

select  object_id ind_id
from    user_objects
where   object_name = 'T1_I1'
;

alter session set tracefile_identifier = 'creation';
alter session set events 'immediate trace name treedump level &m_ind_id';
alter system flush buffer_cache;

pause Check the tree dump and pick a leaf block to dump

-- alter system dump datafile &&datafile block &&block_id;
alter system dump datafile 36 block 5429;


prompt  ========================
prompt  Deleting 4 rows out of 5
prompt  ========================

delete  from t1 
where   mod(v1,5) != 0
;

commit;

alter session set tracefile_identifier = 'deletion';
alter session set events 'immediate trace name treedump level &m_ind_id';
alter system flush buffer_cache;

-- pause Check the tree dump and pick a leaf block to dump
-- alter system dump datafile &&datafile block &&block_id;
alter system dump datafile 36 block 5429;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1',
                cascade          => true
        );
end;
/

select
        rows_per_block,
        count(*)        block_count
from    (
        select
                /*+
                        dynamic_sampling(0)
                        index_ffs(t1,t1_i1)
                        noparallel_index(t1,t1_i1)
                */
                sys_op_lbid( &m_ind_id ,'L',t1.rowid)   as block_id,
                count(*)                                as rows_per_block
        from
                t1
        group by
                sys_op_lbid( &m_ind_id ,'L',t1.rowid)
        )
group by
        rows_per_block
order by
        rows_per_block
;

@@dbms_space_use_assm_embedded test_user t1_i1 index

Unusually (for me) I’ve created the data by inserting rows one at a time after creating the index. This is to avoid starting from a “perfect” index i.e. one where the physical ordering of the leaf blocks is closely correlated with the logical ordering of the leaf blocks, and where the leaf blocks are very well packed.

With a single session inserting rows there will be a visible pattern to the choice that Oracle makes for “the next available free block” when it needs to do a leaf block split, but with the random value insertions there won’t be a pattern in “which block just split” so when you walk the index in key order the steps from one leaf block to the next will jump fairly randomly around the segment.

The table starts at 1,000,000 rows, but ends up with about 200,000 after deletion and an index where roughly 80% of the rows in each leaf block have been deleted. So that we know what state the tests start from I’ve done a treedump of the index before and after the delete (and included a pause in the script to allow you to find a block to dump from the treedump if you want to) with the following results:

Before:
----- begin tree dump
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:222.222 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.218 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.219 avs:3835)
      leaf: 0x900153e 151000382 (2: row:209.209 avs:4025)
      leaf: 0x900058d 150996365 (3: row:230.230 avs:3626)
      leaf: 0x90013a8 150999976 (4: row:229.229 avs:3645)
      leaf: 0x9000ae1 150997729 (5: row:411.411 avs:187)
      leaf: 0x900031c 150995740 (6: row:227.227 avs:3683)
      leaf: 0x90014d3 151000275 (7: row:229.229 avs:3645)
      leaf: 0x9000aec 150997740 (8: row:226.226 avs:3702)
      leaf: 0x90014f3 151000307 (9: row:226.226 avs:3702)
      leaf: 0x9000593 150996371 (10: row:219.219 avs:3835)
      leaf: 0x9001559 151000409 (11: row:223.223 avs:3759)
      leaf: 0x9000a9d 150997661 (12: row:210.210 avs:4006)
      leaf: 0x900152e 151000366 (13: row:215.215 avs:3911)
      leaf: 0x900018a 150995338 (14: row:258.258 avs:3094)
...


After:
----- begin tree dump
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:222.47 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.44 avs:3835)
      leaf: 0x900153e 151000382 (2: row:209.43 avs:4025)
      leaf: 0x900058d 150996365 (3: row:230.44 avs:3626)
      leaf: 0x90013a8 150999976 (4: row:229.45 avs:3645)
      leaf: 0x9000ae1 150997729 (5: row:411.88 avs:187)
      leaf: 0x900031c 150995740 (6: row:227.50 avs:3683)
      leaf: 0x90014d3 151000275 (7: row:229.42 avs:3645)
      leaf: 0x9000aec 150997740 (8: row:226.46 avs:3702)
      leaf: 0x90014f3 151000307 (9: row:226.57 avs:3702)
      leaf: 0x9000593 150996371 (10: row:219.46 avs:3835)
      leaf: 0x9001559 151000409 (11: row:223.54 avs:3759)
      leaf: 0x9000a9d 150997661 (12: row:210.33 avs:4006)
      leaf: 0x900152e 151000366 (13: row:215.30 avs:3911)
      leaf: 0x900018a 150995338 (14: row:258.52 avs:3094)
...
      leaf: 0x900077f 150996863 (398: row:356.64 avs:1232)
      leaf: 0x9000d67 150998375 (399: row:327.62 avs:1783)
   branch: 0x9000e45 150998597 (0: nrow: 378, level: 1)
      leaf: 0x900047a 150996090 (-1: row:342.86 avs:1498)
      leaf: 0x9000d46 150998342 (0: row:357.60 avs:1213)
...
...
      leaf: 0x9000607 150996487 (492: row:369.80 avs:985)
      leaf: 0x9000c60 150998112 (493: row:395.70 avs:491)
   branch: 0x9000c68 150998120 (6: nrow: 503, level: 1)
      leaf: 0x90001b2 150995378 (-1: row:235.60 avs:3531)
      leaf: 0x9001323 150999843 (0: row:230.54 avs:3626)

The “before” section is just the first few lines of 3,538 and shows us that we have a root block with 8 branch blocks (numbered from -1 to +6), and the first branch block holds 401 leaf blocks (numbered from -1 to 399), and the first leaf block starts with 222 index entries (in its row directory) of which, we learn from the “after” section, 47 (i.e. roughly 20%) are still “in use” after the delete. The “after” section adds in a few extra lines from the treedump, around branch block 0 and branch block 6.
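As a rough illustration of how those leaf lines can be read, here is a minimal Python sketch (not part of the original test scripts) that pulls the numbers out of a few lines copied from the “after” treedump and reports the fraction of entries still in use. The function name `parse_leaf` and the dictionary keys are illustrative labels, and reading `avs` as “available space in bytes” is an assumption.

```python
import re

# Pattern for a treedump leaf line, e.g.
#   leaf: 0x900016d 150995309 (-1: row:222.47 avs:3778)
LEAF_RE = re.compile(
    r"leaf:\s+(0x[0-9a-f]+)\s+(\d+)\s+\((-?\d+):\s+row:(\d+)\.(\d+)\s+avs:(\d+)\)"
)

def parse_leaf(line):
    """Return a dict for one leaf line, or None if the line is not a leaf line."""
    m = LEAF_RE.search(line)
    if m is None:
        return None
    return {
        "hex_dba": m.group(1),           # block address in hex
        "dba": int(m.group(2)),          # the same address in decimal
        "position": int(m.group(3)),     # slot in the parent branch, starting at -1
        "entries": int(m.group(4)),      # entries in the row directory
        "in_use": int(m.group(5)),       # entries not marked as deleted
        "avs": int(m.group(6)),          # assumed: available space in bytes
    }

# Three lines copied from the "after" treedump above.
sample = """\
      leaf: 0x900016d 150995309 (-1: row:222.47 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.44 avs:3835)
"""

leaves = [p for p in map(parse_leaf, sample.splitlines()) if p]
for leaf in leaves:
    pct = 100.0 * leaf["in_use"] / leaf["entries"]
    print(f'{leaf["hex_dba"]}: {leaf["in_use"]}/{leaf["entries"]} in use ({pct:.1f}%)')
```

Run against a whole trace file, the same loop would give a quick check that the “roughly 20% left” pattern holds across all the leaf blocks, not just the first few.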

In passing, if I were to execute a new transaction that inserted a new index entry into the first leaf block Oracle would tidy its directory and the start of the tree dump would look like the following:

 branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:48.48 avs:7084)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)

After the initial big insert many of the leaf blocks hold around 220 rows but we can see one leaf block of the initial 16 holding 411 rows. Allowing for the 9 blocks that aren’t leaf blocks we can calculate that we should see an average of approximately 1,000,000/3,529 = 283 rows per leaf block; the variation is the result of leaf block “50/50” splits. When a leaf block is full the next attempted insert causes Oracle to attach a new leaf block to the structure and share the existing entries fairly evenly between the two blocks (although there is one special case, the so-called “90/10” split that can happen when you insert a new high value into the highest value leaf block). The shares are not exactly equal because Oracle has to insert a new pointer in the parent branch block at the same time and may be able to reduce the size of this pointer by moving the split point some way from the “fair share” 50/50 point.

Of course, there’s also some variation in the content of the leaf blocks because they tend to start refilling shortly after they’ve split, so it can be quite instructive (when your system reaches “steady state”) to produce a “histogram” of leaf contents – which is what the last SQL statement in my setup script is about, with the following results:

ROWS_PER_BLOCK BLOCK_COUNT
-------------- -----------
            24           1
            26           1
            27           1
            28           5
            29           5
            30           7
            31          11
            32          11
            33          26
            34          23
            35          28
            36          28
            37          49
            38          47
            39          43
            40          49
            41          62
            42          73
            43          81
            44          92
            45          98
            46          91
            47         117
            48         104
            49         124
            50         124
            51         117
            52         114
            53         106
            54         123
            55         109
            56         104
            57          96
            58          84
            59          70
            60          95
            61          57
            62          73
            63          77
            64          74
            65          66
            66          56
            67          52
            68          54
            69          59
            70          44
            71          56
            72          49
            73          47
            74          51
            75          27
            76          34
            77          29
            78          27
            79          25
            80          27
            81          28
            82          26
            83          16
            84          23
            85          16
            86          18
            87          11
            88          19
            89          16
            90          10
            91          11
            92           5
            93           5
            94           2
            95           3
            96           4
            97           3
            99           3
           100           4
           103           1
           107           1
           119           1
78 rows selected.

The result (because it’s randomly arriving values) is fairly close to the bell curve of the Normal distribution centred at around 50 rows. There’s a fairly long tail up to 119 rows, but that’s probably there in this case because the index state hadn’t quite reached steady state before I did the big delete.

Having dumped a leaf block I know that a completely packed leaf block could hold 420 rows, and at pctfree 10 that would mean 378 rows, and at 70% utilisation (which is what I expect with random arrival) an average of 294 rows generating an index of 3,400 leaf blocks rather than the 3,529 I got. (Again, I think the divergence from expectation is probably related to needing more time to get to steady state.)
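The arithmetic behind those figures is easy to reproduce. The following Python sketch simply restates the numbers quoted in the text (1,000,000 rows, 420 entries in a fully packed leaf block, 3,538 treedump lines of which 9 are not leaf blocks); the variable names are illustrative only.

```python
ROWS = 1_000_000          # rows inserted (from the text)
MAX_PER_LEAF = 420        # entries in a completely packed leaf block (from the block dump)
TREEDUMP_BLOCKS = 3_538   # lines in the "before" treedump
NON_LEAF = 9              # 1 root block + 8 level 1 branch blocks

# What the test actually produced.
leaf_blocks = TREEDUMP_BLOCKS - NON_LEAF            # 3,529
actual_avg = ROWS / leaf_blocks                     # about 283 rows per leaf block

# What the rules of thumb predict (integer arithmetic avoids float rounding).
at_pctfree_10 = MAX_PER_LEAF * 90 // 100            # 378 rows per leaf block
at_70_percent = MAX_PER_LEAF * 70 // 100            # 294 rows per leaf block
expected_leaf_blocks = ROWS / at_70_percent         # about 3,400 leaf blocks

print(f"actual:   {leaf_blocks:,} leaf blocks, {actual_avg:.0f} rows per block")
print(f"expected: {expected_leaf_blocks:,.0f} leaf blocks, {at_70_percent} rows per block")
```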

The final call in the script is to a stripped down version of some code I published a few years back; the relevance of the numbers when applied to indexes is described in this blog note and the numbers were as follows:

Unformatted                   :           62 /          507,904
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :           45 /          368,640
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :        3,545 /       29,040,640

PL/SQL procedure successfully completed.

Segment Total blocks:        3,712
Object Unused blocks:            0

PL/SQL procedure successfully completed.

Freespace 2 is the label given to the set of blocks that are available for use (empty) whether or not they are in the index structure. Given the pattern of work so far it’s fairly safe to assume that in this case they are “formatted but not yet attached to the index structure”.

A quick arithmetic check highlights an apparent discrepancy: 62 + 45 + 3,545 = 3,652, which is 60 blocks short of the number in the segment; but that’s okay because I have 29 uniform extents of 1MB in the segment, which means 2 space management level 1 bitmap blocks per extent plus one level 2 bitmap block, plus the segment header / level 3 bitmap block – for a total of 60 space management blocks.
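For anyone who wants to repeat the check on their own segment, here is the same accounting as a small Python sketch. The 8KB block size is inferred from the report (507,904 bytes for 62 blocks), and the breakdown of space management blocks is the one described in the paragraph above; the variable names are illustrative.

```python
# Figures copied from the space usage report.
unformatted = 62
freespace_2 = 45
full = 3_545
segment_blocks = 3_712

block_size = 507_904 // unformatted               # 8,192 bytes, inferred from the report
extents = 29                                      # uniform 1MB extents (from the text)
blocks_per_extent = (1024 * 1024) // block_size   # 128

# Space management blocks as described in the text:
# 2 level 1 bitmap blocks per extent, 1 level 2 bitmap block,
# and the segment header / level 3 bitmap block.
space_mgmt_blocks = extents * 2 + 1 + 1

accounted_for = unformatted + freespace_2 + full
shortfall = segment_blocks - accounted_for

print(f"reported blocks : {accounted_for:,}")
print(f"shortfall       : {shortfall}")
print(f"bitmap blocks   : {space_mgmt_blocks}")
```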

The thing I’m not keen on is that the space management blocks are reporting 3,545 Full blocks, when the treedump showed 3,538 blocks – where did the extra 7 come from? But I’m not going to worry about that for this blog note.

Tests and results

The following block of code shows the full set of logging and tracing that I did – though I didn’t use every single diagnostic in every single run – for each of the three options. The code in this case is wrapped around a call to coalesce:

alter session set tracefile_identifier = 'coalesce';
alter session set events 'immediate trace name treedump level &m_ind_id';

execute snap_enqueues.start_snap
execute snap_rollstats.start_snap
execute snap_my_stats.start_snap
execute snap_redo.start_snap

alter session set events 'trace[ksq] disk=medium';

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter index t1_i1 coalesce;

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set events 'trace[ksq] off';

execute snap_redo.end_snap
execute snap_my_stats.end_snap
execute snap_rollstats.end_snap
execute snap_enqueues.end_snap

alter session set events 'immediate trace name treedump level &m_ind_id';

alter session set tracefile_identifier='coalesce_redo';
alter system dump redo scn min &start_scn scn max &end_scn ;
alter session set tracefile_identifier='';

@@dbms_space_use_assm_embedded test_user t1_i1 index
@@index_histogram_embedded t1 t1_i1 &m_ind_id

Working from the top down:

  • Set an identifier to include in the trace file name.
  • take a starting treedump (which will go to that trace file)
  • take starting snapshots of
    • system level enqueue stats
    • system level rollback stats
    • my session activity stats
    • a subset of session stats relating to redo
  • enable tracing of Enqueues (locks)
  • capture the current SCN in a define variable
  • coalesce the index
  • capture the final SCN in a define variable
  • report the change in the 4 sets of stats listed above
  • save the ending treedump to the trace file
  • set a new identifier for the tracefile name
  • dump all the redo generated while the coalesce was going on to the new tracefile
  • Call a script to report the space usage for the index segment
  • Call a script to report the histogram of leaf block usage again

The starting treedump will match the “post-delete” treedump above, of course, but it’s just a convenience for each test to have its before and after treedumps in the same trace file; and the redo dump (which will include redo from every active session) is so large – about 275MB – that it’s a good idea to keep it separate from the treedumps and enqueue trace.

The histogram script is just a wrapper for the two sys_op_lbid() queries shown earlier on. The space usage script is one we’ve already met.

A test run takes only a couple of minutes – and most of the time is spent inserting 1M rows into an indexed table one at a time. (The time it took to analyze the logs, traces and dumps is much longer, and the time to summarize and write up the results is longer still!)

Here, then, are the most interesting details from the three tests. Some of the comments I make are not immediately “proved” by the results I’m showing, but the volume of data required to supply corroborative evidence would become excessive and very boring.

Coalesce

The first “big picture” item to look at after the coalesce is the space usage:

Unformatted                   :           62 /          507,904
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :        3,037 /       24,879,104
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :          553 /        4,530,176

PL/SQL procedure successfully completed.

Segment Total blocks:        3,712
Object Unused blocks:            0

The index segment is 3,712 blocks, of which 553 are “Full”, and 3,037 are in the “Freespace 2” state which, for indexes, means they are empty and available for reuse. The coalesce hasn’t released space back to the tablespace but we can’t tell from these figures whether the 553 full blocks are packed into the “bottom end” of the segment or scattered across the entire length of the segment. Or, to view it another way, the figures don’t tell us whether Oracle has been shuffling rows without completely re-arranging the block linkages or whether it’s also been moving rows so that it can reconnect leaf blocks in a way that leaves all the empty blocks above a notional highwater mark.

We can dig a little deeper by looking at the treedump:

branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
      leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)
      leaf: 0x90014d3 151000275 (0: row:377.377 avs:833)
      leaf: 0x900118c 150999436 (1: row:377.377 avs:833)
      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)
...
      leaf: 0x9000d2f 150998319 (61: row:377.377 avs:833)
      leaf: 0x9000d67 150998375 (62: row:114.114 avs:5830)
   branch: 0x9000e45 150998597 (0: nrow: 59, level: 1)
      leaf: 0x900047a 150996090 (-1: row:377.377 avs:833)
      leaf: 0x9000725 150996773 (0: row:377.377 avs:833)

...
...
      leaf: 0x9000a05 150997509 (67: row:377.377 avs:833)
      leaf: 0x900030d 150995725 (68: row:376.376 avs:852)
   branch: 0x9000c68 150998120 (6: nrow: 76, level: 1)
      leaf: 0x90001b2 150995378 (-1: row:60.60 avs:6856)
      leaf: 0x9001323 150999843 (0: row:377.377 avs:833)

The root block is still reporting the same number of level 1 branch blocks, but the branch blocks report far fewer leaf blocks each. Most of the leaf blocks report 377 index entries, but the first and last leaf blocks of each branch tend to show fewer.

I pointed out earlier on that with pctfree 10 we’d get 378 rows per leaf block if we recreated the index, but it looks like there’s a little overhead I didn’t allow for and we’ve actually got 377 from the coalesce. You’ll notice that a coalesce will actually reduce the number of index entries in a leaf block if it exceeds the limit set by pctfree (remember how the original treedump extracts showed one leaf block with 411 entries).

Coalesce does not act “across” branch blocks, which is why (a) the number of branch blocks is unchanged, and (b) the last leaf block of a branch block may have fewer rows than the typical leaf block – coalesce will not move rows from the first leaf block of the next branch.

I’ve included a few lines from around the branches numbered 0 and 6 in this extract. If you compare them with the treedump taken just after the delete you’ll see that the coalesce has copied rows back from the second (0th) leaf of branch 0 into the first (-1th) leaf, but not from the second (0th) leaf into the first (-1th) leaf of branch 6. I don’t know why this is but perhaps it’s something to do with the relative number of rows in the first and second (-1th and 0th) leaf blocks – the same behaviour showed up at the start of branch 3 where the two leaf blocks had 58 and 63 rows respectively.

Getting back to the question of whether the “Freespace 2” blocks reported by the space usage procedure are still in the structure or whether they have been unlinked – the number of leaf blocks reported per branch block is fairly convincing – the empty leaf blocks have been detached from the structure and are simply flagged as free blocks in the space management level 1 bitmap. We could do a quick check of all the branch blocks (grep “ branch” from the trace file):

branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
   branch: 0x9000e45 150998597 (0: nrow: 59, level: 1)
   branch: 0x90007d1 150996945 (1: nrow: 61, level: 1)
   branch: 0x9000e8a 150998666 (2: nrow: 66, level: 1)
   branch: 0x900043c 150996028 (3: nrow: 70, level: 1)
   branch: 0x9000e18 150998552 (4: nrow: 70, level: 1)
   branch: 0x900073d 150996797 (5: nrow: 70, level: 1)
   branch: 0x9000c68 150998120 (6: nrow: 76, level: 1)

Add up the nrow for the level 1 branches and you get 536; add 9 for the branch blocks themselves and you get 545 – and the space usage report says 553 (an unexplained error of 8 which I’ll get round to worrying about one day; I wonder if there’s any significance in how close it is to the error of 7 that we had before the coalesce).
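The addition can be scripted if you want to repeat it on a larger index. This Python sketch parses the branch lines shown above, sums the nrow of the level 1 branches and compares the result with the “Full” figure from the space usage report; the regular expression and variable names are illustrative, not something taken from the original scripts.

```python
import re

# Pattern for a treedump branch line, e.g.
#   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
BRANCH_RE = re.compile(
    r"branch:\s+0x[0-9a-f]+\s+\d+\s+\((-?\d+):\s+nrow:\s+(\d+),\s+level:\s+(\d+)\)"
)

# The branch lines extracted from the post-coalesce treedump.
branch_lines = """\
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
   branch: 0x9000e45 150998597 (0: nrow: 59, level: 1)
   branch: 0x90007d1 150996945 (1: nrow: 61, level: 1)
   branch: 0x9000e8a 150998666 (2: nrow: 66, level: 1)
   branch: 0x900043c 150996028 (3: nrow: 70, level: 1)
   branch: 0x9000e18 150998552 (4: nrow: 70, level: 1)
   branch: 0x900073d 150996797 (5: nrow: 70, level: 1)
   branch: 0x9000c68 150998120 (6: nrow: 76, level: 1)
"""

branch_blocks = 0          # every branch line is one block (root included)
level1_leaf_counts = []    # nrow of each level 1 branch = its number of leaf blocks

for line in branch_lines.splitlines():
    m = BRANCH_RE.search(line)
    if m is None:
        continue
    branch_blocks += 1
    if int(m.group(3)) == 1:
        level1_leaf_counts.append(int(m.group(2)))

leaf_blocks = sum(level1_leaf_counts)
in_structure = leaf_blocks + branch_blocks
reported_full = 553        # "Full" from the space usage report

print(f"leaf blocks        : {leaf_blocks}")
print(f"blocks in structure: {in_structure}")
print(f"unexplained        : {reported_full - in_structure}")
```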

We can learn more from the tree dump by walking the leaf blocks in order and checking their block addresses.

  • The first leaf block of the first level 1 branch block is 0x900016d before and after the coalesce.
  • The second leaf block of this branch block is 0x90014d3 after the coalesce, but that was the address of leaf block number 7 before the coalesce.
  • The third leaf block is 0x900118c after the coalesce but was leaf block number 15 before the coalesce.

The coalesce has been walking the index in order, copying rows back to earlier leaf blocks and unlinking the block it’s working on if it becomes empty. The ultimate effect of this is that the final set of index leaf blocks isn’t compacted into the smallest contiguous space possible, it’s scattered just as widely and randomly across the whole segment as it was before the coalesce.

We could go one step further to demonstrate this scattering. Extract all the lines for leaf blocks from the treedump and sort them into order. Since I’m using 1MB extents I’d like to see (nearly) 128 consecutive block addresses in order before a possible jump to a block in the next extent but here are the first few addresses when I do the experiment:

      leaf: 0x9000105 150995205 (59: row:377.377 avs:833)
      leaf: 0x9000108 150995208 (3: row:377.377 avs:833)
      leaf: 0x900010a 150995210 (52: row:377.377 avs:833)
      leaf: 0x9000115 150995221 (12: row:377.377 avs:833)
      leaf: 0x9000117 150995223 (63: row:377.377 avs:833)
      leaf: 0x900011e 150995230 (3: row:377.377 avs:833)
      leaf: 0x900011f 150995231 (53: row:377.377 avs:833)
      leaf: 0x900012b 150995243 (34: row:377.377 avs:833)
      leaf: 0x9000137 150995255 (34: row:377.377 avs:833)
      leaf: 0x900013a 150995258 (63: row:377.377 avs:833)
      leaf: 0x900013d 150995261 (43: row:377.377 avs:833)

You don’t have to be skilled at reading hex numbers to see all the gaps between the used block addresses.
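If you would rather count the gaps than eyeball them, a few lines of Python will do it. This sketch uses only the eleven addresses listed above; a gap here just means “not a leaf block in the current structure”, so some of the missing addresses could be branch or space management blocks rather than empty leaf blocks.

```python
# The first few leaf block addresses from the sorted treedump extract above.
leaf_addresses = [
    0x9000105, 0x9000108, 0x900010a, 0x9000115, 0x9000117, 0x900011e,
    0x900011f, 0x900012b, 0x9000137, 0x900013a, 0x900013d,
]

ordered = sorted(leaf_addresses)

# Distance from each leaf block to the next one in address order;
# a distance of 1 means the two blocks are physically adjacent.
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
adjacent_pairs = sum(1 for g in gaps if g == 1)

# Number of block addresses covered from the first to the last listed leaf block.
span = ordered[-1] - ordered[0] + 1

print(f"{len(ordered)} leaf blocks spread over {span} block addresses")
print(f"{adjacent_pairs} of {len(gaps)} steps are to the physically adjacent block")
```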

Coalesce – Transactions

We now know where the index has got to, so the next question is how did it get there. The snapshot showing the change in rollback statistics (v$rollstat) is revealing.

USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        1      0   0    0     0      0
   1   5   5120      0      0     5101714     1199      0   0    5     0 -28446
   2   6   6144      0      0     5278032     1245      0   0    6     0    275
   3   7   7168      0      0     5834744     1365      0   0    7     0  -1492
   4   1   8192      0      0     5944580     1378      0   0    1     0 -17281
   5   6   6144      0      0     5126248     1203      0   0    6     0    303
   6   4   4096      0      0     5076808     1189      0   0    4     0    -72
   7   6   6144      0      0     5244984     1239      0   0    6     0    127
   8   7   7168      0      0     5818394     1363      0   0    7     0    263
   9   7   7168      0      0     6017230     1401      0   0    7     0    213
  10   1   8192   8192      0     5060154     1178      0   0    1     0 -54488

My session was the only one active on the system, and it’s only a small play system so the only undo segments it has are the basic 10 that appear when you create the database (plus the one rollback segment in the SYSTEM tablespace).

The critical numbers are the writes (bytes) and gets (blocks), which tell us that our single operation has behaved as a number of individual transactions that have been starting in different undo segments.

Given the fairly even spread of bytes written it’s a good bet that we’re seeing a fairly large number of fairly small transactions. We can corroborate this by looking at the snapshot of enqueue (lock) stats (v$enqueue_statistics):

Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
CF             2           0           2           0           0 contention
XR             1           0           1           0           0 database force logging
TM             1           0           1           0           0 contention
TX         3,051           0       3,051           0           0 contention
HW            50           0          50           0           0 contention
TT            50           0          50           0           0 contention
CU            24           0          24           0           0 contention
OD             1           0           1           0           0 Serializing DDLs
JG           126           0         126           0           0 queue lock
JG            12           0          12           0           0 q mem clnup lck
JG           126           0         126           0           0 contention

The enqueue we’re interested in is the TX (transaction) enqueue – and Oracle reports more than 3,000 of them in the interval. (That’s interestingly close to the number of blocks in the index or, to be even fussier, the number of leaf blocks that have been emptied – but that might be a coincidence.)

You’ll notice, though, that there’s only 1 TM (table) lock request. Whatever else we’re doing we’re not locking and unlocking the table on every single transaction – so we need to find out what that lock is and whether it might be a threat to our application (a TM lock in mode 4, 5, or 6, held for the duration would be a disaster). And that’s why I enabled the ksq (enqueue) trace – here’s the extract from the trace file showing the acquisition of the TM lock.

2022-09-02 11:36:57.645*:ksq.c@9175:ksqgtlctx(): *** TM-000230A3-00000000-0039DED3-00000000 mode=2 flags=0x401 why=173 timeout=0 ***
2022-09-02 11:36:57.645*:ksq.c@9183:ksqgtlctx(): xcb=0x9bbeec68, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.645*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.645*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.645*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.645*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.645*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.645*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e7f7cb8, ksqlral:(nil)
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:57.645*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:2 req:0
2022-09-02 11:36:57.645*:ksq.c@9960:ksqgtlctx(): SUCCESS

The first line of the extract is where the TM lock appears, reporting an “id1” of 000230A3, which is the object_id of the table t1. Take note of the “updated ksqlral” line near the end, which gives the address of the resource element used (res:0x9e7f7cb8), because we can use this to find where the lock is released:

2022-09-02 11:36:58.735*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:58.735*:ksq.c@10501:ksqrcli_int(): returns 0

This appears in the last few lines of the ksq trace, after the appearance of several thousand (brief) TX locks that have been acquired and released. So there is a low-impact table lock held for the duration of the coalesce that is not going to stop other sessions from updating the table (and its indexes).
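Matching acquisitions to releases by hand is tedious in a trace of this size, so here is a hedged Python sketch of the idea. It assumes (as the extracts above suggest) that an “updated ksqlral” line written by ksqgtlctx() marks an acquisition and one written by ksqrcli_int() marks the release of the same resource address; the function name `lock_intervals` and the regular expression are illustrative, and the sample lines are copied from the extracts in this note.

```python
import re

# Timestamp, reporting function and resource address of an "updated ksqlral" line.
LINE_RE = re.compile(
    r"^(\S+ [^*]+)\*:ksq\.c@\d+:(\w+)\(\): .*updated ksqlral, ksqlral:\S+ res:(0x[0-9a-f]+)"
)

def lock_intervals(lines):
    """Pair each acquisition with the next release of the same resource address."""
    open_locks = {}     # resource address -> acquisition timestamp
    intervals = []      # (resource address, acquired, released)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        ts, func, res = m.groups()
        if func == "ksqgtlctx":                          # assumed: lock acquired
            open_locks[res] = ts
        elif func == "ksqrcli_int" and res in open_locks:  # assumed: lock released
            intervals.append((res, open_locks.pop(res), ts))
    return intervals

trace = """\
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
2022-09-02 11:36:57.650*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:57.650*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:58.735*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:58.769*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
"""

intervals = lock_intervals(trace.splitlines())
for res, acquired, released in intervals:
    print(f"{res}: {acquired} -> {released}")
```

With these six lines the short-lived TX lock closes first, followed by the TM lock and then the OD lock, which is the order described in the text.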

There was one other lock released after the TM lock:

2022-09-02 11:36:58.769*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
2022-09-02 11:36:58.769*:ksq.c@10501:ksqrcli_int(): returns 0

Working backwards using the resource address we find that this was an OD lock, taken immediately after the TM lock:

2022-09-02 11:36:57.645*:ksq.c@9175:ksqgtlctx(): *** OD-000230A4-00000000-0039DED3-00000000 mode=4 flags=0x10001 why=277 timeout=0 ***
2022-09-02 11:36:57.645*:ksq.c@9183:ksqgtlctx(): xcb=0x9bbeec68, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.645*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.645*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.645*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.645*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.645*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.645*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e7f0788, ksqlral:(nil)
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
2022-09-02 11:36:57.645*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:4 req:0
2022-09-02 11:36:57.645*:ksq.c@9960:ksqgtlctx(): SUCCESS

Checking v$lock_type we see that the OD lock is the “Online DDLs” lock, with the description “Lock to prevent concurrent online DDLs” and its first parameter is the object_id of the object that is the target of the DDL. The value in the trace file (000230A4) identifies the index that we are coalescing; at mode 4 the lock mode is fairly aggressive, but I’m surprised that it isn’t 6 – if we were to interpret the value the way we would for TM locks it would suggest that two sessions could coalesce the index at the same time!
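The id1 values in the trace are in hexadecimal, so they need converting before you can compare them with an object_id from the data dictionary. A minimal Python sketch of the conversion follows; `lock_ids` is a hypothetical helper name, and only the first two numeric fields of the resource name are interpreted here (as id1 and id2).

```python
def lock_ids(resource):
    """Split a ksq trace resource name into (lock type, id1, id2) with decimal ids."""
    lock_type, id1, id2, *_ = resource.split("-")
    return lock_type, int(id1, 16), int(id2, 16)

# Resource names copied from the two trace extracts above.
tm = lock_ids("TM-000230A3-00000000-0039DED3-00000000")
od = lock_ids("OD-000230A4-00000000-0039DED3-00000000")

print(tm)   # lock type, then the table's object_id in decimal
print(od)   # lock type, then the index's object_id in decimal
```

The two decimal values differ by one, which fits an index created immediately after its table.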

Apart from 50 pairs of TT/HW locks (tablespace DDL / Segment Highwater mark) due to undo segments growing and shrinking, the rest of the ksq trace was taken up by 3,051 TX locks, typically reporting their acquisition and release on adjacent lines of the trace, e.g.:

2022-09-02 11:36:57.650*:ksq.c@9100:ksqgtlctx(): ksqtgtlctx: PDB mode
2022-09-02 11:36:57.650*:ksq.c@9175:ksqgtlctx(): *** TX-0004001C-00002F83-0039DED3-00000000 mode=6 flags=0x401 why=176 timeout=0 ***
2022-09-02 11:36:57.650*:ksq.c@9183:ksqgtlctx(): xcb=0x9bd37110, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.650*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.650*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.650*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.650*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.650*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.650*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e80b098, ksqlral:(nil)
2022-09-02 11:36:57.650*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:57.650*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:6 req:0
2022-09-02 11:36:57.650*:ksq.c@9960:ksqgtlctx(): SUCCESS
2022-09-02 11:36:57.650*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:57.650*:ksq.c@10501:ksqrcli_int(): returns 0

Coalesce – Workload

We’ve examined the end result of a coalesce, and seen something of the mechanism that Oracle adopts to get to that result, but what does it cost (in terms of work done)? In many cases it’s sufficient to limit the analysis to:

  • how much I/O
  • how much CPU
  • how much undo and redo generated

The obvious I/O comes from the requirement to walk the index in leaf block order, and the dbwr will eventually have to write back every block (including the empty ones). But that I/O, and the inevitable CPU usage, is not particularly interesting; what’s more interesting (and more of a threat) is the impact of the undo and redo. This is where the snapshots of session stats and redo stats give us the information we need to know, and all I’m going to look at are the redo-related stats for the test:

Name                                                                     Value
----                                                                     -----
messages sent                                                               92
calls to kcmgcs                                                             81
calls to kcmgas                                                          6,118
calls to get snapshot scn: kcmgss                                        3,075
redo entries                                                            34,810
redo size                                                           76,475,936
redo buffer allocation retries                                              39
redo subscn max counts                                                   1,049
redo synch time                                                              3
redo synch time (usec)                                                  33,207
redo synch time overhead (usec)                                            159
redo synch time overhead count (  2ms)                                       1
redo synch writes                                                            1
redo write info find                                                         1
undo change vector size                                             55,062,672
rollback changes - undo records applied                                    287

Bearing in mind that this index started at roughly 3,600 blocks / 28MB and coalesced to roughly 560 blocks / 4.5MB I want to draw your attention to just three of the figures (“redo entries”, “redo size” and “undo change vector size”): the number and total size of redo records generated, and the volume of undo generated. 34,810 redo records, 75MB of redo, of which 55MB was due to undo.

It’s nice to see that the undo figure is consistent with the sum of the writes we saw in the snapshot of v$rollstat. But the numbers warn us that there’s a lot of work going into a coalesce – and it could have a big impact on other users.
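The consistency check is a simple sum. In this Python sketch the ten “Writes” values are copied from the v$rollstat snapshot (undo segments 1 to 10) and compared with the “undo change vector size” from the session stats; the variable names are illustrative.

```python
# "Writes" (bytes) per undo segment, USN 1 to 10, from the v$rollstat snapshot.
rollstat_writes = [
    5_101_714, 5_278_032, 5_834_744, 5_944_580, 5_126_248,
    5_076_808, 5_244_984, 5_818_394, 6_017_230, 5_060_154,
]

undo_change_vector_size = 55_062_672   # from the session stats

total_writes = sum(rollstat_writes)
difference = undo_change_vector_size - total_writes

print(f"sum of rollstat writes : {total_writes:,}")
print(f"undo change vector size: {undo_change_vector_size:,}")
print(f"difference             : {difference:,} ({100 * difference / undo_change_vector_size:.1f}%)")
```

The two figures agree to within about 1%, and the spread across the ten segments is narrow (roughly 5.1MB to 6.0MB each), which is what you would expect from a large number of small transactions rotating through the undo segments.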

My session is generating a lot of undo, and it’s cycling through every undo segment as it does so – that means other sessions that need to create read-consistent images of recently changed data blocks that are completely unrelated to my index may have to work backwards through a large number of undo blocks trying to find upper bound SCNs (check for statistics like: ‘transaction tables consistent read%’).

You’ll notice that I’ve also reported “rollback changes – undo records applied”; these are appearing because of “DML restarts” that make a statement roll back and try again the first time it triggers an undo segment extension. Luckily all my transactions are very small so each individual transaction won’t suffer much if it has to restart, but if you have a long running DML statement and I keep filling and extending undo segments (possibly shrinking other undo segments to do so) that’s going to increase your chances of finding your undo segment full and doing a huge rollback and restart of your statement. Be very careful about timing your coalesce commands.

Since I’ve dumped all the redo generated during the test run I’ll finish by showing a little analysis of the results. The trace file for this 28MB index was over 250MB so it’s not something you’d dump on a production size coalesce.

All I’m going to do is use grep to pull out the redo OP codes of every change vector in the file and show you a couple of extracts from the results. First a commonly occurring pattern:

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f685d SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:2 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:3 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x0900018a OBJ:143528 SCN:0x00000000024f67c1 SEQ:1 OP:10.11 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:4 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec97 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:2 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec98 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.34 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09001500 OBJ:143528 SCN:0x00000000024f685a SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:3 OP:5.4 ENC:0 RBL:0 FLG:0x0000

The last line is Op Code 5.4, a commit (or rollback), and I picked a batch of rows between one commit and the next, so this entire set of 20 change vectors is a single transaction taking place in the coalesce. I’ve placed gaps before every “Change #1” to show the boundaries between redo records. As you can see, my “common pattern” transaction is 11 redo records; that’s another sanity check: we saw roughly 3,000 TX enqueues, and 34,800 redo entries: 11 * 3,000 = 33,000, which is a good enough match.
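The grouping described above can be sketched in a few lines of Python. This is a minimal sketch, not the method used for the article: it assumes the input is the grep output of the "CHANGE #" lines, infers redo record boundaries from each "CHANGE #1", and treats OP:5.4 as the end of a transaction. The function name and the shortened sample are illustrative only.

```python
import re
from collections import Counter

# Matches one change vector line and captures the change number and op code.
CHANGE_RE = re.compile(r"^CHANGE #(\d+) .*\bOP:(\d+\.\d+)")

def split_transactions(lines):
    """Return a list of transactions. Each transaction is a list of redo
    records, and each redo record is a list of op codes such as '5.1'.
    A trailing, uncommitted group of records is dropped."""
    transactions, records = [], []
    for line in lines:
        m = CHANGE_RE.match(line.strip())
        if not m:
            continue
        change_no, op = int(m.group(1)), m.group(2)
        if change_no == 1 or not records:
            records.append([])      # "CHANGE #1" starts a new redo record
        records[-1].append(op)
        if op == "5.4":             # commit (or rollback) closes the transaction
            transactions.append(records)
            records = []
    return transactions

# Shortened extract: three of the redo records from the dump above.
sample = """\
CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f685d SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:3 OP:5.4 ENC:0 RBL:0 FLG:0x0000
"""

txs = split_transactions(sample.splitlines())
for tx in txs:
    vectors = sum(len(rec) for rec in tx)
    print(len(tx), "redo records,", vectors, "change vectors")
    print(Counter(op for rec in tx for op in rec))
```

Run against the whole grep output, a tally of redo records per transaction would show how often the 11-record pattern really occurs.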

Op Code 5.2 is “get next undo block”, Op Code 5.1 is “create undo record”, so I’m going to simplify the list by removing those codes (along with the closing 5.4). After also removing some of the irrelevant material from the start and end of each line, the example reduces to:

DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 
DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 
DBA:0x0900018a OBJ:143528 SCN:0x00000000024f67c1 SEQ:1 OP:10.11 
DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.39
DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.8 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.34 
DBA:0x09001500 OBJ:143528 SCN:0x00000000024f685a SEQ:1 OP:13.22 

Translating the OP Codes (and adding in a little information I have about which blocks the block addresses (DBA) correspond to), this is what the transaction does:

  • block cleanout of leaf block 0x090014d3 (4.1)
  • block cleanout of leaf block 0x0900152e (4.1)
  • lock leaf block 0x0900152e (10.6)
  • lock leaf block 0x090014d3 (10.6)
  • change the “pointer to previous” of leaf block 0x0900018a (10.11)
  • block cleanout of branch block 0x09000438 (4.1)
  • update branch block 0x09000438, delete one leaf entry (10.39)
  • create new version of leaf block 0x090014d3 (10.8)
  • create empty version of leaf block 0x0900152e (10.34)
  • update space management level 1 bitmap block (13.22)
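The filtering and translation above can be sketched as follows. This is only an illustration: the descriptions are copied from the list above, the dictionary and function names are made up for the example, and telling a leaf block from a branch block still needs outside knowledge (such as a treedump) that the code does not have.

```python
import re

# Op code descriptions as given in the text; anything else is reported as unknown.
OP_DESCRIPTIONS = {
    "4.1": "block cleanout",
    "10.6": "lock leaf block",
    "10.11": "change the 'pointer to previous' of leaf block",
    "10.39": "update branch block, delete one leaf entry",
    "10.8": "create new version of leaf block",
    "10.34": "create empty version of leaf block",
    "13.22": "update space management level 1 bitmap block",
}

# Undo-related codes removed from the list (5.4 is dropped as well,
# as in the reduced list in the text).
UNDO_OPS = {"5.1", "5.2", "5.4"}

VECTOR_RE = re.compile(r"DBA:(0x[0-9a-f]+) .*?\bOP:(\d+\.\d+)")

def describe(lines):
    """Return (dba, op code, description) for each non-undo change vector."""
    steps = []
    for line in lines:
        m = VECTOR_RE.search(line)
        if m is None or m.group(2) in UNDO_OPS:
            continue
        dba, op = m.groups()
        steps.append((dba, op, OP_DESCRIPTIONS.get(op, "unknown op code")))
    return steps

# Three change vectors taken from the dump above.
sample = """\
CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:2 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09001500 OBJ:143528 SCN:0x00000000024f685a SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000
"""

steps = describe(sample.splitlines())
for dba, op, text in steps:
    print(f"{dba}  {op:<6} {text}")
```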

So where does the huge amount of redo appear? If we looked at the 11 Redo Record Headers for the extract we could use the LEN information to point us to the critical bits:

REDO RECORD - Thread:1 RBA: 0x000391.000174dc.01c0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0028 LEN: 0x0060 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0088 LEN: 0x0164 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.0010 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.00f4 LEN: 0x00ec VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.01e0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.0048 LEN: 0x0104 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.014c LEN: 0x3a88 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174fd.01b4 LEN: 0x20bc VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.0180 LEN: 0x0064 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.01e4 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595

I’ve highlighted the two big ones – records 8 and 9, which are the ones holding the 10.8 (create new leaf) and 10.34 (make block empty). Why are they so big at 14,984 bytes and 8,380 bytes respectively?

Record 8 includes a change vector (5.1) for the undo of the replaced block which is a block image at 8,032 bytes, and a change vector for the new version of the block in a format similar to an array insert which happened to have 344 rows at this point for a size of roughly 6,500 bytes.

Record 9 includes a change vector (5.1) for the undo of the emptied block, again a block image of 8,032 bytes. But the 10.34 itself is only a few tens of bytes.
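The LEN values in the record headers are hexadecimal, so a small script makes the byte counts easier to read. The following is a minimal sketch that assumes the header format shown above; the function name is illustrative.

```python
import re

# Captures the hexadecimal LEN field of a redo record header.
HEADER_RE = re.compile(r"^REDO RECORD - .*\bLEN: (0x[0-9a-fA-F]+)")

def record_lengths(lines):
    """Return the length in bytes of each redo record header found."""
    lengths = []
    for line in lines:
        m = HEADER_RE.match(line.strip())
        if m:
            lengths.append(int(m.group(1), 16))
    return lengths

headers = """\
REDO RECORD - Thread:1 RBA: 0x000391.000174dc.01c0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0028 LEN: 0x0060 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0088 LEN: 0x0164 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.0010 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.00f4 LEN: 0x00ec VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.01e0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.0048 LEN: 0x0104 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.014c LEN: 0x3a88 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174fd.01b4 LEN: 0x20bc VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.0180 LEN: 0x0064 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.01e4 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
"""

lengths = record_lengths(headers.splitlines())
total = sum(lengths)
for n, size in enumerate(lengths, start=1):
    print(f"record {n:>2}: {size:>6,} bytes ({size / total:.1%} of the transaction)")
print(f"total: {total:,} bytes")
```

For these 11 headers the two large records account for 23,364 of the 24,904 bytes in the transaction.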

This test highlights a particularly nasty threat from coalesce and its “pairwise” clean-up. Checking the “post-delete” tree dump I can see I’ve emptied leaf block 0x0900152e by copying 30 rows back into leaf block 0x090014d3, and I can see that this is the fifth leaf block that I’ve emptied into 0x090014d3, and I can see that I’ll be doing one more pass to get that block full; and each time I do this I dump two block images, and an “array-update” redo change vector that gets bigger and bigger on each pass until it’s nearly the full 8KB. The operation generates a lot of undo and a lot of redo.

Coalesce – concurrency

As a quick test of what happens when other work is going on on the table I ran a little script to insert an extra 100 rows (without committing) into the table just after the big delete but just before the coalesce, generating random values from the same range as the original values.

The coalesce didn’t seem to take any extra time and I didn’t see any enqueue waits or buffer busy waits (though a different test of 3,000 rapid single row inserts with commits while the coalesce was running managed to get one buffer busy wait on a branch block).

The final result, though, was not very good. With 100 uncommitted inserts getting in the way the index reported 687 “full” blocks rather than the 553 that we saw originally. That’s an increase of more than one block per row inserted.

Basically when Oracle hits a block with an uncommitted change it looks as if it says: “I can’t copy those rows backwards so I’ll have to leave the current block wherever I’ve got to, skip the modified block and restart the coalesce in the next block along.” So every block with an uncommitted change could result in two extra blocks ultimately not being packed as well as they could be.
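One way to spot this effect in a treedump is to tally the leaf blocks by how much free space they report. The sketch below assumes the "leaf:" line format used in the treedump in this article. The 1,000-byte "avs" threshold for calling a block well packed is an arbitrary choice made for the example, and it is not the definition of “full” behind the 687 and 553 figures.

```python
import re

# leaf: <hex dba> <decimal dba> (<position>: row:<rows>.<rows> avs:<free bytes>)
LEAF_RE = re.compile(
    r"leaf: (0x[0-9a-f]+) \d+ \((-?\d+): row:(\d+)\.(\d+) avs:(\d+)\)"
)

def leaf_summary(lines, packed_avs=1000):
    """Count leaf blocks with at most packed_avs bytes free (packed)
    and those with more free space (sparse). Non-leaf lines are ignored."""
    packed = sparse = 0
    for line in lines:
        m = LEAF_RE.search(line)
        if not m:
            continue
        if int(m.group(5)) <= packed_avs:
            packed += 1
        else:
            sparse += 1
    return packed, sparse

# Four consecutive leaf lines from the treedump.
sample = """\
      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)
      leaf: 0x900011e 150995230 (3: row:120.120 avs:5716)
      leaf: 0x9000a0e 150997518 (4: row:49.49 avs:7065)
      leaf: 0x9001258 150999640 (5: row:377.377 avs:833)
"""

packed, sparse = leaf_summary(sample.splitlines())
print("packed:", packed, "sparse:", sparse)
```

Pairs of sparse blocks sitting between runs of packed blocks are the pattern to look for.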

The full treedump:
----- begin tree dump
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 78, level: 1)
      leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)
      leaf: 0x90014d3 151000275 (0: row:377.377 avs:833)
      leaf: 0x900118c 150999436 (1: row:377.377 avs:833)
      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)
      leaf: 0x900011e 150995230 (3: row:120.120 avs:5716)
      leaf: 0x9000a0e 150997518 (4: row:49.49 avs:7065)
      leaf: 0x9001258 150999640 (5: row:377.377 avs:833)
      leaf: 0x9000658 150996568 (6: row:377.377 avs:833)
      leaf: 0x9000b8b 150997899 (7: row:377.377 avs:833)
      leaf: 0x9000155 150995285 (8: row:377.377 avs:833)
      leaf: 0x9000ba1 150997921 (9: row:377.377 avs:833)
      leaf: 0x900063c 150996540 (10: row:377.377 avs:833)
      leaf: 0x9000d3b 150998331 (11: row:377.377 avs:833)
      leaf: 0x9000469 150996073 (12: row:114.114 avs:5830)
      leaf: 0x9000bfb 150998011 (13: row:51.51 avs:7027)
      leaf: 0x900155d 151000413 (14: row:377.377 avs:833)
      leaf: 0x9000ba2 150997922 (15: row:377.377 avs:833)
      leaf: 0x9001512 151000338 (16: row:377.377 avs:833)
      leaf: 0x9000d74 150998388 (17: row:377.377 avs:833)
      leaf: 0x90005be 150996414 (18: row:377.377 avs:833)
      leaf: 0x90005b4 150996404 (19: row:377.377 avs:833)
      leaf: 0x9000c24 150998052 (20: row:377.377 avs:833)
      leaf: 0x90001a7 150995367 (21: row:377.377 avs:833)
      leaf: 0x9001563 151000419 (22: row:377.377 avs:833)
      leaf: 0x9000a68 150997608 (23: row:377.377 avs:833)
      leaf: 0x90011a7 150999463 (24: row:377.377 avs:833)
      leaf: 0x90001b9 150995385 (25: row:327.327 avs:1783)
      leaf: 0x9000ab6 150997686 (26: row:52.52 avs:7008)
      leaf: 0x900149f 151000223 (27: row:377.377 avs:833)
      leaf: 0x900123d 150999613 (28: row:328.328 avs:1764)
      leaf: 0x900033c 150995772 (29: row:89.89 avs:6305)
      leaf: 0x9000b9d 150997917 (30: row:45.45 avs:7141)
      leaf: 0x90014ad 151000237 (31: row:377.377 avs:833)
      leaf: 0x9000d40 150998336 (32: row:377.377 avs:833)
      leaf: 0x9000166 150995302 (33: row:377.377 avs:833)
      leaf: 0x9000c70 150998128 (34: row:377.377 avs:833)
      leaf: 0x90014e9 151000297 (35: row:377.377 avs:833)
      leaf: 0x9000d32 150998322 (36: row:377.377 avs:833)
      leaf: 0x9001546 151000390 (37: row:377.377 avs:833)
      leaf: 0x900018b 150995339 (38: row:377.377 avs:833)
      leaf: 0x9000b8a 150997898 (39: row:377.377 avs:833)
      leaf: 0x9000c4c 150998092 (40: row:169.169 avs:4785)
      leaf: 0x9000a89 150997641 (41: row:69.69 avs:6685)
      leaf: 0x9000335 150995765 (42: row:306.306 avs:2182)
      leaf: 0x9000196 150995350 (43: row:51.51 avs:7027)
      leaf: 0x900134d 150999885 (44: row:377.377 avs:833)
      leaf: 0x90005a7 150996391 (45: row:243.243 avs:3379)
      leaf: 0x9000116 150995222 (46: row:54.54 avs:6970)
      leaf: 0x900130b 150999819 (47: row:377.377 avs:833)
      leaf: 0x9000604 150996484 (48: row:377.377 avs:833)
      leaf: 0x9000655 150996565 (49: row:377.377 avs:833)
      leaf: 0x9000d7d 150998397 (50: row:185.185 avs:4481)
      leaf: 0x9000958 150997336 (51: row:53.53 avs:6989)
      leaf: 0x90011aa 150999466 (52: row:377.377 avs:833)
      leaf: 0x900102a 150999082 (53: row:377.377 avs:833)
      leaf: 0x9000a6d 150997613 (54: row:377.377 avs:833)
      leaf: 0x90014f8 151000312 (55: row:377.377 avs:833)
      leaf: 0x900135d 150999901 (56: row:344.344 avs:1460)
      leaf: 0x9000a37 150997559 (57: row:45.45 avs:7141)
      leaf: 0x900122d 150999597 (58: row:341.341 avs:1517)
      leaf: 0x9000afe 150997758 (59: row:51.51 avs:7027)
      leaf: 0x9001536 151000374 (60: row:377.377 avs:833)
      leaf: 0x9000644 150996548 (61: row:113.113 avs:5849)
      leaf: 0x9000c69 150998121 (62: row:75.75 avs:6571)
      leaf: 0x9000141 150995265 (63: row:377.377 avs:833)
      leaf: 0x9000c56 150998102 (64: row:377.377 avs:833)
      leaf: 0x900059d 150996381 (65: row:377.377 avs:833)
      leaf: 0x9000c46 150998086 (66: row:377.377 avs:833)
      leaf: 0x9000af0 150997744 (67: row:377.377 avs:833)
      leaf: 0x90001bd 150995389 (68: row:377.377 avs:833)
      leaf: 0x9000baa 150997930 (69: row:377.377 avs:833)
      leaf: 0x9000c4f 150998095 (70: row:377.377 avs:833)
      leaf: 0x9000639 150996537 (71: row:377.377 avs:833)
      leaf: 0x9000c1f 150998047 (72: row:377.377 avs:833)
      leaf: 0x9000d4a 150998346 (73: row:377.377 avs:833)
      leaf: 0x900153d 151000381 (74: row:377.377 avs:833)
      leaf: 0x9000e74 150998644 (75: row:377.377 avs:833)
      leaf: 0x9000d2a 150998314 (76: row:244.244 avs:3360)
   branch: 0x9000e45 150998597 (0: nrow: 75, level: 1)
      leaf: 0x900047a 150996090 (-1: row:377.377 avs:833)
      leaf: 0x9000725 150996773 (0: row:377.377 avs:833)
      leaf: 0x90001e2 150995426 (1: row:377.377 avs:833)
      leaf: 0x9000e43 150998595 (2: row:156.156 avs:5032)
      leaf: 0x9000e32 150998578 (3: row:61.61 avs:6837)
      leaf: 0x9000258 150995544 (4: row:377.377 avs:833)
      leaf: 0x90007dd 150996957 (5: row:377.377 avs:833)
      leaf: 0x900166e 151000686 (6: row:377.377 avs:833)
      leaf: 0x90001aa 150995370 (7: row:377.377 avs:833)
      leaf: 0x9000c44 150998084 (8: row:377.377 avs:833)
      leaf: 0x90014e8 151000296 (9: row:71.71 avs:6647)
      leaf: 0x900067a 150996602 (10: row:66.66 avs:6742)
      leaf: 0x9000ba9 150997929 (11: row:71.71 avs:6647)
      leaf: 0x900033f 150995775 (12: row:80.80 avs:6476)
      leaf: 0x9000aeb 150997739 (13: row:377.377 avs:833)
      leaf: 0x90013e7 151000039 (14: row:377.377 avs:833)
      leaf: 0x900067b 150996603 (15: row:377.377 avs:833)
      leaf: 0x9000bff 150998015 (16: row:377.377 avs:833)
      leaf: 0x90001b5 150995381 (17: row:377.377 avs:833)
      leaf: 0x9000364 150995812 (18: row:167.167 avs:4823)
      leaf: 0x9001550 151000400 (19: row:42.42 avs:7198)
      leaf: 0x90005e1 150996449 (20: row:377.377 avs:833)
      leaf: 0x9000c37 150998071 (21: row:293.293 avs:2429)
      leaf: 0x900065d 150996573 (22: row:92.92 avs:6248)
      leaf: 0x9000c2d 150998061 (23: row:377.377 avs:833)
      leaf: 0x9000374 150995828 (24: row:377.377 avs:833)
      leaf: 0x900148f 151000207 (25: row:377.377 avs:833)
      leaf: 0x9000e63 150998627 (26: row:377.377 avs:833)
      leaf: 0x9000eb1 150998705 (27: row:377.377 avs:833)
      leaf: 0x9000c30 150998064 (28: row:377.377 avs:833)
      leaf: 0x9000612 150996498 (29: row:256.256 avs:3132)
      leaf: 0x9000c08 150998024 (30: row:68.68 avs:6704)
      leaf: 0x900074e 150996814 (31: row:377.377 avs:833)
      leaf: 0x9000132 150995250 (32: row:305.305 avs:2201)
      leaf: 0x9000473 150996083 (33: row:73.73 avs:6609)
      leaf: 0x9000d0a 150998282 (34: row:68.68 avs:6704)
      leaf: 0x9000755 150996821 (35: row:377.377 avs:833)
      leaf: 0x9000419 150995993 (36: row:377.377 avs:833)
      leaf: 0x9000eeb 150998763 (37: row:377.377 avs:833)
      leaf: 0x9000d04 150998276 (38: row:293.293 avs:2429)
      leaf: 0x9000d62 150998370 (39: row:83.83 avs:6419)
      leaf: 0x9000767 150996839 (40: row:377.377 avs:833)
      leaf: 0x9000323 150995747 (41: row:377.377 avs:833)
      leaf: 0x9000c1e 150998046 (42: row:282.282 avs:2638)
      leaf: 0x9000d5b 150998363 (43: row:58.58 avs:6894)
      leaf: 0x900060f 150996495 (44: row:76.76 avs:6552)
      leaf: 0x9000d5a 150998362 (45: row:65.65 avs:6761)
      leaf: 0x90001f1 150995441 (46: row:377.377 avs:833)
      leaf: 0x9000c7e 150998142 (47: row:377.377 avs:833)
      leaf: 0x900061c 150996508 (48: row:377.377 avs:833)
      leaf: 0x9000159 150995289 (49: row:149.149 avs:5165)
      leaf: 0x9000c5b 150998107 (50: row:73.73 avs:6609)
      leaf: 0x9000605 150996485 (51: row:377.377 avs:833)
      leaf: 0x9000a97 150997655 (52: row:377.377 avs:833)
      leaf: 0x9000b9a 150997914 (53: row:370.370 avs:966)
      leaf: 0x9000c0c 150998028 (54: row:61.61 avs:6837)
      leaf: 0x90001fd 150995453 (55: row:377.377 avs:833)
      leaf: 0x900075d 150996829 (56: row:377.377 avs:833)
      leaf: 0x9001078 150999160 (57: row:377.377 avs:833)
      leaf: 0x9001193 150999443 (58: row:377.377 avs:833)
      leaf: 0x9001334 150999860 (59: row:377.377 avs:833)
      leaf: 0x9000947 150997319 (60: row:377.377 avs:833)
      leaf: 0x9000a98 150997656 (61: row:377.377 avs:833)
      leaf: 0x90008cb 150997195 (62: row:377.377 avs:833)
      leaf: 0x9001020 150999072 (63: row:377.377 avs:833)
      leaf: 0x90001d0 150995408 (64: row:377.377 avs:833)
      leaf: 0x9000221 150995489 (65: row:377.377 avs:833)
      leaf: 0x9000158 150995288 (66: row:377.377 avs:833)
      leaf: 0x900035f 150995807 (67: row:377.377 avs:833)
      leaf: 0x900103a 150999098 (68: row:377.377 avs:833)
      leaf: 0x900156e 151000430 (69: row:377.377 avs:833)
      leaf: 0x9000129 150995241 (70: row:377.377 avs:833)
      leaf: 0x9000349 150995785 (71: row:377.377 avs:833)
      leaf: 0x90014bc 151000252 (72: row:377.377 avs:833)
      leaf: 0x90011ac 150999468 (73: row:329.329 avs:1745)
   branch: 0x90007d1 150996945 (1: nrow: 75, level: 1)
      leaf: 0x900017a 150995322 (-1: row:64.64 avs:6780)
      leaf: 0x9000d6f 150998383 (0: row:202.202 avs:4158)
      leaf: 0x9000472 150996082 (1: row:76.76 avs:6552)
      leaf: 0x9000e50 150998608 (2: row:377.377 avs:833)
      leaf: 0x9000c0d 150998029 (3: row:266.266 avs:2942)
      leaf: 0x9000757 150996823 (4: row:79.79 avs:6495)
      leaf: 0x9000d54 150998356 (5: row:377.377 avs:833)
      leaf: 0x9000c17 150998039 (6: row:377.377 avs:833)
      leaf: 0x9000710 150996752 (7: row:377.377 avs:833)
      leaf: 0x9000107 150995207 (8: row:377.377 avs:833)
      leaf: 0x9000d4d 150998349 (9: row:142.142 avs:5298)
      leaf: 0x9000fdf 150999007 (10: row:48.48 avs:7084)
      leaf: 0x9000250 150995536 (11: row:377.377 avs:833)
      leaf: 0x900125d 150999645 (12: row:304.304 avs:2220)
      leaf: 0x9000431 150996017 (13: row:56.56 avs:6932)
      leaf: 0x9000fda 150999002 (14: row:377.377 avs:833)
      leaf: 0x900047f 150996095 (15: row:377.377 avs:833)
      leaf: 0x90008fb 150997243 (16: row:377.377 avs:833)
      leaf: 0x9000e36 150998582 (17: row:377.377 avs:833)
      leaf: 0x9000e09 150998537 (18: row:377.377 avs:833)
      leaf: 0x90013d9 151000025 (19: row:377.377 avs:833)
      leaf: 0x90008a8 150997160 (20: row:377.377 avs:833)
      leaf: 0x9000fc1 150998977 (21: row:377.377 avs:833)
      leaf: 0x9000435 150996021 (22: row:377.377 avs:833)
      leaf: 0x90008d3 150997203 (23: row:377.377 avs:833)
      leaf: 0x9000d2d 150998317 (24: row:377.377 avs:833)
      leaf: 0x9000d1f 150998303 (25: row:377.377 avs:833)
      leaf: 0x9000ecc 150998732 (26: row:377.377 avs:833)
      leaf: 0x9000eee 150998766 (27: row:377.377 avs:833)
      leaf: 0x90001f9 150995449 (28: row:377.377 avs:833)
      leaf: 0x9000744 150996804 (29: row:377.377 avs:833)
      leaf: 0x900044e 150996046 (30: row:377.377 avs:833)
      leaf: 0x9000136 150995254 (31: row:377.377 avs:833)
      leaf: 0x90007ce 150996942 (32: row:377.377 avs:833)
      leaf: 0x9000476 150996086 (33: row:377.377 avs:833)
      leaf: 0x9000bd0 150997968 (34: row:377.377 avs:833)
      leaf: 0x9000776 150996854 (35: row:377.377 avs:833)
      leaf: 0x9000e76 150998646 (36: row:377.377 avs:833)
      leaf: 0x9000173 150995315 (37: row:377.377 avs:833)
      leaf: 0x9000e15 150998549 (38: row:110.110 avs:5906)
      leaf: 0x9000245 150995525 (39: row:59.59 avs:6875)
      leaf: 0x900102d 150999085 (40: row:377.377 avs:833)
      leaf: 0x90001f3 150995443 (41: row:377.377 avs:833)
      leaf: 0x900034a 150995786 (42: row:377.377 avs:833)
      leaf: 0x9000ede 150998750 (43: row:377.377 avs:833)
      leaf: 0x900024d 150995533 (44: row:377.377 avs:833)
      leaf: 0x90007e2 150996962 (45: row:377.377 avs:833)
      leaf: 0x9000450 150996048 (46: row:377.377 avs:833)
      leaf: 0x900078a 150996874 (47: row:377.377 avs:833)
      leaf: 0x9000e1d 150998557 (48: row:377.377 avs:833)
      leaf: 0x9000e39 150998585 (49: row:377.377 avs:833)
      leaf: 0x9000e19 150998553 (50: row:377.377 avs:833)
      leaf: 0x9000779 150996857 (51: row:377.377 avs:833)
      leaf: 0x9000c21 150998049 (52: row:377.377 avs:833)
      leaf: 0x9000d5d 150998365 (53: row:377.377 avs:833)
      leaf: 0x90004e0 150996192 (54: row:377.377 avs:833)
      leaf: 0x9000498 150996120 (55: row:377.377 avs:833)
      leaf: 0x9000ffe 150999038 (56: row:377.377 avs:833)
      leaf: 0x9000975 150997365 (57: row:377.377 avs:833)
      leaf: 0x90011d5 150999509 (58: row:274.274 avs:2790)
      leaf: 0x9000a52 150997586 (59: row:55.55 avs:6951)
      leaf: 0x9001347 150999879 (60: row:143.143 avs:5279)
      leaf: 0x900097b 150997371 (61: row:66.66 avs:6742)
      leaf: 0x9001129 150999337 (62: row:357.357 avs:1213)
      leaf: 0x90008ad 150997165 (63: row:55.55 avs:6951)
      leaf: 0x9001047 150999111 (64: row:42.42 avs:7198)
      leaf: 0x90004b3 150996147 (65: row:360.360 avs:1156)
      leaf: 0x9001126 150999334 (66: row:50.50 avs:7046)
      leaf: 0x90004d5 150996181 (67: row:59.59 avs:6875)
      leaf: 0x900114f 150999375 (68: row:377.377 avs:833)
      leaf: 0x9000ff6 150999030 (69: row:377.377 avs:833)
      leaf: 0x9000ead 150998701 (70: row:317.317 avs:1973)
      leaf: 0x90007cc 150996940 (71: row:55.55 avs:6951)
      leaf: 0x9000eea 150998762 (72: row:377.377 avs:833)
      leaf: 0x900097f 150997375 (73: row:346.346 avs:1422)
   branch: 0x9000e8a 150998666 (2: nrow: 78, level: 1)
      leaf: 0x9000142 150995266 (-1: row:60.60 avs:6856)
      leaf: 0x90014c1 151000257 (0: row:377.377 avs:833)
      leaf: 0x9001398 150999960 (1: row:377.377 avs:833)
      leaf: 0x900110d 150999309 (2: row:377.377 avs:833)
      leaf: 0x9000372 150995826 (3: row:377.377 avs:833)
      leaf: 0x9001136 150999350 (4: row:377.377 avs:833)
      leaf: 0x9000940 150997312 (5: row:367.367 avs:1023)
      leaf: 0x9000bfe 150998014 (6: row:84.84 avs:6400)
      leaf: 0x90001d1 150995409 (7: row:377.377 avs:833)
      leaf: 0x900112d 150999341 (8: row:377.377 avs:833)
      leaf: 0x9000931 150997297 (9: row:102.102 avs:6058)
      leaf: 0x9001135 150999349 (10: row:56.56 avs:6932)
      leaf: 0x9000170 150995312 (11: row:118.118 avs:5754)
      leaf: 0x90008b5 150997173 (12: row:51.51 avs:7027)
      leaf: 0x90013e2 151000034 (13: row:377.377 avs:833)
      leaf: 0x9001194 150999444 (14: row:377.377 avs:833)
      leaf: 0x9000ee5 150998757 (15: row:377.377 avs:833)
      leaf: 0x9000976 150997366 (16: row:171.171 avs:4747)
      leaf: 0x9001238 150999608 (17: row:57.57 avs:6913)
      leaf: 0x900093d 150997309 (18: row:377.377 avs:833)
      leaf: 0x9000a32 150997554 (19: row:377.377 avs:833)
      leaf: 0x90001ad 150995373 (20: row:377.377 avs:833)
      leaf: 0x9001547 151000391 (21: row:377.377 avs:833)
      leaf: 0x9000ae3 150997731 (22: row:377.377 avs:833)
      leaf: 0x9000656 150996566 (23: row:377.377 avs:833)
      leaf: 0x900138a 150999946 (24: row:377.377 avs:833)
      leaf: 0x90005b8 150996408 (25: row:377.377 avs:833)
      leaf: 0x900126e 150999662 (26: row:377.377 avs:833)
      leaf: 0x9000c74 150998132 (27: row:377.377 avs:833)
      leaf: 0x9000318 150995736 (28: row:377.377 avs:833)
      leaf: 0x9000160 150995296 (29: row:377.377 avs:833)
      leaf: 0x9001278 150999672 (30: row:377.377 avs:833)
      leaf: 0x90008f3 150997235 (31: row:375.375 avs:871)
      leaf: 0x900101c 150999068 (32: row:47.47 avs:7103)
      leaf: 0x9000921 150997281 (33: row:52.52 avs:7008)
      leaf: 0x9001123 150999331 (34: row:377.377 avs:833)
      leaf: 0x9000138 150995256 (35: row:377.377 avs:833)
      leaf: 0x9000345 150995781 (36: row:377.377 avs:833)
      leaf: 0x90001a4 150995364 (37: row:377.377 avs:833)
      leaf: 0x9000304 150995716 (38: row:377.377 avs:833)
      leaf: 0x900125a 150999642 (39: row:377.377 avs:833)
      leaf: 0x90014ff 151000319 (40: row:377.377 avs:833)
      leaf: 0x9000fa5 150998949 (41: row:377.377 avs:833)
      leaf: 0x9000f9a 150998938 (42: row:377.377 avs:833)
      leaf: 0x900091e 150997278 (43: row:377.377 avs:833)
      leaf: 0x9001153 150999379 (44: row:377.377 avs:833)
      leaf: 0x90004c1 150996161 (45: row:377.377 avs:833)
      leaf: 0x9000fc2 150998978 (46: row:335.335 avs:1631)
      leaf: 0x9000220 150995488 (47: row:69.69 avs:6685)
      leaf: 0x9000fd8 150999000 (48: row:377.377 avs:833)
      leaf: 0x9000168 150995304 (49: row:377.377 avs:833)
      leaf: 0x9001105 150999301 (50: row:377.377 avs:833)
      leaf: 0x9001368 150999912 (51: row:377.377 avs:833)
      leaf: 0x9000358 150995800 (52: row:377.377 avs:833)
      leaf: 0x9001652 151000658 (53: row:82.82 avs:6438)
      leaf: 0x9000125 150995237 (54: row:45.45 avs:7141)
      leaf: 0x90014a9 151000233 (55: row:377.377 avs:833)
      leaf: 0x9000bfa 150998010 (56: row:377.377 avs:833)
      leaf: 0x9000bd7 150997975 (57: row:377.377 avs:833)
      leaf: 0x9000ad6 150997718 (58: row:377.377 avs:833)
      leaf: 0x9000884 150997124 (59: row:377.377 avs:833)
      leaf: 0x9000fcc 150998988 (60: row:377.377 avs:833)
      leaf: 0x90011da 150999514 (61: row:377.377 avs:833)
      leaf: 0x90008eb 150997227 (62: row:377.377 avs:833)
      leaf: 0x9001391 150999953 (63: row:377.377 avs:833)
      leaf: 0x9001104 150999300 (64: row:377.377 avs:833)
      leaf: 0x900043a 150996026 (65: row:377.377 avs:833)
      leaf: 0x9000ebf 150998719 (66: row:215.215 avs:3911)
      leaf: 0x90007cf 150996943 (67: row:68.68 avs:6704)
      leaf: 0x9000e83 150998659 (68: row:377.377 avs:833)
      leaf: 0x90007af 150996911 (69: row:377.377 avs:833)
      leaf: 0x900127a 150999674 (70: row:377.377 avs:833)
      leaf: 0x900166d 151000685 (71: row:377.377 avs:833)
      leaf: 0x90014c9 151000265 (72: row:377.377 avs:833)
      leaf: 0x9000bb4 150997940 (73: row:377.377 avs:833)
      leaf: 0x9000616 150996502 (74: row:377.377 avs:833)
      leaf: 0x90004ff 150996223 (75: row:377.377 avs:833)
      leaf: 0x9001337 150999863 (76: row:283.283 avs:2619)
   branch: 0x900043c 150996028 (3: nrow: 85, level: 1)
      leaf: 0x9000171 150995313 (-1: row:377.377 avs:833)
      leaf: 0x900120f 150999567 (0: row:351.351 avs:1327)
      leaf: 0x9000424 150996004 (1: row:47.47 avs:7103)
      leaf: 0x9000ebc 150998716 (2: row:52.52 avs:7008)
      leaf: 0x9000791 150996881 (3: row:326.326 avs:1802)
      leaf: 0x9000439 150996025 (4: row:71.71 avs:6647)
      leaf: 0x9000e0c 150998540 (5: row:55.55 avs:6951)
      leaf: 0x90007c8 150996936 (6: row:377.377 avs:833)
      leaf: 0x90008b1 150997169 (7: row:377.377 avs:833)
      leaf: 0x9000445 150996037 (8: row:377.377 avs:833)
      leaf: 0x9000777 150996855 (9: row:377.377 avs:833)
      leaf: 0x90001d7 150995415 (10: row:377.377 avs:833)
      leaf: 0x90013ee 151000046 (11: row:377.377 avs:833)
      leaf: 0x900134f 150999887 (12: row:377.377 avs:833)
      leaf: 0x9000e68 150998632 (13: row:377.377 avs:833)
      leaf: 0x9000f91 150998929 (14: row:377.377 avs:833)
      leaf: 0x9001112 150999314 (15: row:317.317 avs:1973)
      leaf: 0x900023f 150995519 (16: row:62.62 avs:6818)
      leaf: 0x9000e85 150998661 (17: row:203.203 avs:4139)
      leaf: 0x90004f3 150996211 (18: row:61.61 avs:6837)
      leaf: 0x9000ee9 150998761 (19: row:377.377 avs:833)
      leaf: 0x900041a 150995994 (20: row:377.377 avs:833)
      leaf: 0x9000724 150996772 (21: row:124.124 avs:5640)
      leaf: 0x9000418 150995992 (22: row:59.59 avs:6875)
      leaf: 0x9000ebe 150998718 (23: row:377.377 avs:833)
      leaf: 0x9000d5f 150998367 (24: row:98.98 avs:6134)
      leaf: 0x9000460 150996064 (25: row:75.75 avs:6571)
      leaf: 0x9000c54 150998100 (26: row:377.377 avs:833)
      leaf: 0x9000f94 150998932 (27: row:377.377 avs:833)
      leaf: 0x9000d36 150998326 (28: row:377.377 avs:833)
      leaf: 0x9000e84 150998660 (29: row:377.377 avs:833)
      leaf: 0x9000ee6 150998758 (30: row:377.377 avs:833)
      leaf: 0x900042e 150996014 (31: row:377.377 avs:833)
      leaf: 0x900073a 150996794 (32: row:377.377 avs:833)
      leaf: 0x90011bc 150999484 (33: row:377.377 avs:833)
      leaf: 0x900020d 150995469 (34: row:377.377 avs:833)
      leaf: 0x9001356 150999894 (35: row:377.377 avs:833)
      leaf: 0x9000e8e 150998670 (36: row:377.377 avs:833)
      leaf: 0x900059c 150996380 (37: row:377.377 avs:833)
      leaf: 0x90004de 150996190 (38: row:377.377 avs:833)
      leaf: 0x9000fed 150999021 (39: row:377.377 avs:833)
      leaf: 0x9000ff7 150999031 (40: row:377.377 avs:833)
      leaf: 0x9000237 150995511 (41: row:377.377 avs:833)
      leaf: 0x9000f8b 150998923 (42: row:318.318 avs:1954)
      leaf: 0x9000494 150996116 (43: row:48.48 avs:7084)
      leaf: 0x90011bf 150999487 (44: row:377.377 avs:833)
      leaf: 0x9001249 150999625 (45: row:128.128 avs:5564)
      leaf: 0x900104b 150999115 (46: row:60.60 avs:6856)
      leaf: 0x90001e3 150995427 (47: row:207.207 avs:4063)
      leaf: 0x9000c18 150998040 (48: row:63.63 avs:6799)
      leaf: 0x900043b 150996027 (49: row:377.377 avs:833)
      leaf: 0x90011de 150999518 (50: row:377.377 avs:833)
      leaf: 0x90007f6 150996982 (51: row:377.377 avs:833)
      leaf: 0x900027c 150995580 (52: row:377.377 avs:833)
      leaf: 0x90001db 150995419 (53: row:377.377 avs:833)
      leaf: 0x9000959 150997337 (54: row:346.346 avs:1422)
      leaf: 0x9000417 150995991 (55: row:50.50 avs:7046)
      leaf: 0x9001350 150999888 (56: row:377.377 avs:833)
      leaf: 0x90004b6 150996150 (57: row:377.377 avs:833)
      leaf: 0x900048e 150996110 (58: row:377.377 avs:833)
      leaf: 0x900049f 150996127 (59: row:377.377 avs:833)
      leaf: 0x90004d8 150996184 (60: row:377.377 avs:833)
      leaf: 0x9000d09 150998281 (61: row:377.377 avs:833)
      leaf: 0x9000ee8 150998760 (62: row:377.377 avs:833)
      leaf: 0x900116c 150999404 (63: row:377.377 avs:833)
      leaf: 0x9000ef0 150998768 (64: row:377.377 avs:833)
      leaf: 0x9000e6c 150998636 (65: row:377.377 avs:833)
      leaf: 0x9000e1b 150998555 (66: row:377.377 avs:833)
      leaf: 0x900079f 150996895 (67: row:377.377 avs:833)
      leaf: 0x900042d 150996013 (68: row:377.377 avs:833)
      leaf: 0x90007b7 150996919 (69: row:377.377 avs:833)
      leaf: 0x900103b 150999099 (70: row:377.377 avs:833)
      leaf: 0x900017e 150995326 (71: row:377.377 avs:833)
      leaf: 0x9000978 150997368 (72: row:377.377 avs:833)
      leaf: 0x9001029 150999081 (73: row:302.302 avs:2258)
      leaf: 0x90007f1 150996977 (74: row:68.68 avs:6704)
      leaf: 0x9000e10 150998544 (75: row:377.377 avs:833)
      leaf: 0x90007d7 150996951 (76: row:377.377 avs:833)
      leaf: 0x9000117 150995223 (77: row:377.377 avs:833)
      leaf: 0x900072f 150996783 (78: row:377.377 avs:833)
      leaf: 0x9000415 150995989 (79: row:245.245 avs:3341)
      leaf: 0x90001f6 150995446 (80: row:51.51 avs:7027)
      leaf: 0x9000e25 150998565 (81: row:377.377 avs:833)
      leaf: 0x900026b 150995563 (82: row:377.377 avs:833)
      leaf: 0x900072b 150996779 (83: row:182.182 avs:4538)
   branch: 0x9000e18 150998552 (4: nrow: 90, level: 1)
      leaf: 0x900017b 150995323 (-1: row:361.361 avs:1137)
      leaf: 0x90007fc 150996988 (0: row:57.57 avs:6913)
      leaf: 0x9000fb2 150998962 (1: row:67.67 avs:6723)
      leaf: 0x9000272 150995570 (2: row:377.377 avs:833)
      leaf: 0x900075a 150996826 (3: row:135.135 avs:5431)
      leaf: 0x90001d2 150995410 (4: row:87.87 avs:6343)
      leaf: 0x9000c6c 150998124 (5: row:193.193 avs:4329)
      leaf: 0x9000327 150995751 (6: row:93.93 avs:6229)
      leaf: 0x9000bb5 150997941 (7: row:377.377 avs:833)
      leaf: 0x900063e 150996542 (8: row:226.226 avs:3702)
      leaf: 0x9000a93 150997651 (9: row:44.44 avs:7160)
      leaf: 0x90014be 151000254 (10: row:377.377 avs:833)
      leaf: 0x9000e2e 150998574 (11: row:377.377 avs:833)
      leaf: 0x9001074 150999156 (12: row:224.224 avs:3740)
      leaf: 0x9000eb6 150998710 (13: row:51.51 avs:7027)
      leaf: 0x9000788 150996872 (14: row:377.377 avs:833)
      leaf: 0x9000428 150996008 (15: row:118.118 avs:5754)
      leaf: 0x9000e28 150998568 (16: row:64.64 avs:6780)
      leaf: 0x90007da 150996954 (17: row:143.143 avs:5279)
      leaf: 0x900025f 150995551 (18: row:54.54 avs:6970)
      leaf: 0x9000d03 150998275 (19: row:354.354 avs:1270)
      leaf: 0x9000e2f 150998575 (20: row:64.64 avs:6780)
      leaf: 0x900017d 150995325 (21: row:64.64 avs:6780)
      leaf: 0x9000d45 150998341 (22: row:377.377 avs:833)
      leaf: 0x9000e31 150998577 (23: row:239.239 avs:3455)
      leaf: 0x9000e05 150998533 (24: row:63.63 avs:6799)
      leaf: 0x9000458 150996056 (25: row:377.377 avs:833)
      leaf: 0x90007c5 150996933 (26: row:377.377 avs:833)
      leaf: 0x900025a 150995546 (27: row:377.377 avs:833)
      leaf: 0x9000c29 150998057 (28: row:377.377 avs:833)
      leaf: 0x9001149 150999369 (29: row:377.377 avs:833)
      leaf: 0x9000732 150996786 (30: row:377.377 avs:833)
      leaf: 0x9000e8d 150998669 (31: row:377.377 avs:833)
      leaf: 0x9000264 150995556 (32: row:377.377 avs:833)
      leaf: 0x90007ec 150996972 (33: row:377.377 avs:833)
      leaf: 0x9001042 150999106 (34: row:377.377 avs:833)
      leaf: 0x9001131 150999345 (35: row:377.377 avs:833)
      leaf: 0x9001125 150999333 (36: row:377.377 avs:833)
      leaf: 0x9000ff9 150999033 (37: row:120.120 avs:5716)
      leaf: 0x9000784 150996868 (38: row:57.57 avs:6913)
      leaf: 0x9000ec6 150998726 (39: row:377.377 avs:833)
      leaf: 0x9000924 150997284 (40: row:377.377 avs:833)
      leaf: 0x900095f 150997343 (41: row:377.377 avs:833)
      leaf: 0x90008cc 150997196 (42: row:377.377 avs:833)
      leaf: 0x90008b3 150997171 (43: row:377.377 avs:833)
      leaf: 0x9000172 150995314 (44: row:377.377 avs:833)
      leaf: 0x90011e5 150999525 (45: row:377.377 avs:833)
      leaf: 0x9000ace 150997710 (46: row:377.377 avs:833)
      leaf: 0x90011af 150999471 (47: row:377.377 avs:833)
      leaf: 0x90005ed 150996461 (48: row:377.377 avs:833)
      leaf: 0x90004e4 150996196 (49: row:377.377 avs:833)
      leaf: 0x9000795 150996885 (50: row:377.377 avs:833)
      leaf: 0x900136f 150999919 (51: row:377.377 avs:833)
      leaf: 0x9001338 150999864 (52: row:377.377 avs:833)
      leaf: 0x9000fce 150998990 (53: row:377.377 avs:833)
      leaf: 0x900127e 150999678 (54: row:377.377 avs:833)
      leaf: 0x90013bc 150999996 (55: row:377.377 avs:833)
      leaf: 0x9001053 150999123 (56: row:130.130 avs:5526)
      leaf: 0x900121d 150999581 (57: row:51.51 avs:7027)
      leaf: 0x90004a6 150996134 (58: row:377.377 avs:833)
      leaf: 0x90004d4 150996180 (59: row:377.377 avs:833)
      leaf: 0x90011be 150999486 (60: row:377.377 avs:833)
      leaf: 0x9001160 150999392 (61: row:377.377 avs:833)
      leaf: 0x900072c 150996780 (62: row:377.377 avs:833)
      leaf: 0x9000d6a 150998378 (63: row:377.377 avs:833)
      leaf: 0x900075e 150996830 (64: row:377.377 avs:833)
      leaf: 0x900047d 150996093 (65: row:377.377 avs:833)
      leaf: 0x9000d5c 150998364 (66: row:377.377 avs:833)
      leaf: 0x900074c 150996812 (67: row:377.377 avs:833)
      leaf: 0x9000449 150996041 (68: row:377.377 avs:833)
      leaf: 0x9000e90 150998672 (69: row:377.377 avs:833)
      leaf: 0x900027b 150995579 (70: row:377.377 avs:833)
      leaf: 0x9001012 150999058 (71: row:377.377 avs:833)
      leaf: 0x9000ec7 150998727 (72: row:377.377 avs:833)
      leaf: 0x900104e 150999118 (73: row:377.377 avs:833)
      leaf: 0x9000709 150996745 (74: row:377.377 avs:833)
      leaf: 0x900027a 150995578 (75: row:377.377 avs:833)
      leaf: 0x90007bb 150996923 (76: row:81.81 avs:6457)
      leaf: 0x9000f99 150998937 (77: row:47.47 avs:7103)
      leaf: 0x90001ea 150995434 (78: row:377.377 avs:833)
      leaf: 0x900072a 150996778 (79: row:377.377 avs:833)
      leaf: 0x9000429 150996009 (80: row:284.284 avs:2600)
      leaf: 0x9000e96 150998678 (81: row:52.52 avs:7008)
      leaf: 0x900071a 150996762 (82: row:377.377 avs:833)
      leaf: 0x9000273 150995571 (83: row:377.377 avs:833)
      leaf: 0x9000749 150996809 (84: row:148.148 avs:5184)
      leaf: 0x90001da 150995418 (85: row:65.65 avs:6761)
      leaf: 0x9000ef5 150998773 (86: row:377.377 avs:833)
      leaf: 0x9000257 150995543 (87: row:377.377 avs:833)
      leaf: 0x90007e6 150996966 (88: row:133.133 avs:5469)
   branch: 0x900073d 150996797 (5: nrow: 88, level: 1)
      leaf: 0x9000143 150995267 (-1: row:377.377 avs:833)
      leaf: 0x9000ef7 150998775 (0: row:377.377 avs:833)
      leaf: 0x9000a2c 150997548 (1: row:377.377 avs:833)
      leaf: 0x9000356 150995798 (2: row:377.377 avs:833)
      leaf: 0x9000108 150995208 (3: row:377.377 avs:833)
      leaf: 0x9001035 150999093 (4: row:377.377 avs:833)
      leaf: 0x9001033 150999091 (5: row:199.199 avs:4215)
      leaf: 0x90001c9 150995401 (6: row:51.51 avs:7027)
      leaf: 0x900155a 151000410 (7: row:377.377 avs:833)
      leaf: 0x90008bc 150997180 (8: row:322.322 avs:1878)
      leaf: 0x9000174 150995316 (9: row:68.68 avs:6704)
      leaf: 0x900101d 150999069 (10: row:377.377 avs:833)
      leaf: 0x9001389 150999945 (11: row:377.377 avs:833)
      leaf: 0x90011b0 150999472 (12: row:377.377 avs:833)
      leaf: 0x90013b5 150999989 (13: row:377.377 avs:833)
      leaf: 0x9000139 150995257 (14: row:377.377 avs:833)
      leaf: 0x9000a9c 150997660 (15: row:377.377 avs:833)
      leaf: 0x9000652 150996562 (16: row:377.377 avs:833)
      leaf: 0x9000bdc 150997980 (17: row:377.377 avs:833)
      leaf: 0x90011a6 150999462 (18: row:377.377 avs:833)
      leaf: 0x9001256 150999638 (19: row:377.377 avs:833)
      leaf: 0x9001535 151000373 (20: row:377.377 avs:833)
      leaf: 0x9001358 150999896 (21: row:377.377 avs:833)
      leaf: 0x9000590 150996368 (22: row:329.329 avs:1745)
      leaf: 0x900154d 151000397 (23: row:48.48 avs:7084)
      leaf: 0x90005b9 150996409 (24: row:377.377 avs:833)
      leaf: 0x9001355 150999893 (25: row:377.377 avs:833)
      leaf: 0x900105d 150999133 (26: row:377.377 avs:833)
      leaf: 0x90005c3 150996419 (27: row:377.377 avs:833)
      leaf: 0x9000585 150996357 (28: row:377.377 avs:833)
      leaf: 0x9000485 150996101 (29: row:369.369 avs:985)
      leaf: 0x900091a 150997274 (30: row:48.48 avs:7084)
      leaf: 0x9001113 150999315 (31: row:377.377 avs:833)
      leaf: 0x90005c6 150996422 (32: row:377.377 avs:833)
      leaf: 0x9000486 150996102 (33: row:377.377 avs:833)
      leaf: 0x9000ab0 150997680 (34: row:377.377 avs:833)
      leaf: 0x9000abc 150997692 (35: row:377.377 avs:833)
      leaf: 0x90013ec 151000044 (36: row:338.338 avs:1574)
      leaf: 0x900058a 150996362 (37: row:72.72 avs:6628)
      leaf: 0x9000bca 150997962 (38: row:54.54 avs:6970)
      leaf: 0x90014d0 151000272 (39: row:377.377 avs:833)
      leaf: 0x900123f 150999615 (40: row:60.60 avs:6856)
      leaf: 0x9000377 150995831 (41: row:44.44 avs:7160)
      leaf: 0x9001307 150999815 (42: row:377.377 avs:833)
      leaf: 0x9001062 150999138 (43: row:377.377 avs:833)
      leaf: 0x900034c 150995788 (44: row:377.377 avs:833)
      leaf: 0x900019d 150995357 (45: row:377.377 avs:833)
      leaf: 0x9001392 150999954 (46: row:377.377 avs:833)
      leaf: 0x9000c6a 150998122 (47: row:377.377 avs:833)
      leaf: 0x9000a03 150997507 (48: row:377.377 avs:833)
      leaf: 0x9000ac3 150997699 (49: row:377.377 avs:833)
      leaf: 0x9000c3f 150998079 (50: row:142.142 avs:5298)
      leaf: 0x9001566 151000422 (51: row:61.61 avs:6837)
      leaf: 0x9000a82 150997634 (52: row:377.377 avs:833)
      leaf: 0x9000355 150995797 (53: row:377.377 avs:833)
      leaf: 0x900130c 150999820 (54: row:64.64 avs:6780)
      leaf: 0x9000186 150995334 (55: row:58.58 avs:6894)
      leaf: 0x9000c76 150998134 (56: row:168.168 avs:4804)
      leaf: 0x9000e56 150998614 (57: row:48.48 avs:7084)
      leaf: 0x9000471 150996081 (58: row:377.377 avs:833)
      leaf: 0x9000a4e 150997582 (59: row:321.321 avs:1897)
      leaf: 0x900127c 150999676 (60: row:62.62 avs:6818)
      leaf: 0x9000a19 150997529 (61: row:40.40 avs:7236)
      leaf: 0x90014d8 151000280 (62: row:377.377 avs:833)
      leaf: 0x90005ce 150996430 (63: row:377.377 avs:833)
      leaf: 0x9000582 150996354 (64: row:377.377 avs:833)
      leaf: 0x900119f 150999455 (65: row:377.377 avs:833)
      leaf: 0x900132d 150999853 (66: row:377.377 avs:833)
      leaf: 0x9000a77 150997623 (67: row:183.183 avs:4519)
      leaf: 0x9000ac2 150997698 (68: row:34.34 avs:7350)
      leaf: 0x900137f 150999935 (69: row:331.331 avs:1707)
      leaf: 0x9000fbb 150998971 (70: row:51.51 avs:7027)
      leaf: 0x90008f6 150997238 (71: row:61.61 avs:6837)
      leaf: 0x9001026 150999078 (72: row:377.377 avs:833)
      leaf: 0x9001230 150999600 (73: row:377.377 avs:833)
      leaf: 0x9001224 150999588 (74: row:377.377 avs:833)
      leaf: 0x900103f 150999103 (75: row:377.377 avs:833)
      leaf: 0x90011c1 150999489 (76: row:377.377 avs:833)
      leaf: 0x90005f3 150996467 (77: row:377.377 avs:833)
      leaf: 0x90005fc 150996476 (78: row:377.377 avs:833)
      leaf: 0x90004ac 150996140 (79: row:377.377 avs:833)
      leaf: 0x9000bb0 150997936 (80: row:377.377 avs:833)
      leaf: 0x9000972 150997362 (81: row:377.377 avs:833)
      leaf: 0x9000adf 150997727 (82: row:377.377 avs:833)
      leaf: 0x9000ab3 150997683 (83: row:377.377 avs:833)
      leaf: 0x9000bd6 150997974 (84: row:377.377 avs:833)
      leaf: 0x9000650 150996560 (85: row:377.377 avs:833)
      leaf: 0x9000607 150996487 (86: row:155.155 avs:5051)
   branch: 0x9000c68 150998120 (6: nrow: 101, level: 1)
      leaf: 0x90001b2 150995378 (-1: row:362.362 avs:1118)
      leaf: 0x9000329 150995753 (0: row:77.77 avs:6533)
      leaf: 0x9000bd5 150997973 (1: row:377.377 avs:833)
      leaf: 0x900066d 150996589 (2: row:377.377 avs:833)
      leaf: 0x90005c5 150996421 (3: row:377.377 avs:833)
      leaf: 0x9000497 150996119 (4: row:163.163 avs:4899)
      leaf: 0x9001188 150999432 (5: row:54.54 avs:6970)
      leaf: 0x9000350 150995792 (6: row:377.377 avs:833)
      leaf: 0x900011a 150995226 (7: row:377.377 avs:833)
      leaf: 0x900153c 151000380 (8: row:377.377 avs:833)
      leaf: 0x900062e 150996526 (9: row:377.377 avs:833)
      leaf: 0x900064f 150996559 (10: row:377.377 avs:833)
      leaf: 0x90005e7 150996455 (11: row:377.377 avs:833)
      leaf: 0x9000482 150996098 (12: row:377.377 avs:833)
      leaf: 0x900066f 150996591 (13: row:377.377 avs:833)
      leaf: 0x9000bf1 150998001 (14: row:377.377 avs:833)
      leaf: 0x900131e 150999838 (15: row:377.377 avs:833)
      leaf: 0x90004a3 150996131 (16: row:377.377 avs:833)
      leaf: 0x90004a8 150996136 (17: row:377.377 avs:833)
      leaf: 0x9000a5b 150997595 (18: row:377.377 avs:833)
      leaf: 0x90008ca 150997194 (19: row:377.377 avs:833)
      leaf: 0x9000224 150995492 (20: row:377.377 avs:833)
      leaf: 0x90007bf 150996927 (21: row:281.281 avs:2657)
      leaf: 0x900124e 150999630 (22: row:47.47 avs:7103)
      leaf: 0x90005af 150996399 (23: row:181.181 avs:4557)
      leaf: 0x9001388 150999944 (24: row:54.54 avs:6970)
      leaf: 0x9000a36 150997558 (25: row:377.377 avs:833)
      leaf: 0x9000e65 150998629 (26: row:377.377 avs:833)
      leaf: 0x9000e67 150998631 (27: row:377.377 avs:833)
      leaf: 0x90011a5 150999461 (28: row:377.377 avs:833)
      leaf: 0x9001005 150999045 (29: row:377.377 avs:833)
      leaf: 0x9000455 150996053 (30: row:377.377 avs:833)
      leaf: 0x9001376 150999926 (31: row:377.377 avs:833)
      leaf: 0x90008a6 150997158 (32: row:377.377 avs:833)
      leaf: 0x900095b 150997339 (33: row:377.377 avs:833)
      leaf: 0x9001060 150999136 (34: row:377.377 avs:833)
      leaf: 0x9001138 150999352 (35: row:377.377 avs:833)
      leaf: 0x90007dc 150996956 (36: row:292.292 avs:2448)
      leaf: 0x9000eff 150998783 (37: row:69.69 avs:6685)
      leaf: 0x900040b 150995979 (38: row:377.377 avs:833)
      leaf: 0x90005f1 150996465 (39: row:377.377 avs:833)
      leaf: 0x9001127 150999335 (40: row:320.320 avs:1916)
      leaf: 0x9000af2 150997746 (41: row:74.74 avs:6590)
      leaf: 0x9000588 150996360 (42: row:35.35 avs:7331)
      leaf: 0x90014a4 151000228 (43: row:377.377 avs:833)
      leaf: 0x9000ae8 150997736 (44: row:377.377 avs:833)
      leaf: 0x9001164 150999396 (45: row:377.377 avs:833)
      leaf: 0x9001155 150999381 (46: row:377.377 avs:833)
      leaf: 0x90004ad 150996141 (47: row:179.179 avs:4595)
      leaf: 0x900035e 150995806 (48: row:37.37 avs:7293)
      leaf: 0x900116f 150999407 (49: row:139.139 avs:5355)
      leaf: 0x90005a5 150996389 (50: row:45.45 avs:7141)
      leaf: 0x90014db 151000283 (51: row:377.377 avs:833)
      leaf: 0x9001176 150999414 (52: row:377.377 avs:833)
      leaf: 0x90004da 150996186 (53: row:377.377 avs:833)
      leaf: 0x90004e6 150996198 (54: row:377.377 avs:833)
      leaf: 0x9000ee7 150998759 (55: row:377.377 avs:833)
      leaf: 0x9000e5c 150998620 (56: row:377.377 avs:833)
      leaf: 0x900021a 150995482 (57: row:377.377 avs:833)
      leaf: 0x90008a9 150997161 (58: row:340.340 avs:1536)
      leaf: 0x90004b1 150996145 (59: row:64.64 avs:6780)
      leaf: 0x90011f8 150999544 (60: row:111.111 avs:5887)
      leaf: 0x900157e 151000446 (61: row:42.42 avs:7198)
      leaf: 0x900021f 150995487 (62: row:377.377 avs:833)
      leaf: 0x9001270 150999664 (63: row:230.230 avs:3626)
      leaf: 0x900117e 150999422 (64: row:58.58 avs:6894)
      leaf: 0x900048a 150996106 (65: row:377.377 avs:833)
      leaf: 0x90004be 150996158 (66: row:377.377 avs:833)
      leaf: 0x90005f8 150996472 (67: row:291.291 avs:2467)
      leaf: 0x9000aaf 150997679 (68: row:72.72 avs:6628)
      leaf: 0x9000595 150996373 (69: row:377.377 avs:833)
      leaf: 0x90014ab 151000235 (70: row:163.163 avs:4899)
      leaf: 0x9000344 150995780 (71: row:54.54 avs:6970)
      leaf: 0x90013ef 151000047 (72: row:377.377 avs:833)
      leaf: 0x9000971 150997361 (73: row:377.377 avs:833)
      leaf: 0x9000922 150997282 (74: row:377.377 avs:833)
      leaf: 0x900090f 150997263 (75: row:377.377 avs:833)
      leaf: 0x9000b82 150997890 (76: row:134.134 avs:5450)
      leaf: 0x900062b 150996523 (77: row:78.78 avs:6514)
      leaf: 0x9000ba5 150997925 (78: row:377.377 avs:833)
      leaf: 0x9001488 151000200 (79: row:377.377 avs:833)
      leaf: 0x9001212 150999570 (80: row:377.377 avs:833)
      leaf: 0x9000ba0 150997920 (81: row:324.324 avs:1840)
      leaf: 0x9000c41 150998081 (82: row:81.81 avs:6457)
      leaf: 0x9000676 150996598 (83: row:84.84 avs:6400)
      leaf: 0x9000c75 150998133 (84: row:244.244 avs:3360)
      leaf: 0x9000610 150996496 (85: row:82.82 avs:6438)
      leaf: 0x9000c33 150998067 (86: row:377.377 avs:833)
      leaf: 0x9000bd9 150997977 (87: row:198.198 avs:4234)
      leaf: 0x9000c6b 150998123 (88: row:79.79 avs:6495)
      leaf: 0x900014d 150995277 (89: row:294.294 avs:2410)
      leaf: 0x9000475 150996085 (90: row:80.80 avs:6476)
      leaf: 0x9000d2b 150998315 (91: row:161.161 avs:4937)
      leaf: 0x9000c11 150998033 (92: row:74.74 avs:6590)
      leaf: 0x9000275 150995573 (93: row:377.377 avs:833)
      leaf: 0x9000c5c 150998108 (94: row:377.377 avs:833)
      leaf: 0x900064e 150996558 (95: row:377.377 avs:833)
      leaf: 0x9000c57 150998103 (96: row:310.310 avs:2106)
      leaf: 0x9000c42 150998082 (97: row:61.61 avs:6837)
      leaf: 0x9000651 150996561 (98: row:377.377 avs:833)
      leaf: 0x9000bbe 150997950 (99: row:94.94 avs:6210)
----- end tree dump

shrink space compact

When we switch to “alter index shrink space compact” (which is the version that doesn’t lower the index highwater mark), the first striking difference appears in the dbms_space report:

Unformatted                   :           62 /          507,904
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            1 /            8,192
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :        3,045 /       24,944,640
Full                          :          544 /        4,456,448

PL/SQL procedure successfully completed.

Segment Total blocks:        3,712
Object Unused blocks:            0

Basically we had 3,000 blocks reported as Freespace 2 after a coalesce, but now we see those blocks reported as Freespace 4. Are they in the index structure, have they been unlinked, and is the undo/redo going to show anything significantly different because of this change?
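As a quick sanity check on the report above, here is a minimal Python sketch that confirms each byte count is the block count multiplied by the block size, and works out how many blocks of the segment are not in any of the reported categories. The 8KB block size is an assumption (it is the value that makes the byte counts agree), and the names are purely illustrative.

```python
BLOCK_SIZE = 8192  # assumption: 8KB blocks, which is what the byte counts imply

# (blocks, bytes) pairs copied from the dbms_space report above
report = {
    "Unformatted": (62, 507_904),
    "Freespace 1": (0, 0),
    "Freespace 2": (1, 8_192),
    "Freespace 3": (0, 0),
    "Freespace 4": (3_045, 24_944_640),
    "Full": (544, 4_456_448),
}
SEGMENT_TOTAL_BLOCKS = 3_712


def inconsistent_rows(rows, block_size=BLOCK_SIZE):
    """Return the categories whose byte count is not blocks * block_size."""
    return [name for name, (blocks, nbytes) in rows.items()
            if blocks * block_size != nbytes]


accounted = sum(blocks for blocks, _ in report.values())
unaccounted = SEGMENT_TOTAL_BLOCKS - accounted

print("inconsistent rows :", inconsistent_rows(report))
print("blocks accounted  :", accounted)
print("blocks unaccounted:", unaccounted)
```

The blocks left over are plausibly the space management blocks (segment header and bitmap blocks), but the report itself doesn't say so; treat that as a guess to be checked against the segment's extent map.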

In a single stream, here are the things we need to cross-reference to get a better view of what Oracle has done. Some critical redo stats, the undo segment stats and the report of enqueue requests:

Name                                                                     Value
----                                                                     -----
redo entries                                                            47,521
redo size                                                           85,285,860
redo buffer allocation retries                                              13
undo change vector size                                             59,344,340
rollback changes - undo records applied                                    447


USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        1      0   0    0     0      0
   1   0      0      0      0           0       25      0   0    0     0      0
   2   0      0      0      0           0       25      0   0    0     0      0
   3   0      0      0      0           0       25      0   0    0     0      0
   4 -30 -29760      0      0         328       75      0   3    0    55      0
   5   0      0      0      0           0       25      0   0    0     0      0
   6   0      0      0      0           0       25      0   0    0     0      0
   7   0      0      0      0           0       25      0   0    0     0      0
   8 -41 -37952      0      0         410       96      0   5    0    65      0
   9 104  85056      0      0    58635908    14524      0   0  104     0 148492
  10   0      0      0      0           0       25      0   0    0     0      0


Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
CF             2           0           2           0           0 contention
CR           710          17         710           0           2 block range reuse ckpt
IS            71           0          71           0           0 contention
XR             1           0           1           0           0 database force logging
TM             1           0           1           0           0 contention
TX         3,567           0       3,567           0           0 contention
US           116           0         116           0           0 contention
HW           348           0         348           0           0 contention
SK             1           0           1           0           0 contention
TT           232           0         232           0           0 contention
SJ             2           0           2           0           0 Slave Task Cancel
CU             1           0           1           0           0 contention
JG           357           0         357           0           0 queue lock
JG            34           0          34           0           0 q mem clnup lck
JG           357           0         357           0           0 contention

The redo requirement has increased from 34,800 entries and 76MB to 47,500 entries and 85MB, aided by an increase of 4MB in the undo. Cross-checking to the undo segment stats (v$rollstat) we see an alarming difference – the volume of writes agrees with the session stats, but almost all of it takes place on one undo segment; that could be really nasty if it means the whole shrink is performed as a single transaction!

Luckily we can see in the enqueue stats that we still have a large number of transaction (TX) enqueues, though the number has gone up from about 3,000 to 3,500. (That’s an interesting difference given the “shrunk” index consists of about 500 leaf blocks – and while we’re thinking about that, it might be interesting that 4MB of undo seems to be approximately 500 (leaf?) blocks * 8KB!)
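The arithmetic behind those two observations can be sketched in a few lines of Python. The figures are copied from the statistics above; the 8KB block size and the round figure of 500 leaf blocks are the same approximations used in the text, not measured values.

```python
BLOCK_SIZE = 8192  # assumption: 8KB blocks

undo_change_vector_size = 59_344_340  # session statistic quoted above

# Non-zero "Writes" figures from the v$rollstat report, keyed by USN
rollstat_writes = {4: 328, 8: 410, 9: 58_635_908}

total_writes = sum(rollstat_writes.values())
busiest_usn = max(rollstat_writes, key=rollstat_writes.get)
busiest_share = rollstat_writes[busiest_usn] / total_writes

# "approximately 500 (leaf?) blocks * 8KB"
approx_leaf_blocks = 500
block_image_bytes = approx_leaf_blocks * BLOCK_SIZE

print(f"total undo segment writes : {total_writes:,}")
print(f"share written by USN {busiest_usn}   : {busiest_share:.4%}")
print(f"writes / session undo     : {total_writes / undo_change_vector_size:.3f}")
print(f"500 block images          : {block_image_bytes:,} bytes")
```

The point of the last line is only that 500 full block images come to roughly 4MB, which is the size of the increase in undo; it is a coincidence worth noting, not proof of the mechanism.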

Let’s take a look at the before and after versions of the treedump. Because I was using a clean tablespace to re-run the tests, and because I managed to keep restarting on the same v$process.pid for many of the tests, the order in which I used data blocks was unchanged from test to test, so the index for the “compact” test started out exactly the same as it was for the “coalesce” test:

Before
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:222.47 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.44 avs:3835)
      leaf: 0x900153e 151000382 (2: row:209.43 avs:4025)
      leaf: 0x900058d 150996365 (3: row:230.44 avs:3626)
      leaf: 0x90013a8 150999976 (4: row:229.45 avs:3645)
      leaf: 0x9000ae1 150997729 (5: row:411.88 avs:187)
      leaf: 0x900031c 150995740 (6: row:227.50 avs:3683)
      leaf: 0x90014d3 151000275 (7: row:229.42 avs:3645)
      leaf: 0x9000aec 150997740 (8: row:226.46 avs:3702)
      leaf: 0x90014f3 151000307 (9: row:226.57 avs:3702)
      leaf: 0x9000593 150996371 (10: row:219.46 avs:3835)
      leaf: 0x9001559 151000409 (11: row:223.54 avs:3759)
      leaf: 0x9000a9d 150997661 (12: row:210.33 avs:4006)
      leaf: 0x900152e 151000366 (13: row:215.30 avs:3911)
      leaf: 0x900018a 150995338 (14: row:258.52 avs:3094)

After
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000427 150996007 (-1: nrow: 64, level: 1)
      leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)
      leaf: 0x900011b 150995227 (0: row:377.377 avs:833)
      leaf: 0x900016e 150995310 (1: row:377.377 avs:833)
      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)
      leaf: 0x900011e 150995230 (3: row:377.377 avs:833)
      leaf: 0x9000309 150995721 (4: row:377.377 avs:833)
      leaf: 0x9000239 150995513 (5: row:377.377 avs:833)
      leaf: 0x90001eb 150995435 (6: row:377.377 avs:833)
      leaf: 0x90001d3 150995411 (7: row:377.377 avs:833)
      leaf: 0x9000197 150995351 (8: row:377.377 avs:833)
      leaf: 0x900031f 150995743 (9: row:377.377 avs:833)
      leaf: 0x9000369 150995817 (10: row:377.377 avs:833)
      leaf: 0x9000216 150995478 (11: row:377.377 avs:833)
      leaf: 0x9000332 150995762 (12: row:377.377 avs:833)
      leaf: 0x900023b 150995515 (13: row:377.377 avs:833)
      leaf: 0x900033a 150995770 (14: row:377.377 avs:833)

A particularly interesting detail of the “after compact” treedump appears in the first (highlighted) level 1 branch block: it has changed its address (even though its first leaf block hasn’t).

A less obvious detail in the after compact extract is that none of the leaf block addresses is particularly large. Before compacting, some of the block addresses ended with 4 non-zero digits; after compacting, the highest block address we can see is 0x9000370 with only 3 trailing non-zero digits. And if we sort all the leaf blocks by block address we see the following:

      leaf: 0x9000105 150995205 (59: row:377.377 avs:833)
      leaf: 0x9000106 150995206 (37: row:377.377 avs:833)
      leaf: 0x9000107 150995207 (5: row:377.377 avs:833)
      leaf: 0x9000108 150995208 (3: row:377.377 avs:833)
      leaf: 0x9000109 150995209 (29: row:377.377 avs:833)
      leaf: 0x900010a 150995210 (52: row:377.377 avs:833)
      leaf: 0x900010b 150995211 (35: row:377.377 avs:833)
      leaf: 0x900010c 150995212 (53: row:377.377 avs:833)
      leaf: 0x900010d 150995213 (28: row:377.377 avs:833)
      leaf: 0x900010e 150995214 (60: row:377.377 avs:833)
      leaf: 0x900010f 150995215 (10: row:377.377 avs:833)
...
      leaf: 0x9000428 150996008 (8: row:377.377 avs:833)
      leaf: 0x9000429 150996009 (20: row:377.377 avs:833)
      leaf: 0x900042a 150996010 (19: row:377.377 avs:833)
      leaf: 0x900042b 150996011 (8: row:377.377 avs:833)
      leaf: 0x900042c 150996012 (55: row:377.377 avs:833)

With a few small gaps for the space management blocks and a couple of big jumps where table extents have been allocated between index extents, we can see a completely contiguous array of blocks. (And this “shuffling” of blocks explains most of the extra undo, redo and transaction count.)
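The sort-and-look-for-gaps step is easy to script. What follows is a minimal Python sketch, not the method used to produce the listing above: it parses the leaf lines of a treedump, splits each block address into file number and block number, and reports where the sorted block numbers stop being consecutive. The split into a 10-bit relative file number and a 22-bit block number is the usual layout for a smallfile tablespace, and is assumed to apply here; the function names are illustrative.

```python
import re

# Matches treedump lines such as:
#   leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)
LEAF_RE = re.compile(
    r"leaf:\s+0x([0-9a-fA-F]+)\s+(\d+)\s+\((-?\d+):\s+row:(\d+)\.(\d+)\s+avs:(\d+)\)"
)


def decode_dba(dba):
    """Split a block address into (relative file number, block number).

    Assumes the smallfile layout: top 10 bits file, low 22 bits block.
    """
    return dba >> 22, dba & 0x3FFFFF


def parse_leaves(lines):
    """Return one dict per leaf line; branch lines and other text are ignored."""
    leaves = []
    for line in lines:
        m = LEAF_RE.search(line)
        if m:
            leaves.append({
                "dba": int(m.group(1), 16),
                "rows": int(m.group(4)),
                "avs": int(m.group(6)),
            })
    return leaves


def gaps(leaves):
    """Return (previous, next) block-number pairs that are not consecutive."""
    blocks = sorted(decode_dba(leaf["dba"])[1] for leaf in leaves)
    return [(a, b) for a, b in zip(blocks, blocks[1:]) if b != a + 1]


# A few lines from the "After" treedump above
sample = [
    "   branch: 0x9000427 150996007 (-1: nrow: 64, level: 1)",
    "      leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)",
    "      leaf: 0x900011b 150995227 (0: row:377.377 avs:833)",
    "      leaf: 0x900016e 150995310 (1: row:377.377 avs:833)",
    "      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)",
]

leaves = parse_leaves(sample)
print("file/block of first leaf:", decode_dba(leaves[0]["dba"]))
print("gaps in sorted blocks   :", gaps(leaves))
```

With only four leaf lines the sample is mostly gaps, of course; fed the whole treedump, the list should shrink to the space management blocks and the jumps between extents.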

There is the question, of course, of whether Oracle does all the back-filling before rearranging the blocks, or whether it relocates a block as soon as it has filled it. There’s a fairly big hint in the quote from the manuals that said: “Concurrent DML operations are blocked for a short time at the end of the shrink operation when the space is deallocated” but we can answer the question fairly easily by looking at the redo log that we’ve dumped.

If we use grep to pick out just the index-related OP Codes (layer 10) and the 5.4 (commit) OP codes and look at the last few lines of the result we can spot a repeating pattern like the following (which I’ve edited to shorten the lines):

DBA:0x09000c24 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.6 
DBA:0x09001674 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.6 
DBA:0x09000105 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.8 
DBA:0x0900033c OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.11 
DBA:0x09000c24 OBJ:143721 SCN:0x0000000002542c13 SEQ:2 OP:10.11 
DBA:0x09000e5b OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.40 
DBA:0x09001674 OBJ:143721 SCN:0x0000000002542c13 SEQ:2 OP:10.34 
DBA:0x04402720 OBJ:4294967295 SCN:0x0000000002542c13 SEQ:2 OP:5.4 

This, plus a few variations, happens 441 times at the end of the dump. The OP Codes translate to:

  • Lock block 0x09000c24 (10.6)
  • Lock block 0x09001674 (10.6)
  • Initialize leaf block 0x09000105 (10.8) – copying block 0x09001674
  • Set previous pointer on leaf block 0x0900033c (10.11) – to point backwards to 0x09000105
  • Set next pointer on leaf block 0x09000c24 (10.11) – to point forwards to 0x09000105
  • Update branch block 0x09000e5b (10.40) – to point to 0x09000105 instead of 0x09001674
  • Make empty leaf block 0x09001674 (10.34)
  • commit (5.4)

This pattern shows Oracle copying leaf blocks to the start of the segment and wiping the original clean (the small number of variants are for branch blocks). And this happens only after the task of filling all the leaf blocks to the correct level has finished.
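To count how often the pattern appears, rather than eyeballing the grep output, something like the following Python sketch could be used. It groups the change vectors into transactions at each commit (5.4) and picks out the source and target of the block copy. The OP code descriptions are the ones listed above; the assumption that each move has exactly one 10.8 and one 10.34 comes from the pattern shown, and the "few variations" mentioned would need extra handling.

```python
import re

# Descriptions as given in the list above
OP_NAMES = {
    "10.6": "lock block",
    "10.8": "initialize leaf block",
    "10.11": "set previous/next pointer",
    "10.40": "update branch block",
    "10.34": "make leaf block empty",
    "5.4": "commit",
}

REC_RE = re.compile(r"DBA:0x([0-9a-fA-F]+).*?OP:(\d+\.\d+)")


def group_by_commit(lines):
    """Split change vectors into transactions, each one ending at an OP:5.4."""
    transactions, current = [], []
    for line in lines:
        m = REC_RE.search(line)
        if not m:
            continue
        dba, op = int(m.group(1), 16), m.group(2)
        current.append((op, dba))
        if op == "5.4":
            transactions.append(current)
            current = []
    return transactions


def describe_move(tx):
    """Return (source, target) block addresses for a block-move transaction.

    The target is the block being initialized (10.8), the source is the
    block being emptied (10.34). Returns None if either is missing.
    """
    target = next((dba for op, dba in tx if op == "10.8"), None)
    source = next((dba for op, dba in tx if op == "10.34"), None)
    if target is None or source is None:
        return None
    return source, target


# The eight (shortened) redo records shown above
sample = [
    "DBA:0x09000c24 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.6",
    "DBA:0x09001674 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.6",
    "DBA:0x09000105 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.8",
    "DBA:0x0900033c OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.11",
    "DBA:0x09000c24 OBJ:143721 SCN:0x0000000002542c13 SEQ:2 OP:10.11",
    "DBA:0x09000e5b OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.40",
    "DBA:0x09001674 OBJ:143721 SCN:0x0000000002542c13 SEQ:2 OP:10.34",
    "DBA:0x04402720 OBJ:4294967295 SCN:0x0000000002542c13 SEQ:2 OP:5.4",
]

transactions = group_by_commit(sample)
for tx in transactions:
    move = describe_move(tx)
    if move:
        print(f"copy 0x{move[0]:08x} -> 0x{move[1]:08x}")
    for op, dba in tx:
        print(f"  {op:6} {OP_NAMES.get(op, 'unknown'):28} 0x{dba:08x}")
```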

As before it’s the initialize (10.8) and the undo (5.1) of the “make empty” (10.34) which generate the largest amount of redo – the 10.8 being effectively an array insert of 377 rows, and the 5.1 being an 8KB block image.

There are other (relatively small) differences, though, between the redo log dumps for coalesce and compact. We noted that the 3,000 “empty” blocks were marked as FS2 for the coalesce but FS4 for the compact. When we compare the OP Code 13.22 (state change for Level 1 BMB) we find that the coalesce reported roughly 3,000 of them with the specific action “Mark Block free”; but compact reported 6,900 of them, of which 3,900 reported “Mark Block free” and 3,027 reported “State Change”. That’s an interesting difference when the space management report tells us that there were 3,027 blocks at FS4 when I had been expecting them to be FS2.

A little detail on the side – the sort of thing that goes into a note for further investigation – is that the redo for the compact reported 8 cases of Op Code 13.22 redo change vectors as “Update Express Allocation Area” in the first level 1 bitmap of the segment for Allocate Area slots 0 to 7 respectively, and had one 13.28 Op Code (Update segment header block) with the same description, supposedly marking the Allocation Area “Full”.

A little extra work with grep (looking for “OP:13.22” or “state:”), then looking at a few lines of the raw trace very closely, then doing a couple more “grep”s, led me to the following summary report:

egrep -n -i  -e "offset:.*state:" -e"state:" *compact*redo*.trc

...
101824:offset: 10 length:1 xidslot:35 state:3
102992:offset: 34 length:1 xidslot:35 state:3
104239:offset: 38 length:1 xidslot:35 state:3
105544:offset: 46 length:1 xidslot:35 state:3
106876:offset: 15 length:1 xidslot:35 state:3
108485:offset: 55 length:1 xidslot:35 state:3
109678:offset: 49 length:1 xidslot:35 state:3
110960:offset: 16 length:1 xidslot:35 state:3
...
3889253:Len: 1 Offset: 63 newstate: 2
3889261:Len: 1 Offset: 62 newstate: 2
3889269:Len: 1 Offset: 61 newstate: 2
3889283:Len: 1 Offset: 60 newstate: 2
3889297:Len: 1 Offset: 59 newstate: 2
3889305:Len: 1 Offset: 58 newstate: 2
3889313:Len: 1 Offset: 57 newstate: 2
3889327:Len: 1 Offset: 56 newstate: 2
3889341:Len: 1 Offset: 55 newstate: 2
3889349:Len: 1 Offset: 54 newstate: 2
3889357:Len: 1 Offset: 53 newstate: 2
3890375:offset: 52 length:1 xidslot:35 state:3
3890387:Len: 1 Offset: 52 newstate: 2
3890400:Len: 1 Offset: 51 newstate: 2
...

The early part of the trace file shows the bitmap changes as index entries are transferred to the (logically) previous index leaf block and a leaf block becomes empty: its “bit” is set to state 3 (which corresponds to FS2). The offset is the location of the “bit” in the space management block and in some cases a range of bits can be set in one call, hence the presence of the length. I haven’t shown you the DBA (block address) of the OP:13.22 which jumps all over the place as Oracle walks the index in key order, but that explains the randomness of the offset – with my 1MB extent definition each bitmap block maps 64 consecutive data blocks from the 128 available to the extent and (logically) consecutive leaf blocks could be in different extents.

In the later part of the trace file – once the rows have been packed into the minimum number of leaf blocks – Oracle starts walking the index “backwards” in “physical” order, i.e. from the last block of the last used extent. Again you really need some way of viewing the DBAs to see this, but there’s a hint in the way the offset decreases from 63 to zero.

If Oracle finds an empty (state 3) block it changes it to state 2 (which corresponds to FS4). If it finds a leaf block that is not empty (trace lines 3890375 and 3890387 in the extract above) it goes through the steps described above to “create full block as near as possible to the start of the segment” / “create empty block here” and flags the empty block as state 3 – and then changes the state from 3 to 2. (You don’t see a state change for the block created near the start of segment as it will be overwriting a pre-existing “full” block. The block changes, the bit doesn’t have to.)
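The two kinds of line in the egrep output can be separated mechanically. The Python sketch below classifies each line as a “mark free” (state:3) or a “state change” (newstate: 2) and picks out the cases where a mark is followed immediately by a change at the same offset, which is the signature of a relocated leaf block described above. The state-to-FS mapping is the one given in the text; matching on offset alone is a heuristic, because this output doesn't carry the DBA of the bitmap block.

```python
import re
from collections import Counter

# 101824:offset: 10 length:1 xidslot:35 state:3
MARK_RE = re.compile(
    r"^(\d+):offset:\s*(\d+)\s+length:\s*(\d+)\s+xidslot:\s*(\d+)\s+state:\s*(\d+)"
)
# 3889253:Len: 1 Offset: 63 newstate: 2
CHANGE_RE = re.compile(r"^(\d+):Len:\s*(\d+)\s+Offset:\s*(\d+)\s+newstate:\s*(\d+)")

STATE_TO_FS = {3: "FS2", 2: "FS4"}  # as described in the text


def parse_events(lines):
    """Turn egrep -n output lines into a list of event dicts, in file order."""
    events = []
    for line in lines:
        m = MARK_RE.match(line)
        if m:
            events.append({"line": int(m.group(1)), "kind": "mark",
                           "offset": int(m.group(2)), "state": int(m.group(5))})
            continue
        m = CHANGE_RE.match(line)
        if m:
            events.append({"line": int(m.group(1)), "kind": "change",
                           "offset": int(m.group(3)), "state": int(m.group(4))})
    return events


def relocated_offsets(events):
    """Offsets where a state 3 mark is followed directly by a change to state 2."""
    return [cur["offset"] for prev, cur in zip(events, events[1:])
            if prev["kind"] == "mark" and prev["state"] == 3
            and cur["kind"] == "change" and cur["state"] == 2
            and cur["offset"] == prev["offset"]]


# A few lines taken from the extract above
sample = [
    "110960:offset: 16 length:1 xidslot:35 state:3",
    "3889349:Len: 1 Offset: 54 newstate: 2",
    "3889357:Len: 1 Offset: 53 newstate: 2",
    "3890375:offset: 52 length:1 xidslot:35 state:3",
    "3890387:Len: 1 Offset: 52 newstate: 2",
    "3890400:Len: 1 Offset: 51 newstate: 2",
]

events = parse_events(sample)
kinds = Counter(e["kind"] for e in events)
print("event counts     :", dict(kinds))
print("relocated offsets:", relocated_offsets(events))
print("final state maps to", STATE_TO_FS[events[-1]["state"]])
```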

The DBAs you can dump are the ones for all the OP:10.8 (initialize new block). In the earlier part of the trace you’ll see the DBA’s in these records jumping about randomly as Oracle walks the index in key order (each one may appear several times in a row as several consecutive leaf blocks may empty themselves backwards into a single leaf block – which is what happened after my 80% deletion). In the later part of the trace file it will probably be fairly easy to see that the “new” blocks are working forwards from the start of the segment. Here’s a little bit of the dump from the later part of my trace file:

3889458:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000105 OBJ:143721 SCN:0x0000000002542c13 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3890921:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000107 OBJ:143721 SCN:0x0000000002542c60 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3892086:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000108 OBJ:143721 SCN:0x0000000002542c73 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3893167:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000109 OBJ:143721 SCN:0x0000000002542c7a SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3894276:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900010a OBJ:143721 SCN:0x0000000002542c85 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3895861:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900010b OBJ:143721 SCN:0x0000000002542cd4 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3896968:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900010c OBJ:143721 SCN:0x0000000002542cdd SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3898133:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900010d OBJ:143721 SCN:0x0000000002542cf0 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3899214:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900010e OBJ:143721 SCN:0x0000000002542cf7 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3900295:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000110 OBJ:143721 SCN:0x0000000002542cfe SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3901348:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000111 OBJ:143721 SCN:0x0000000002542d01 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3902555:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000112 OBJ:143721 SCN:0x0000000002542d1a SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
3903608:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000114 OBJ:143721 SCN:0x0000000002542d1d SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
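
To see the "working forwards" pattern without reading hex, the DBAs can be split into file and block numbers. The following is a minimal Python sketch, not part of the original test; the function name decode_dba is hypothetical, and it assumes the usual smallfile layout of a 32-bit DBA (top 10 bits for the relative file number, low 22 bits for the block number).

```python
# Sketch: split a 32-bit DBA into (relative file number, block number).
# Assumption: smallfile tablespace, i.e. 10 bits of file and 22 bits of block.
BLOCK_BITS = 22

def decode_dba(dba):
    """Return (relative_file_no, block_no) for a 32-bit data block address."""
    return dba >> BLOCK_BITS, dba & ((1 << BLOCK_BITS) - 1)

# DBAs of the OP:10.8 records listed above, in trace order.
dbas = [
    0x09000105, 0x09000107, 0x09000108, 0x09000109, 0x0900010a,
    0x0900010b, 0x0900010c, 0x0900010d, 0x0900010e, 0x09000110,
    0x09000111, 0x09000112, 0x09000114,
]

blocks = [decode_dba(d)[1] for d in dbas]
for dba, block in zip(dbas, blocks):
    print(f"{dba:#010x} -> file {decode_dba(dba)[0]}, block {block}")

# True if the "new" blocks only ever move towards the end of the segment.
print(blocks == sorted(blocks))
```

The first record decodes to file 36, block 261, which agrees with the AFN:36 reported on the same line, and the block numbers rise steadily through the listing.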

So we can see that blocks made empty by the block-shuffling of a “shrink space” on an index are flagged as “state 2”, which then gets reported by dbms_space.space_usage() as FS4, i.e. “75 – 100% free”. It seems a little odd that this should be the case (especially when (a) coalesce and deletes use state 3 / FS2 and (b) shrink space compact marks them as state 3 before immediately changing them to state 2). Possibly, though, this is a little trick to avoid the risk of error when Oracle tries to reduce the highwater mark on a “shrink space”, or to avoid repeating work if this phase of the operation is interrupted and has to be restarted.

Note, however, another little performance threat indicated by this processing. Oracle walks the index in key order to collapse the contents of multiple leaf blocks back to an existing single leaf block; then it re-reads the index in reverse order of extent_id (and block_id within extent – where necessary) as it moves blocks from the high end of the segment to the low end of the segment. For a very large index you may end up physically reading most of it twice, one block at a time.

Shrink space compact – concurrency

The manuals tell us that the “compact” option of shrink space allows the command to complete “online” – i.e. with other activity ongoing. How accurate is this description? Let’s just re-run the test but insert a few randomly scattered values from another session without committing before we start the shrink, and see what happens:

My test data is tiny and my laptop is fairly high powered, so the session doing the shrinking seemed to hang almost instantly, going into a wait for a TX enqueue, timing out every three seconds. When I committed from the second session the shrink finished almost instantly – so three possibilities:

  • It had waited for all TM locks to drop before starting work
  • It had hit the first leaf block with an active transaction and waited for that transaction to commit before continuing
  • It had skipped over any leaf block with an active transaction and only gone into a TX wait on the “phase 2” when it was trying to move leaf blocks towards the start of the segment.

Fortunately I had done a treedump of the index before committing, with the following results (first few lines only):

branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x900042b 150996011 (-1: nrow: 81, level: 1)
      leaf: 0x9000167 150995303 (-1: row:377.377 avs:833)
      leaf: 0x9000126 150995238 (0: row:377.377 avs:833)
      leaf: 0x90001b8 150995384 (1: row:377.377 avs:833)
      leaf: 0x9000354 150995796 (2: row:242.242 avs:3398)
      leaf: 0x9000139 150995257 (3: row:45.45 avs:7141)
      leaf: 0x900034d 150995789 (4: row:377.377 avs:833)
      leaf: 0x9000337 150995767 (5: row:377.377 avs:833)
...

Most of the index had been through the row-packing process, but there were clear indications that Oracle had handled a few leaf blocks differently (lines 6 & 7). Moreover there were some very strange leaf blocks reporting “row:0.0” – all of which had a “high” block address:

      leaf: 0x9000335 150995765 (64: row:377.377 avs:833)
      leaf: 0x900015e 150995294 (65: row:308.308 avs:2144)
      leaf: 0x9000148 150995272 (66: row:50.50 avs:7046)
      leaf: 0x9000a36 150997558 (67: row:0.0 avs:7996)
      leaf: 0x9000356 150995798 (68: row:37.37 avs:7293)
      leaf: 0x900014e 150995278 (69: row:377.377 avs:833)

After I’d committed the second session and the shrink space compact had completed this portion of the index changed to:

      leaf: 0x9000335 150995765 (64: row:377.377 avs:833)
      leaf: 0x900015e 150995294 (65: row:308.308 avs:2144)
      leaf: 0x9000148 150995272 (66: row:50.50 avs:7046)
      leaf: 0x9000356 150995798 (67: row:37.37 avs:7293)
      leaf: 0x900014e 150995278 (68: row:377.377 avs:833)

So it seems that Oracle works its way through the entire compaction process but leaves an empty block for each leaf block above the “anticipated” highwater mark (to allow for easy read-consistency, perhaps) and then waits for each transaction that is holding one of those blocks to commit before removing them from the index structure.

We can do one last check to see if this hypothesis is roughly what happens by looking at the redo log dump and checking timestamps to see when the big TX wait stops and what happens after it is over – and here are some lines from the redo dump showing the last few OP:13.22 codes with their associated timestamps and line numbers:

4364620:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000480 OBJ:143917 SCN:0x00000000025609a6 SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4364629:SCN: 0x00000000025609a7 SUBSCN:  1 09/05/2022 12:15:38
4364632:CHANGE #2 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000480 OBJ:143917 SCN:0x00000000025609a6 SEQ:2 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4364638:SCN: 0x00000000025609a7 SUBSCN:  1 09/05/2022 12:15:38
...
4373099:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09001100 OBJ:143917 SCN:0x000000000255f937 SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4373105:SCN: 0x0000000002560b1f SUBSCN:  1 09/05/2022 12:20:14
...
4373283:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000a00 OBJ:143917 SCN:0x00000000025602b1 SEQ:2 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4373289:SCN: 0x0000000002560b24 SUBSCN:  1 09/05/2022 12:20:14
...
4373466:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000581 OBJ:143917 SCN:0x0000000002560886 SEQ:2 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4373472:SCN: 0x0000000002560b28 SUBSCN:  1 09/05/2022 12:20:14
4373479:SCN: 0x0000000002560b28 SUBSCN:  1 09/05/2022 12:20:14
4373485:SCN: 0x0000000002560b28 SUBSCN:  1 09/05/2022 12:20:14
4373486:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000480 OBJ:143917 SCN:0x00000000025609a7 SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000
4373492:SCN: 0x0000000002560b2a SUBSCN:  1 09/05/2022 12:20:14
4373499:SCN: 0x0000000002560b2a SUBSCN:  1 09/05/2022 12:20:14

Note the gap of nearly 5 minutes between the updates to block 0x09000480 (12:15:38) and the level 1 bitmap block 0x09001100 (12:20:14); the latter is the bitmap block for one of the 3 “problem” blocks with the unusual “row:0.0” entry in the treedump.
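
On a large trace file it is easier to let a script find the gap than to scan for it by eye. This is a rough Python sketch, assuming the SCN lines have been pulled out with their trace line numbers in the format shown above; the function name largest_gap is hypothetical, and the month/day ordering of the date is an assumption that does not matter for a gap within one day.

```python
import re
from datetime import datetime

# A few of the SCN lines from the dump above (trace line number, then the record).
SCN_LINES = """\
4364629:SCN: 0x00000000025609a7 SUBSCN:  1 09/05/2022 12:15:38
4364638:SCN: 0x00000000025609a7 SUBSCN:  1 09/05/2022 12:15:38
4373105:SCN: 0x0000000002560b1f SUBSCN:  1 09/05/2022 12:20:14
4373289:SCN: 0x0000000002560b24 SUBSCN:  1 09/05/2022 12:20:14
"""

STAMP = re.compile(
    r"^(\d+):SCN: \S+ SUBSCN:\s*\d+ (\d\d/\d\d/\d{4} \d\d:\d\d:\d\d)$"
)

def largest_gap(text):
    """Return (seconds, line_before, line_after) for the biggest jump in time."""
    stamps = []
    for line in text.splitlines():
        m = STAMP.match(line.strip())
        if m:
            # Assumed date order: month/day/year.
            when = datetime.strptime(m.group(2), "%m/%d/%Y %H:%M:%S")
            stamps.append((int(m.group(1)), when))
    gaps = [
        ((later[1] - earlier[1]).total_seconds(), earlier[0], later[0])
        for earlier, later in zip(stamps, stamps[1:])
    ]
    return max(gaps)

print(largest_gap(SCN_LINES))
```

For the sample lines the largest jump is 276 seconds (4 minutes 36 seconds), falling between trace lines 4364638 and 4373105.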

One last little detail to highlight mechanisms – if I sort the leaf blocks from the treedump I took while the shrink was waiting this is what the last few lines look like:

      leaf: 0x90004ac 150996140 (77: row:70.70 avs:6666)
      leaf: 0x90004ad 150996141 (69: row:377.377 avs:833)
      leaf: 0x90004ae 150996142 (54: row:49.49 avs:7065)
      leaf: 0x90004af 150996143 (6: row:377.377 avs:833)
      leaf: 0x90004b0 150996144 (52: row:377.377 avs:833)
      leaf: 0x90004b1 150996145 (70: row:377.377 avs:833)
      leaf: 0x90005ed 150996461 (41: row:0.0 avs:7996)
      leaf: 0x9000a36 150997558 (67: row:0.0 avs:7996)
      leaf: 0x9001105 150999301 (9: row:0.0 avs:7996)

Check the last three “outliers” in this list and compare them with the bitmap updates recorded after the 5 minute wait. (The very last OP:13.22 – to block 0x9000480 – is the change from state 3 to state 2.) Oracle does as much as it can as soon as it can, and then clears up the last few messy bits while letting everything else go on without interruption. You may find, however, that the strategy of bypassing and returning to leaf blocks that hold active transactions may – depending on the pattern of data – result in a large number of blocks that Oracle has not been able to pack to the greatest possible extent.
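
Sorting a treedump by block address is easy to script. The sketch below is illustrative only: it assumes the "leaf:" line format shown in the extracts above, uses a handful of those lines as sample input, and the names parse_leaves and LEAF are hypothetical.

```python
import re

# Sample "leaf:" lines copied from the treedump extract taken during the wait.
TREEDUMP = """\
      leaf: 0x9000335 150995765 (64: row:377.377 avs:833)
      leaf: 0x900015e 150995294 (65: row:308.308 avs:2144)
      leaf: 0x9000148 150995272 (66: row:50.50 avs:7046)
      leaf: 0x9000a36 150997558 (67: row:0.0 avs:7996)
      leaf: 0x9000356 150995798 (68: row:37.37 avs:7293)
      leaf: 0x900014e 150995278 (69: row:377.377 avs:833)
"""

LEAF = re.compile(
    r"leaf: (0x[0-9a-f]+) (\d+) \((-?\d+): row:(\d+)\.(\d+) avs:(\d+)\)"
)

def parse_leaves(text):
    """Turn treedump leaf lines into dicts of address, position, rows, free space."""
    return [
        {
            "dba": int(m.group(1), 16),
            "pos": int(m.group(3)),
            "rows": int(m.group(4)),
            "avs": int(m.group(6)),
        }
        for m in LEAF.finditer(text)
    ]

# Physical order: sort by block address rather than by position in the branch.
leaves = sorted(parse_leaves(TREEDUMP), key=lambda leaf: leaf["dba"])
empty = [hex(leaf["dba"]) for leaf in leaves if leaf["rows"] == 0]

for leaf in leaves:
    print(f'{leaf["dba"]:#x} rows={leaf["rows"]} avs={leaf["avs"]}')
print("empty leaf blocks:", empty)
```

With the sample input the only "row:0.0" block, 0x9000a36, sorts to the end of the list, which is the pattern the full treedump showed.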

If you have a large number of small transactions executing and committing you’ll see the same sort of effect. The impact on the final size of the index (and the number of blocks that haven’t achieved maximum utilisation) will depend very much on the data patterns and the nature of processing done by the competing sessions.

If we need to investigate further, we can always examine the ksq trace. There are a couple of details in this that vary from the coalesce trace. We see the same TM enqueue in mode 2 held for the duration of the process, but instead of an OD enqueue we see an SK enqueue (segment shrink) taken immediately after the TM enqueue and held to the very end of processing.

In the second phase of the processing, as leaf blocks are copied into blocks nearer the start of the segment we see TX enqueues being taken in mode 4 as the process reaches a block that is still holding an active transaction; but this is no different from the normal action for “wait for competing transaction to commit or rollback”, and the enqueue is released as soon as the other session commits.

shrink space (without compact)

I find it fairly amusing that you have to extend the shrink space command if you want it to do less work. Perhaps if it were “compact only” that would feel less strange.

If you omit the “compact” option Oracle moves the highwater mark down the segment to release as many extents as possible back to the tablespace. This was immediately visible in the space management report:

Unformatted                   :            0 /                0
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :            2 /           16,384
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :          544 /        4,456,448

PL/SQL procedure successfully completed.

Segment Total blocks:          640
Object Unused blocks:           82

Critically the Segment Total blocks has dropped to 640: that’s 5 extents of 128 blocks each (remember my tablespace was declared with 1MB uniform extents).
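
The arithmetic behind those figures can be checked in a few lines. This is a small Python sketch using only numbers from the report above; the variable names are hypothetical, and the comment on the 12 "missing" blocks is an assumption rather than something the report states.

```python
# Figures taken from the space management report above.
full_blocks, full_bytes = 544, 4_456_448
fs2_blocks = 2
segment_blocks = 640
unused_blocks = 82

EXTENT_BYTES = 1024 * 1024                  # 1MB uniform extents

block_size = full_bytes // full_blocks      # bytes per block implied by the report
blocks_per_extent = EXTENT_BYTES // block_size
extents = segment_blocks // blocks_per_extent

below_hwm = segment_blocks - unused_blocks  # blocks below the highwater mark
reported = full_blocks + fs2_blocks         # blocks classified by space_usage()

# Assumption: the difference is space management overhead (segment header and
# bitmap blocks), which the free space breakdown does not classify.
overhead = below_hwm - reported

print(block_size, blocks_per_extent, extents, below_hwm, reported, overhead)
```

The report implies an 8KB block size, 128 blocks per extent and 5 extents, with 558 blocks below the highwater mark of which 546 are classified. The remaining 12 would be consistent with two level 1 bitmap blocks per extent plus a level 2 bitmap block and the segment header, but that breakdown is a guess.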

When, to be awkward, I created the index in a tablespace declared with system-allocated extents, and gave it a storage clause of (initial 1M next 8M), it shrank to two extents, one of 128 blocks and one of 440 blocks (rather than the 1,024 blocks implied by the declaration). So shrinking indexes can result in some fairly randomly sized holes in the tablespace – the effect is similar to the trimming that takes place with parallel “create table as select” and “create index”. The effect isn’t a total mess, though, since it is catering for the 1MB, 8MB, 64MB “boundaries” of system-allocated tablespaces and not a purely random trim.

So the next thing we need to look at is the locking that’s going on, and any collateral mechanisms that show us the work Oracle does as it’s adjusting the highwater mark and releasing the extents to the tablespace.

Shrink Space – locking

There was very little difference in volume of undo and redo when comparing shrink space compact with shrink space – the latter averaged a little more than the former, but with the variations due to the occasional restarts and the undo segment stealing the difference wasn’t significant. Critically, of course, there were a number of extra transactions due to the “spare” extents being dropped as the highwater mark was lowered. Each extent dropped required an update to the seg$ table, and each update was executed as a separate transaction – interestingly, although the undo generated by the shrinking was all dumped into a single undo segment, the recursive dropping of the extents rotated through the available undo segments, producing the following type of figures for the rollback statistics:

USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        1      0   0    0     0      0
   1   0      0      0      0        2426       92      0   0    0     0      0
   2   0      0      0      0         782       82      0   0    0     0      0
   3   0      0      0      0        3410      102      0   0    0     0      0
   4   0      0      0      0        1612       87      0   0    0     0      0
   5   0      0      0      0        1260       87      0   0    0     0      0
   6 416  91584      0      0    60289862    16521      0   0  416     0 -62613
   7   0      0      0      0        5202      130      0   0    0     0      0
   8 -380 -90616      0      0        5828      629      0  38    0   -24      0
   9   0      0      0      0        1582       89      0   0    0     0      0
  10   0      0      0      0        1468       87      0   0    0     0      0

You’ll notice that most of the undo segments saw a few writes. A particular side-note in this set of results is the effect on undo segment 8 – while segment 6 grew by 90MB this was at the cost of segment 8 shrinking by 90MB. If you try to shrink several segments one after the other you could seriously disrupt any other long-running activity on the system as each shrink steals (possibly “unexpired”) extents from other undo segments as it grows. (You might even see some time spent waiting on the “US” (undo segment) and “CR” enqueues.)

One of the surprising details of the final phase of the shrink space command was that the TM lock on the underlying table was taken and released twice (in mode 6) after the mode 2 for the phases 1 and 2 was released. Given the number and timing of the CR (reuse block range) enqueues that appeared around this time it’s possible that the first lock was held while the redundant extents were forced to disc, and the second was held while the segment header and extent map blocks were updated and forced to disc. The SK enqueue taken at the very start of the shrink was held all the way through this processing.

Shrink Space – concurrency

As before, concurrent transactions will run uninterrupted, with the shrink skipping blocks which are subject to active transactions in “phase 1”; then in “phase 2”, as blocks are copied from the high end of the segment to the low end of the segment, the shrink will wait for each block that is still subject to an active transaction. The concurrency is good, as the bulk of the work takes place while the shrinking session is holding its TM lock in mode 2.

When we get to the moment when extents are de-allocated and the highwater marks adjusted the session takes and releases two TM locks in mode 6 in rapid succession. If another session manages to update a row (i.e. taking a TM/3 lock on the table) before the shrink session gets its first TM/6 lock the shrinking session will have to wait for the session to commit. It does this in the normal fashion – timing out and restarting every 3 seconds, and checking for a local deadlock every 60 seconds. So any transaction that manages to slip in to update the table between the shrink and the release of space could cause the table to be locked indefinitely. This doesn’t appear to be any worse than the problem introduced by the waits for “locked” leaf blocks as the “compact” tries to copy them towards the start of the segment, though.

Summary (so far)

alter index xxx coalesce is an online operation which reads through the index in order, copying index entries backwards to fill leaf blocks (according to the pctfree setting). This is not a row-by-row process, the session constructs full leaf blocks in private memory and “initialises” them into the database, re-initialising any blocks that have been made empty at the same time. The leaf block that was the last one to contribute index entries to the full block will be used as the target for the next fill – unless the transfer process left it completely empty in which case Oracle starts with the next leaf block in the index.
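
The copy-back rule in that description can be modelled as a toy. The sketch below is a simplified Python illustration of the rule as described, not of Oracle's code: it works on row counts only, treats the capacity as a fixed number of rows, and ignores pctfree, branch block boundaries and active transactions. The function name pack_leaf_blocks is hypothetical.

```python
def pack_leaf_blocks(row_counts, capacity):
    """Return (packed row counts, number of emptied blocks) for a run of leaf blocks."""
    blocks = list(row_counts)
    target, source = 0, 1
    while source < len(blocks):
        # Copy as many entries as will fit from the source back into the target.
        moved = min(capacity - blocks[target], blocks[source])
        blocks[target] += moved
        blocks[source] -= moved
        if blocks[target] < capacity:
            # Source is now empty and the target still has room: try the next leaf.
            source += 1
        elif blocks[source] > 0:
            # Target is full: the last contributor becomes the next target.
            target, source = source, source + 1
        else:
            # Target is full and the contributor was emptied: start with the next leaf.
            target, source = source + 1, source + 2
    packed = [rows for rows in blocks if rows > 0]
    return packed, len(blocks) - len(packed)

# Ten leaf blocks of 75 rows each, roughly what an 80% delete leaves
# in blocks that held 377 rows.
print(pack_leaf_blocks([75] * 10, 377))
```

In this model ten blocks of 75 rows pack down to two blocks holding 377 and 373 rows, leaving eight empty blocks to be unlinked.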

When the process empties a leaf block, Oracle unlinks it from the index structure. This means deleting an entry from the level 1 branch block and modifying the two leaf blocks either side of the empty block so that the “next pointer” of the previous block points to the next leaf block, and the “previous pointer” of the next block points to the previous block. The bitmap entry for the empty block can then be set to report it as “Freespace 2” – ready for re-use anywhere else in the index.
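
The unlinking steps amount to removing a node from a doubly linked list and updating two other structures. Here is a minimal Python sketch of that idea; the Leaf class, the dictionary-based "bitmap" and the block addresses are all hypothetical stand-ins, not Oracle's block formats.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Leaf:
    rows: int
    prev: Optional[int] = None   # address of the previous leaf in key order
    next: Optional[int] = None   # address of the next leaf in key order

def unlink_empty_leaf(leaves, branch, bitmap, dba):
    """Remove an empty leaf from the branch and the leaf chain, then flag it free."""
    leaf = leaves[dba]
    if leaf.rows != 0:
        raise ValueError("only an empty leaf block can be unlinked")
    branch.remove(dba)                        # delete the branch block entry
    if leaf.prev is not None:
        leaves[leaf.prev].next = leaf.next    # previous block skips over it
    if leaf.next is not None:
        leaves[leaf.next].prev = leaf.prev    # next block points back past it
    leaf.prev = leaf.next = None
    bitmap[dba] = "FS2"                       # reported as "Freespace 2"

# Three adjacent leaf blocks (hypothetical addresses); the middle one is empty.
leaves = {
    0x101: Leaf(rows=377, next=0x102),
    0x102: Leaf(rows=0, prev=0x101, next=0x103),
    0x103: Leaf(rows=120, prev=0x102),
}
branch = [0x101, 0x102, 0x103]
bitmap = {0x101: "FULL", 0x102: "FULL", 0x103: "FULL"}

unlink_empty_leaf(leaves, branch, bitmap, 0x102)
print(branch, leaves[0x101].next, leaves[0x103].prev, bitmap[0x102])
```

After the call the branch holds two entries, the two remaining leaves point at each other, and the emptied block is flagged as available for reuse.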

The copy-back process doesn’t cross the boundaries of level 1 branch blocks, so the last leaf block for a branch block may not reach the limit of rows it’s allowed to hold. Also the first leaf block of a branch block is never substituted, and it may (for reasons I don’t know) end up holding less than the expected maximum row count.

It is possible to make a reasonable estimate of the undo and redo generated by a call to coalesce an index.

The coalesce operates as a large number of small transactions (working through the index on pairs of adjacent blocks) and will cycle through all the undo segments in your undo tablespace.

If a coalesce reaches a block that holds an active transaction it will skip the block and move on to the next available leaf block, so a little light activity on the index could significantly reduce the effectiveness of the coalesce and different indexes on the same table could be affected very differently because of the patterns that exist in the data and its indexes.

alter index xxx shrink space [compact] is a two or three phase process depending on whether you include the compact option or not. The first phase seems to be very similar to the work done by the coalesce command and, like the coalesce command, operates as a large number of small transactions, skipping any leaf blocks that contain an active transaction. This allows it to be an online process.

After packing the leaf blocks as much as possible, shrink space moves into “phase 2” where it copies leaf blocks from the “end” of the segment to empty blocks near the beginning of the segment, working backwards down the extents and blocks. If it finds a leaf block with an active transaction while doing this it waits for the transaction to commit using the normal TX mode 4 wait.

At the end of phase 2 the index is packed into the smallest set of blocks at the low end of the segment and all the other blocks allocated below the highwater mark are flagged as “FS4” (75 – 100% free), unlike the blocks for a coalesce which are flagged as “FS2” (25 – 50% free). In both cases this actually means “empty and available for reuse”.

Like the coalesce command “shrink space” holds a TM lock on the table in mode 2 for this activity, but differs in its choice of “secondary” lock, using an SK enqueue in mode 6 rather than an OD enqueue. Unlike the coalesce command the shrink space command restricts the undo generated so far to a single undo segment – which could cause some disruptive side effects if other long running jobs are generating undo at the same time.

Summary of “phase 3” coming soon.

August 22, 2022

Encryption oddity

Filed under: Bugs,LOBs,Oracle,Troubleshooting — Jonathan Lewis @ 12:34 am BST Aug 22,2022

Here’s a strange problem (with possible workaround) that appeared in a thread on the Oracle developer forum a couple of days ago. It looks like the sort of problem that might be a memory overflow problem in a rarely used code path, given that it seems to need a combination of:

  • move LOB column
  • varchar2() declared with character semantics
  • transparent data encryption (TDE)

Here’s a simple script to generate a tiny data set that demonstrates the problem in my 19.11.0.0 system – but you’ll need to enable transparent data encryption if you want it to work (see footnote).

rem
rem     Script:         encryption_problem.sql
rem     Author:         Solomon Yakobson / Jonathan Lewis
rem     Dated:          August 2022
rem
rem     Last tested
rem             21.3.0.0 
rem             19.11.0.0
rem

drop table test_tbl_move purge;

create table test_tbl_move(
        encrypted_column varchar2(9 char) 
                encrypt using 'aes192' 'sha-1' no salt 
                constraint ttm_nn_ec not null,
        clob_column      clob
)
lob (clob_column) store as securefile
;

insert into test_tbl_move
values( '123456789', 'x')
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'test_tbl_move',
                method_opt  => 'for all columns size 1'
        );
end;
/

alter table test_tbl_move move lob(clob_column) store as securefile;

The script creates a table with a clob column (enable storage in row by default), and a single encrypted varchar2() column declared using character semantics viz: “9 CHAR“.

We insert a row, gather stats – just in case – and then try to move the LOB storage (which also has to move the table). Would you expect a result like the following?

alter table test_tbl_move move lob(clob_column) store as securefile
*
ERROR at line 1:
ORA-12899: value too large for column ??? (actual: 27, maximum: 9)

That “maximum: 9” suggests that Oracle is complaining about encrypted_column – but why? Before worrying about that question, though, I wanted to pursue a different point: if you check the original post and compare the error message with the one above you’ll see that the “actual:” value was different. So I ran the entire test again, and again, and again; and by the time I had executed the entire script half a dozen times this is the list of error messages I had collected:

ORA-12899: value too large for column ??? (actual: 27, maximum: 9)
ORA-12899: value too large for column ??? (actual: 31, maximum: 9)
ORA-12899: value too large for column ??? (actual: 29, maximum: 9)
ORA-12899: value too large for column ??? (actual: 29, maximum: 9)
ORA-12899: value too large for column ??? (actual: 26, maximum: 9)
ORA-12899: value too large for column ??? (actual: 21, maximum: 9)

Your first thought might have been that Oracle is trying to copy an encrypted value while forgetting that it had been encrypted, but the variation in the actual lengths makes it look as if Oracle is injecting random data somehow (maybe through a pointer error) and generating garbage as a side-effect. Moreover if you know your encryption you’ll be suspicious of almost all the actual lengths reported because Oracle’s working is as follows:

  • Round the source length up to the next multiple of 16 bytes
  • aes192 encryption (the default encryption) will add 16 bytes (‘nomac’ and no salt)
  • adding a salt (default behaviour) will add a further 16 bytes
  • adding the ‘SHA-1’ integrity (the default) will add another 20 bytes

You can’t get an odd number of bytes as an encrypted value (unless, perhaps, the code thought it was reading a null-terminated character-string and there was a zero in the middle of it).
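
Taking the four bullet points above at face value, the stored length can be worked out in a few lines. This Python sketch is only an arithmetic illustration of those rules; the function name encrypted_length is hypothetical, and it assumes that "round up" leaves a length that is already a multiple of 16 unchanged.

```python
AES_BLOCK = 16   # bytes

def encrypted_length(source_bytes, salt=True, sha1_mac=True):
    """Stored length implied by the rules listed above (a sketch, not Oracle's code)."""
    # Round the source length up to a multiple of 16 bytes.
    # Assumption: an exact multiple of 16 is left as it is.
    padded = -(-source_bytes // AES_BLOCK) * AES_BLOCK
    total = padded + 16      # added by the aes192 encryption itself
    if salt:
        total += 16          # salt (default behaviour)
    if sha1_mac:
        total += 20          # 'SHA-1' integrity
    return total

# The test column: 9 single-byte characters, 'sha-1', no salt.
print(encrypted_length(9, salt=False, sha1_mac=True))

# No combination of options gives an odd number of bytes.
print(all(
    encrypted_length(n, s, m) % 2 == 0
    for n in range(1, 65) for s in (True, False) for m in (True, False)
))
```

Under these rules the 9 byte value in the test would be stored as 52 bytes, and every possible result is even, so the odd values among the reported "actual" lengths cannot be whole encrypted values.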

Workaround

You’ll see in the thread that Solomon Yakobson did a number of experiments to see what effects they had on the test case; but there was one experiment that he didn’t do that happened to be the first one I thought of. (There was a certain amount of luck in the choice, plus a bit of background suspicion from a couple of prior bugs I’d seen, plus it seemed to be virtually the only thing that SY hadn’t tried).

Declaring a varchar2(x CHAR) is fairly rare – and with all the messing around with padding, encoding etc. going on, the code to handle multi-byte character sets might be a fruitful source of bugs. So I re-ran the test, but changed the declaration from varchar2(9 CHAR) to varchar2(9 [byte]), and Oracle happily ran my test to completion.

On its own this isn’t a complete workaround. If you’re running with a multi-byte character set a declaration using character semantics means Oracle allows you to store many more bytes than characters. Conversely, if you use byte semantics you will have to declare your column with a large enough byte count to store the number of (multi-byte) characters you really want – but then that could allow your users to insert more characters than you wanted (unless the character set was a fixed-width character set – but then you could waste a lot of space storing character strings – see this note about “in row” CLOB columns).

So, to use byte semantics with a character limit, you have to adopt a strategy that I once saw at a company running Peoplesoft (I assume it’s been changed since – it was a long time ago). They declared their varchar2() columns with far too many bytes (4 times the required character count) then added a check constraint on the length to restrict the number of characters. (In their case that resulted in tens of thousands of check constraints in the database with an undesirable overhead on dictionary cache latching and parse times).

Here’s an alternative declaration of the table that allows the alter table move command to work and still ensures the correct maximum number of characters in the varchar2() column:

create table test_tbl_move(
        encrypted_column varchar2(18)
                encrypt using 'aes192' 'sha-1' no salt
                constraint ttm_nn_ec not null 
                constraint ttm_ck_ec_len9 check (length(encrypted_column) <= 9),
        clob_column      clob
)
lob (clob_column) store as securefile
/

Table created.

SQL> insert into test_tbl_move values('0123456789','xxx');
insert into test_tbl_move values('0123456789','xxx')
*
ERROR at line 1:
ORA-02290: check constraint (TEST_USER.TTM_CK_EC_LEN9) violated
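
The combined effect of a byte limit on the column and a character limit in the check constraint can be mimicked outside the database. The Python sketch below is an illustration only: the function name check_value is hypothetical, and UTF-8 is used as a stand-in for a multi-byte database character set.

```python
def check_value(value, max_bytes=18, max_chars=9, encoding="utf-8"):
    """Return (fits_byte_limit, fits_char_limit) for a candidate column value."""
    byte_length = len(value.encode(encoding))   # what varchar2(18) limits
    char_length = len(value)                    # what the length() check limits
    return byte_length <= max_bytes, char_length <= max_chars

samples = {
    "nine single-byte characters": "123456789",
    "ten single-byte characters": "0123456789",
    "nine two-byte characters": "\u00e9" * 9,
    "nine three-byte characters": "\u20ac" * 9,
}

for label, value in samples.items():
    print(label, check_value(value))
```

The ten character string fits the 18 byte column but fails the character limit, matching the ORA-02290 above. Nine three-byte characters pass the character limit but need 27 bytes, which is why the declared byte count has to allow for the widest characters you expect to store.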

Footnotes

  1. If, like me, the last time you’d played around with encryption was in 11g you’ll find that a lot has changed in setting it up – not only in the added requirements for pluggable databases but also with a new command framework. (As usual Tim Hall’s blog on the topic is a good starting point if you want to do a quick experiment in a sand box.)
  2. The code sample includes SHA-1 as the integrity algorithm – ‘NOMAC’ is the only alternative, and in any single table the same algorithm has to be used for all encrypted columns. (If you try to use SHA-1 on one column and NOMAC on another as you create the table Oracle will raise “ORA-28379: a different integrity algorithm has been chosen for the table”.) More importantly – a note in the Oracle 21c reference manual states that SHA-1 is deprecated from that version onwards and advises moving from TDE column encryption to TDE tablespace encryption.

August 5, 2022

drop partition

Filed under: Indexing,Infrastructure,Oracle — Jonathan Lewis @ 8:24 pm BST Aug 5,2022

This note is about some testing I did on the consequences of the (new in 12c) “deferred global index maintenance” feature that Oracle introduced as a strategy to reduce the impact of dropping partitions from a partitioned table.

Looking at my notes I see that I created my first test in August 2013 on Oracle 12.1.0.1 – probably after reading Richard Foote’s series on the topic.

At the time I didn’t turn my notes into a blog post but a recent request on the MOS Community Forum (needs a MOS account) prompted me to revisit and extend the tests using 19c.

  1. The Request
  2. The Background
  3. The Model
  4. Tests and Results
  5. Deep Dive
    1. dbms_space
    2. Tree dumps
    3. Dumping Redo
    4. Transactions and locking
  6. Summary
  7. Footnote

The Request

The database is 19.3, two-node RAC with a standby (type and function not specified). There is a table range-partitioned by month holding nearly three years of data. The table size is about 250GB with indexes totalling a further 250GB, and the OP wants to drop the partitions older than one year.

There was an issue doing this on some other environment when running the daily maintenance windows described as: “it consumes a lot of CPU but I could not find a link between both activities”.

Is there anything that could be done to avoid any db impact especially as this is a “production 24” environment?

There was no comment about how many indexes there were (and that’s an important detail), nor how many were global, globally partitioned, or local (also important details) [Ed: information later supplied – 14 indexes, all global], but there was a comment that the pmo_deferred_gidx_maint_job was run immediately after the drop, generated a lot of redo, and was still running after 10 hours – so it’s reasonably safe to assume that global indexes had a big impact since that’s the “partition maintenance operations deferred global index maintenance” job.

From the comment about the system being “production 24” I assume that the target is to come up with a strategy that doesn’t deny access to the users for a few hours, has the least possible impact on what they normally do, and doesn’t require the standby database to be (partially) rebuilt / unavailable.

Since this is Oracle 19c and the OP wants to drop nearly two-thirds of the data (i.e. significantly more than he’s keeping) the “obvious” strategy to investigate is dropping the partitions (online, with update indexes) then rebuilding the global (or globally partitioned) indexes (online).

At a minimum it would be sensible to do some modelling to get some idea of why the other system spent so much time in pmo_deferred_gidx_maint_job as this might allow you to work out either that it wouldn’t be a problem in this system, or that there was a variation on the method that would be better, or that you just don’t want to use the job because you’ve got a good idea of just how nasty it would be.

The Background

Deferred index maintenance means that global index maintenance does not take place when you drop a partition. Historically Oracle would, as part of the drop, delete every single index entry for the dropped partition from every single global index – doing a lot of work and taking a lot of time at the moment of the drop. Deferred maintenance means Oracle simply notes which object_ids no longer exist and then, when reading through a global index, ignores index entries where the rowid includes the object_id of a dropped object.

Note: the rowid stored in a B-tree index for a global index is made up of 4 components that require a total of 10 bytes of storage. In order of appearance these are: (object_id of table partition, tablespace-relative file_number, block_id, row number within block). For a local index or index on a simple heap table the object_id of the table can be inferred from the identity of the index so it is not stored and the rowid takes only 6 bytes of storage.
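
The 10 byte and 6 byte figures follow from the sizes of the components. The following Python sketch packs the four components into bytes to show the arithmetic; the function names are hypothetical, the 10-bit file / 22-bit block split assumes a smallfile tablespace, and the big-endian byte order is an assumption made for illustration rather than a statement about the on-disk format.

```python
import struct

FILE_BITS, BLOCK_BITS, ROW_BITS = 10, 22, 16

def pack_rowid(rel_file, block, row, object_id=None):
    """Pack a rowid as 6 bytes, or 10 bytes when an object_id is included."""
    if not (0 <= rel_file < 1 << FILE_BITS
            and 0 <= block < 1 << BLOCK_BITS
            and 0 <= row < 1 << ROW_BITS):
        raise ValueError("rowid component out of range")
    # File and block share 4 bytes; the row number takes 2 more.
    body = struct.pack(">IH", (rel_file << BLOCK_BITS) | block, row)
    if object_id is None:
        return body                                   # local index / heap table
    return struct.pack(">I", object_id) + body        # global index

def unpack_global_rowid(raw):
    """Return (object_id, rel_file, block, row) from a 10 byte rowid."""
    object_id, dba, row = struct.unpack(">IIH", raw)
    return object_id, dba >> BLOCK_BITS, dba & ((1 << BLOCK_BITS) - 1), row

# Hypothetical values, chosen only to exercise the functions.
local_rowid = pack_rowid(36, 261, 0)
global_rowid = pack_rowid(36, 261, 0, object_id=143721)

print(len(local_rowid), len(global_rowid))
print(unpack_global_rowid(global_rowid))
```

The extra 4 bytes per entry are what allow Oracle to tell, from the index entry alone, which table partition a row belongs to.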

So the benefit of deferred maintenance is that dropping a partition takes virtually no time at all, but (a) Oracle has to clean up the garbage at some point and (b) until the garbage has been cleaned up it has to be read before it can be ignored.
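
The trade-off can be pictured with a toy model. This Python sketch is a simplified illustration, not a description of Oracle's implementation: the "index" is a sorted list of (key, object_id) pairs, the set of dropped object_ids stands in for whatever Oracle records at drop time, and all the names and values are hypothetical.

```python
from bisect import bisect_left

# Hypothetical global index: sorted (key, object_id) pairs, where object_id
# stands in for the partition component of the stored rowid.
index_entries = [(10, 101), (20, 102), (30, 101), (40, 103), (50, 102)]
dropped_object_ids = {101}   # partition dropped with deferred maintenance

def range_scan(entries, low, high, dropped):
    """Read every entry in [low, high]; return (entries visited, entries kept)."""
    visited, kept = 0, []
    for key, object_id in entries[bisect_left(entries, (low,)):]:
        if key > high:
            break
        visited += 1                       # orphans still have to be read
        if object_id not in dropped:       # before they can be ignored
            kept.append((key, object_id))
    return visited, kept

def cleanup(entries, dropped):
    """Deferred maintenance: remove the orphaned entries for good."""
    return [entry for entry in entries if entry[1] not in dropped]

before = range_scan(index_entries, 10, 40, dropped_object_ids)
after = range_scan(cleanup(index_entries, dropped_object_ids), 10, 40, set())
print(before)
print(after)
```

Both scans return the same two entries, but the scan before cleanup visits four entries and the scan after cleanup visits two, which is the cost of point (b) until the work of point (a) has been done.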

A thought about the second point – if Oracle can check for dropped object_ids very efficiently then it doesn’t necessarily matter that you haven’t cleaned out the garbage. The continued presence of the “dropped” index entries won’t make your application run more slowly, it’s just that you won’t have achieved the (possible) benefit of a smaller index that might allow the application to run a little faster.

[Ed: see this comment from Mikhail Velikikh, though, and my replies – there is an optimizer anomaly that means a specific optimizer feature may “disappear” from an index with orphans]

So here’s a hypothesis to explain why the OP’s previous experience of deferred maintenance was very slow: if you update global indexes in real time Oracle does that job as efficiently as possible because it can use key values and rowids from the table segment that it’s dropping to create a “delete array” for the index, which you used to detect in the sorts (rows) session statistics and a strange “insert” statement if you traced the operation:

insert 
        /*+ RELATIONAL("T1") NO_PARALLEL APPEND NESTED_TABLE_SET_SETID NO_REF_CASCADE */   
        into "TEST_USER"."T1"  partition ("P09000") 
select 
        /*+ RELATIONAL("T1") NO_PARALLEL  */
        *  
from    NO_CROSS_CONTAINER ( "TEST_USER"."T1" ) partition ("P09000")  
        delete global indexes

If you defer the maintenance Oracle has to walk through the index in order, one entry at a time, and work out whether or not to delete that entry – and we all know that single-row processing is more expensive than array-processing.

It’s worth noting that there are notes on MOS to support this hypothesis, e.g. Bug 27468233 : ALTER INDEX COALESCE CLEANUP IS GENERATING HUGE AMOUNT OF REDO reports an example of generating 23GB of redo while cleaning up an index of only 1.8GB. (Version 12.2.0.1)

So let’s build a model and do some simple tests.

The Model

I’m going to build a table with 6 (range) partitions and two global indexes. I’ll set up two very different patterns of data for the two indexes to see how much impact the data pattern might have.

I’ll drop the bottom three partitions, then clean up the mess in a variety of ways. There’s the call to the pmo_deferred_gidx_maint_job, which normally runs at 2:00 am daily but can be initiated by a call to dbms_scheduler.run_job; then there’s the dbms_part.cleanup_gidx() procedure that has a couple of options; then there’s a simple call to “alter index coalesce cleanup [only][parallel N]” (which needs improved documentation) and finally, of course, “alter index rebuild [online]”.

For at least some runs of the tests it will be worth enabling SQL_trace to see what happens behind the scenes; and it’s always worth checking the Session Activity Stats – and maybe some activity from some other dynamic performance views as well.

So here’s some code to create the test data set:

rem
rem     Script:         12c_global_index_maintenance_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested:
rem             19.11.0.0
rem

create table t1 
partition by range(id) (
        partition p09000 values less than ( 9000),
        partition p18000 values less than (18000),
        partition p27000 values less than (27000),
        partition p36000 values less than (36000),
        partition p45000 values less than (45000),
        partition p54000 values less than (54000)
)
as
select
        rownum - 1                      id,
        trunc(dbms_random.value(0,600)) n1,
        rpad('x',100)                   padding
from
        all_objects ao
where
        rownum <= 54000
;

create index t1_n1 on t1(n1) pctfree 0;
create index t1_id on t1(id) pctfree 0;

select 
        index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
from 
        user_indexes i, user_segments s 
where 
        i.index_name = s.segment_name 
and     i.table_name='T1' 
and     partitioned = 'NO'
;

alter table t1 drop partition p09000, p18000, p27000 update global indexes; 

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/

select 
        index_name, num_rows, s.blocks, leaf_blocks, status, orphaned_entries 
from 
        user_indexes i, user_segments s 
where 
        i.index_name = s.segment_name 
and     i.table_name='T1' 
and     partitioned = 'NO'
;


For testing purposes I’ve set the index pctfree to 0; and I’ve reported some of the index stats before and after dropping the three partitions so that we can see what the optimizer thinks the indexes look like:

Index size information before drop
==================================
INDEX_NAME             NUM_ROWS     BLOCKS LEAF_BLOCKS STATUS   ORP
-------------------- ---------- ---------- ----------- -------- ---
T1_ID                     54000        256         134 VALID    NO
T1_N1                     54000        256         128 VALID    NO


Index size information after drop
=================================
INDEX_NAME             NUM_ROWS     BLOCKS LEAF_BLOCKS STATUS   ORP
-------------------- ---------- ---------- ----------- -------- ---
T1_ID                     27000        256          68 VALID    YES
T1_N1                     27000        256         128 VALID    YES

Both indexes are valid (which is good for the application) and their segment sizing has not changed. The number of rows has halved in both indexes but the number of (populated) leaf blocks has remained unchanged in one index even though it has halved for the other.

If you dumped a few index leaf blocks the explanation for the changes (and the difference in the changes) would become clear. The number of (non-deleted) index entries in the two indexes is the same, but Oracle is (almost literally) ignoring half of them – the ones that include the object_ids for the original first three table partitions.

The t1_id index is on the (sequential) id and the table is partitioned by id, and we have dropped the partitions that hold (nothing but) the ids less than 27,000. In earlier versions of Oracle this would have immediately deleted all the index entries from the first half of the index, leaving all the leaf blocks in the 2nd half of the index full. Although the index entries are still in the blocks in the first half of the index, Oracle is behaving as if they don’t exist, which means it treats those blocks as empty when calculating the leaf_blocks statistic. The t1_n1 index is on integer values from 0 to 599 randomly distributed across the full range of ids, so by dropping the partitions for ids less than 27,000 we (ought to) have deleted the first half of the index entries for n1 = 0, the first half for n1 = 1 and so on – leaving every index leaf block approximately half empty and still available for inclusion in the leaf_blocks count.

How, then, does Oracle manage to “ignore” the rows that would have been deleted in older versions? We can always enable SQL tracing when gathering stats, run tkprof against the trace file, and look for the SQL that Oracle used – and if that doesn’t reveal all, use the sql_id of the relevant statements to pull their plans from memory. Here’s the query (reformatted) and plan for one of the index stats gathering queries that I pulled from memory after finding it and its sql_id in the tkprof output:

SQL> select * from table(dbms_xplan.display_cursor('gtnd3aphdkp3k'));

SQL_ID  gtnd3aphdkp3k, child number 0
-------------------------------------
select /*+ 
                opt_param('_optimizer_use_auto_indexes' 'on')
                no_parallel_index(t, "T1_ID")  
                dbms_stats  cursor_sharing_exact  use_weak_name_resl 
                dynamic_sampling(0)  no_monitoring  xmlindex_sel_idx_tbl 
                opt_param('optimizer_inmemory_aware' 'false')
                no_substrb_pad  no_expand index(t, "T1_ID") 
        */ 
        count(*)                                          as nrw,
        count(distinct sys_op_lbid(130418, 'L', t.rowid)) as nlb,
        null                                              as ndk,
        sys_op_countchg(substrb(t.rowid, 1, 15), 1)       as clf 
from
        "TEST_USER"."T1" t where "ID" is not null

Plan hash value: 4265068335

--------------------------------------------------------------------------
| Id  | Operation        | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT |       |       |       |   136 (100)|          |
|   1 |  SORT GROUP BY   |       |     1 |    17 |            |          |
|*  2 |   INDEX FULL SCAN| T1_ID | 27000 |   448K|   136   (1)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter((TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T".ROWID)=1 AND "ID" IS NOT NULL))

If you’ve read Cost Based Oracle – Fundamentals you’ll recognise the SQL is typical of the pattern Oracle uses to gather stats on an index, with a couple of sys_op() function calls that dissect rowids to allow Oracle to calculate the number of leaf_blocks in, and clustering_factor of, the index. What’s new, though, is the filter() in the Predicate Information that (presumably) is checking that the rowid belongs to a table partition that still exists. (In other circumstances the “<?>” would be the table-name. The value 8 as the third parameter also appears in queries involving table-expansion with partial indexing.)

Unsurprisingly, if you execute a simple query driven through one of the indexes after dropping partitions you’ll see exactly the same filter() predicate generated for the execution plan for the range scan operation, e.g.:

SQL_ID  822pfkz83jzhz, child number 0
-------------------------------------
select  /*+ index(t1(n1)) */  id from t1 where n1 = 300

Plan hash value: 2152633691

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |       |      1 |        |    44 (100)|     40 |00:00:00.01 |      41 |
|   1 |  TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED| T1    |      1 |     45 |    44   (0)|     40 |00:00:00.01 |      41 |
|*  2 |   INDEX RANGE SCAN                         | T1_N1 |      1 |     45 |     1   (0)|     40 |00:00:00.01 |       3 |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("N1"=300)
       filter(TBL$OR$IDX$PART$NUM(<?>,0,8,0,"T1".ROWID)=1)

The tbl$or$idx$part$num() function is an important item to consider at this point. How much impact will it have on your processing? It’s hard to give a generic answer since it may depend on exactly what your data looks like and whether or not the function can cache its result effectively. It’s also possible that the performance of the function is related to either the number of partitions dropped or the number of partitions still in existence – so that’s a detail that probably has to be tested at the correct scale before you go into production.

More significant, perhaps, is how long that impact is going to be relevant and what savings it has to be balanced against. My thought at this point is that if you drop a partition but don’t clean up the index you reduce your workload by not visiting (possibly cached) table blocks, but you pay for the benefit by calling the function for every index entry you do visit (whether or not it would have required you to visit the table prior to dropping the partition). Maybe it’s okay to leave the index uncleaned for a few days or even a few weeks before you take any steps to clean up the mess; if that’s the case then maybe you can spread a relatively large number of clean-up jobs over a long enough period of time that their impact doesn’t become visible to the users.

Tests and Results

Test number 1:

What does pmo_deferred_gidx_maint_job do?

I had to connect as a dba and grant alter on this job to my normal test user to be able to do the following test. The default role for my test user had also been granted create job, so that might have been needed as well; and you’ll see that I’ve included a number of my typical “v$ snapshot” procedures to measure different aspects of the workload, and I’ve enabled the 10046 trace to see what Oracle does behind the scenes.

alter system flush buffer_cache;

execute snap_enqueues.start_snap
execute snap_events.start_snap
execute snap_my_stats.start_snap
execute snap_redo.start_snap
execute snap_rollstats.start_snap

alter session set events '10046 trace name context forever, level 8';

execute dbms_scheduler.run_job('SYS.PMO_DEFERRED_GIDX_MAINT_JOB',true)

alter session set events '10046 trace name context off';

execute snap_rollstats.end_snap
execute snap_redo.end_snap
execute snap_my_stats.end_snap
execute snap_events.end_snap
execute snap_enqueues.end_snap
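If you don’t have snapshot packages of this kind, a crude substitute for the session statistics is to run a query like the following before and after the test and subtract the values. This is a minimal sketch rather than the code inside the snap_my_stats package, and it assumes the test user has been granted select on v$mystat and v$statname:

```sql
select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'redo entries',
                'redo size',
                'undo change vector size',
                'rollback changes - undo records applied'
        )
order by
        sn.name
;
```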

The most significant discovery in this test was that the package (ultimately) executed two SQL statements:

ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP PARALLEL 1
ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP PARALLEL 1

So really all I needed to do from that point onwards was to worry about investigating variations of the “alter index coalesce” command.

Test number 2:

What does the procedure dbms_part.cleanup_gidx() do?

This procedure takes a schema name and table name as its first inputs with defaults null and has two other parameters called parallel (default 0) and options (default ‘CLEANUP_ORPHANS’); the only other value you can supply for options is ‘COALESCE’. Again the 10046 trace of my wrapper was very useful, as this showed the following SQL when I specified just the schema and table name for the call:

        ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP ONLY
        ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP ONLY

The 19c manual does mention the “ONLY” keyword, but doesn’t explain its significance (but I will in a moment). If I re-ran the test with options set to ‘COALESCE’ the SQL statements changed to:

        ALTER INDEX "TEST_USER"."T1_ID" COALESCE CLEANUP
        ALTER INDEX "TEST_USER"."T1_N1" COALESCE CLEANUP

This did more work than the “cleanup only” run (figures in a moment). When I re-ran the tests setting the parallel option to 2 the same SQL statement appeared with “PARALLEL 2”.
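For reference, calls of the following shape would match the two cases described above. This is a sketch that assumes you are connected as the owner of t1; the arguments are passed positionally (schema, table, parallel, options) so that the example doesn’t depend on the formal parameter names:

```sql
-- defaults: parallel 0, options 'CLEANUP_ORPHANS'  =>  "coalesce cleanup only"
execute dbms_part.cleanup_gidx(user, 'T1')

-- parallel 2, options 'COALESCE'  =>  "coalesce cleanup parallel 2"
execute dbms_part.cleanup_gidx(user, 'T1', '2', 'COALESCE')
```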

So here are some of the most important numbers for the calls to dbms_part.cleanup_gidx(). They are the headline undo and redo figures from the session statistics:

Default: (cleanup_orphans)
==========================
Name                                       Value
----                                       -----
redo entries                              54,283
redo size                             12,412,092
undo change vector size                4,762,432

options=>"Coalesce"
====================
Name                                       Value
----                                       -----
redo entries                              57,427
redo size                             15,518,660
undo change vector size                6,466,264

As you consider those figures, let me remind you that the indexes started out at roughly 1MB each, and we dropped 27,000 rows in three partitions. A quick check of the “cleanup orphans” arithmetic and you can see (with rounding) 54,000 redo entries = (two indexes * 27,000 rows each), and 12.4M redo size / 54K redo entries = 230 bytes per index entry. I’ve highlighted that last result because that’s a number you can use as a baseline to estimate the redo that will be generated by cleaning up global indexes. How many rows will be dropped from the table, how many global indexes have you got on the table – multiply the two together and multiply by 230.
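As a quick sanity check, the baseline arithmetic for this test can be written as a single query. The figure of 230 bytes per index entry is simply the observed value from this small test, not a documented constant, so treat the result as a first estimate only:

```sql
select
        27000                   rows_dropped,
        2                       global_indexes,
        230                     redo_bytes_per_entry,
        27000 * 2 * 230         estimated_redo_bytes    -- 12,420,000
from
        dual
;
```

The result is a close match for the 12.4MB of redo reported for the default (“cleanup orphans”) call.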

Of course, that’s just for “cleanup only” and it assumes that every row appears in every index (which, in a well engineered system, probably won’t be true). Where does the extra 3MB of redo come from? Let’s drop down one more level in the processing and run explicit “alter index” statements through the test harness.

Test number 3:

What does alter index xxx coalesce cleanup [only] do?

Here are the redo and undo summaries from two sets of tests – “coalesce cleanup only” and “coalesce cleanup”:

Index t1_n1                         Cleanup only            Cleanup
-----------                         -------------------------------
redo entries                              27,139             28,176
redo size                              6,209,140          8,930,900
undo change vector size                2,382,564          3,999,900

Index t1_id                         Cleanup only            Cleanup
-----------                         -------------------------------
redo entries                              27,144             28,966
redo size                              6,202,436          6,547,460
undo change vector size                2,379,868          2,461,908

For “coalesce cleanup only” the workload for the two indexes is (effectively) identical – it’s basically the undo and redo from marking 27,000 index entries as deleted and doing nothing else. The blocks have not been cleaned up in any way; that task will be left to future sessions that need to insert entries into a leaf block and find that it is full but has lots of space that can be reclaimed from deleted entries.

When we use the “coalesce cleanup” (i.e. without “only”) Oracle does some extra work, but the amount of work varies significantly depending on the nature of the index: t1_n1 generates an extra 2.7MB of redo, index t1_id generates only another 345KB. That may be a little surprising, but we’ve already had a clue that something like this might happen, and since every other strategy for “cleaning” the indexes comes down to running these variations of the coalesce command we should look a little further into what they do and how they work.

To get a complete picture we’ll have to do some work with the dbms_space package, the index treedump command, dumping redo, and we also ought to take a look at v$rollstat and v$enqueue_stat, but we’ll pursue those tasks in the Deep Dive section.

Test number 4:

What does alter index rebuild online do?

There’s a very important point to check in this test – if your database is in noarchivelog mode the rebuild will be nologging, and you’ll be fooled by the apparent efficiency of the mechanism right up to the point where you go to production and find that you’re generating a huge amount of redo. For the record, my indexes were roughly 64 blocks (512KB) each when rebuilt and produced the following redo figures (and virtually no undo):

Index t1_n1
-----------
redo entries                                 343
redo size                                594,984
redo size for direct writes              527,148
undo change vector size                   18,528
sorts (rows)                              27,058

Index t1_id
-----------
redo entries                                 345
redo size                                625,956
redo size for direct writes              560,092
undo change vector size                   18,096
sorts (rows)                              27,022

I’ve included the sorts statistic as a reminder that there are other (potentially nasty) overheads to consider. And when you do an online rebuild Oracle will have to lock the table briefly, create a journal table (effectively a materialized view log) to capture the changes that go on while the rebuild is running, then apply them when the rebuild is nearly complete; and the rebuild has to be based on a tablescan and sort.

Depending on what fraction of the partitions you are dropping, though, this does look like a very promising option – especially when you have to cater for the problem of shifting the redo to a remote site.
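Before trusting rebuild figures from a test system it’s worth confirming how that system is configured for logging. A minimal check, assuming you have access to v$database, might be:

```sql
-- database level: NOARCHIVELOG here means test redo figures may be misleadingly small
select log_mode, force_logging from v$database;

-- index level: check the logging attribute of the indexes being rebuilt
select index_name, logging from user_indexes where table_name = 'T1';
```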

Deep Dive

We have seen how much redo was generated for both “coalesce cleanup” and “coalesce cleanup only” and have an idea that we know what’s happening, so we will be taking a look at some redo dumps to see if that confirms our suspicion. Before we get to that extreme, though, it’s worth taking a couple of simpler steps.

The dbms_space package

We can use the dbms_space.space_usage() procedure to see the state of blocks in the two index segments before and after the attempts to cleanup. It’s important to remember that for index segments the procedure uses the “FS2” state to report blocks that are “free”, i.e. formatted but contain no data, aren’t in the index structure and can be linked in to the index when an existing block needs to be split.
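A call to the procedure for one of the indexes might look like the following sketch. The variable names and output format are purely illustrative, and the procedure only works for segments in tablespaces using automatic segment space management:

```sql
set serveroutput on

declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;
        m_fs1_bytes             number;
        m_fs2_blocks            number;
        m_fs2_bytes             number;
        m_fs3_blocks            number;
        m_fs3_bytes             number;
        m_fs4_blocks            number;
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1_N1',
                segment_type            => 'INDEX',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        -- For an index only the FS2 ("free") and Full figures are of interest
        dbms_output.put_line('Freespace 2 : ' || m_fs2_blocks  || ' / ' || m_fs2_bytes);
        dbms_output.put_line('Full        : ' || m_full_blocks || ' / ' || m_full_bytes);
end;
/
```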

Here are three sets of results from calls to the procedure showing the state immediately after (and before) the partitions were dropped, then after a “coalesce cleanup only” test, and then after a “coalesce cleanup” test.

Index T1_ID                   Blocks       Bytes    |  After "Cleanup Only" |   After "Cleanup" 
----------------------------------------------------|-----------------------|-------------------
Unformatted                   :    0 /         0    |      0 /         0    |      0 /         0
Freespace 1 (  0 -  25% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 2 ( 25 -  50% free) :    1 /     8,192    |      1 /     8,192    |     67 /   548,864
Freespace 3 ( 50 -  75% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 4 ( 75 - 100% free) :    0 /         0    |      0 /         0    |      0 /         0
Full                          :  134 / 1,097,728    |    134 / 1,097,728    |     68 /   557,056


Index T1_N1                   Blocks       Bytes    |  After "Cleanup Only" |  After "Cleanup"
----------------------------------------------------|-----------------------|-------------------
Unformatted                   :    0 /         0    |      0 /         0    |      0 /         0
Freespace 1 (  0 -  25% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 2 ( 25 -  50% free) :    1 /     8,192    |      1 /     8,192    |     65 /   532,480
Freespace 3 ( 50 -  75% free) :    0 /         0    |      0 /         0    |      0 /         0
Freespace 4 ( 75 - 100% free) :    0 /         0    |      0 /         0    |      0 /         0
Full                          :  128 / 1,048,576    |    128 / 1,048,576    |     64 /   524,288


There are two key points in these figures:

  • cleanup only doesn’t change the state of any space usage information, and any leaf blocks that are “empty” are still linked into the index structure.
  • cleanup doesn’t do anything to release space back to the tablespace; it simply compacts the data into a smaller number of blocks and tags empty blocks as “free” (and takes them out of the index structure – though that’s not “intuitively” obvious from the figures unless you know what FS2 means for indexes).

Index treedumps

If we want to find out how many of the FS2 blocks are in the index structure and how many have been taken out of the tree then we’ll have to do a tree dump – get the object_id of the index and issue:

alter session set events 'immediate trace name treedump level {object id}';
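One way to pick up the object_id and feed it straight into the treedump command from SQL*Plus is sketched below; the substitution variable name m_object_id is just an illustrative choice:

```sql
column object_id new_value m_object_id

select  object_id
from    user_objects
where   object_name = 'T1_N1'
and     object_type = 'INDEX'
;

alter session set events 'immediate trace name treedump level &m_object_id';
```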

Here are the first few lines of the tree dump for the t1_n1 index in the same three states: after the drop, after cleanup only, and after cleanup:

Immediately after drop
branch: 0x9000684 150996612 (0: nrow: 128, level: 1)
   leaf: 0x9000685 150996613 (-1: row:449.449 avs:8)
   leaf: 0x9000686 150996614 (0: row:444.444 avs:4)
   leaf: 0x9000687 150996615 (1: row:444.444 avs:4)
   leaf: 0x9000688 150996616 (2: row:444.444 avs:4)
   leaf: 0x9000689 150996617 (3: row:444.444 avs:4)
   leaf: 0x900068a 150996618 (4: row:444.444 avs:4)
   leaf: 0x900068b 150996619 (5: row:444.444 avs:4)

After "Cleanup only"
branch: 0x9000684 150996612 (0: nrow: 128, level: 1)
   leaf: 0x9000685 150996613 (-1: row:449.214 avs:8)
   leaf: 0x9000686 150996614 (0: row:444.216 avs:4)
   leaf: 0x9000687 150996615 (1: row:444.216 avs:4)
   leaf: 0x9000688 150996616 (2: row:444.226 avs:4)
   leaf: 0x9000689 150996617 (3: row:444.216 avs:4)
   leaf: 0x900068a 150996618 (4: row:444.206 avs:4)
   leaf: 0x900068b 150996619 (5: row:444.231 avs:4)
   leaf: 0x900068c 150996620 (6: row:444.201 avs:4)
 
After "Cleanup"
branch: 0x9000684 150996612 (0: nrow: 64, level: 1)
   leaf: 0x9000685 150996613 (-1: row:446.446 avs:16)
   leaf: 0x9000688 150996616 (0: row:444.444 avs:4)
   leaf: 0x900068a 150996618 (1: row:444.444 avs:4)
   leaf: 0x900068c 150996620 (2: row:444.444 avs:4)
   leaf: 0x900068e 150996622 (3: row:444.444 avs:4)
   leaf: 0x9000690 150996624 (4: row:444.444 avs:4)
   leaf: 0x9000692 150996626 (5: row:444.444 avs:4)
   leaf: 0x9000694 150996628 (6: row:444.444 avs:4)
   leaf: 0x9000696 150996630 (7: row:444.444 avs:4)
   leaf: 0x9000698 150996632 (8: row:444.444 avs:4)
   leaf: 0x900069a 150996634 (9: row:421.421 avs:12)

This is the index where the values 0 to 599 have been spread randomly across the 54,000 different id values, so each n1 value appears in roughly 90 rows – of which we’ve deleted about 45 by dropping the bottom 3 partitions of 6.

For leaf blocks “row:X.Y” means there are X rows in the block directory of which Y would be visible if you ignored the ones marked as committed deletes; “avs:N” shows N bytes of available space in the block.

The t1_n1 index can fit 444 index entries per leaf block and we can see that immediately after the “drop partition” none of them has been marked as deleted. After a “cleanup only” however we can see that (as expected) roughly half the rows in every leaf block have been marked as deleted with half remaining. After a “cleanup” we can see that we’re back to 444 rows per leaf block with no deletions and virtually no freespace.

Notice, however, the way that the leaf block addresses have changed during the cleanup. If we examine just the last 3 digits of the decimal version of the leaf block addresses we start with:

613, 614, 615, 616, 617, 618, 619, 620

but we end with:

613, 616, 618, 620 ..

Effectively, Oracle has “copied back” all the index entries from block 614 and some from block 615 to block 613, detaching the now-empty block 614 from the index structure, then it has copied the remaining rows from block 615 to block 616 and copied back some rows from 617 to block 616, detaching the now-empty block 615. (It’s not likely that Oracle thinks in terms of “copying forward/back”, it’s more likely that Oracle simply reads through the index in order constructing new leaf blocks in private memory and has a simple algorithm for deciding which block to replace – and that algorithm might be something we can infer by looking at the redo dump.)
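If you want to look at any of these leaf blocks in more detail, the decimal block addresses reported in the treedump can be converted to a (relative) file number and block number with the dbms_utility package. A small sketch, using the address of the first leaf block of t1_n1 from the treedump above:

```sql
select
        dbms_utility.data_block_address_file(150996613)         file_no,        -- 36
        dbms_utility.data_block_address_block(150996613)        block_no        -- 1669
from
        dual
;
```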

If we now examine the three tree dumps from index t1_id we can see why the volume of redo generated by the two indexes differs on the final “cleanup” phase of the code.

Immediately after drop
branch: 0x9000784 150996868 (0: nrow: 134, level: 1)
   leaf: 0x9000785 150996869 (-1: row:426.426 avs:7)
   leaf: 0x9000786 150996870 (0: row:421.421 avs:1)
   leaf: 0x9000787 150996871 (1: row:421.421 avs:1)
   leaf: 0x9000788 150996872 (2: row:421.421 avs:1)
   leaf: 0x9000789 150996873 (3: row:421.421 avs:2)
   leaf: 0x900078a 150996874 (4: row:421.421 avs:1)
   leaf: 0x900078b 150996875 (5: row:421.421 avs:1)
...
   leaf: 0x90007c6 150996934 (64: row:400.400 avs:0)
   leaf: 0x90007c7 150996935 (65: row:400.400 avs:0)
   leaf: 0x90007c8 150996936 (66: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (67: row:400.400 avs:0)
   leaf: 0x90007ca 150996938 (68: row:400.400 avs:0)
   leaf: 0x90007cb 150996939 (69: row:400.400 avs:0)

After "Cleanup only"
branch: 0x9000784 150996868 (0: nrow: 134, level: 1)
   leaf: 0x9000785 150996869 (-1: row:426.0 avs:7)
   leaf: 0x9000786 150996870 (0: row:421.0 avs:1)
   leaf: 0x9000787 150996871 (1: row:421.0 avs:1)
   leaf: 0x9000788 150996872 (2: row:421.0 avs:1)
   leaf: 0x9000789 150996873 (3: row:421.0 avs:2)
   leaf: 0x900078a 150996874 (4: row:421.0 avs:1)
   leaf: 0x900078b 150996875 (5: row:421.0 avs:1)
...
   leaf: 0x90007c6 150996934 (64: row:400.0 avs:0)
   leaf: 0x90007c7 150996935 (65: row:400.303 avs:0)
   leaf: 0x90007c8 150996936 (66: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (67: row:400.400 avs:0)

After "Cleanup"
branch: 0x9000784 150996868 (0: nrow: 68, level: 1)
   leaf: 0x90007c7 150996935 (-1: row:303.303 avs:1940)
   leaf: 0x90007c8 150996936 (0: row:400.400 avs:0)
   leaf: 0x90007c9 150996937 (1: row:400.400 avs:0)
   leaf: 0x90007ca 150996938 (2: row:400.400 avs:0)
   leaf: 0x90007cb 150996939 (3: row:400.400 avs:0)
   leaf: 0x90007cc 150996940 (4: row:400.400 avs:0)
   leaf: 0x90007cd 150996941 (5: row:400.400 avs:0)

I’ve shown two sections of the treedump for the first two extracts, the start of the index and the start of the “2nd half” of the index where the id values are in the partitions that we kept. You’ll notice that the number of index entries per leaf block drops from 426 to 400 as we move through the index, that’s just the effect of a sequential id generally getting bigger (42 takes 2 bytes, 42,000 takes 3 bytes, 42,042 takes 4 bytes).

After “cleanup only” all the leaf blocks in the first half of the index show “no rows remaining of 4xx”, while all the leaf blocks in the 2nd half of the index show “400 rows remaining of 400”. There is one special case – the leaf block numbered 65 at address 0x90007c7 – which shows “303 rows remaining of 400”. That must be the block that held the highest few rows from partition p27000.

After the final cleanup we can see that this “mid-point” leaf block has become the “low value” leaf block, and the rest of the index leaf blocks look as if they are completely unchanged. (I think we can assume that the “copy forward/back” code caters for a few boundary conditions that (e.g.) stop Oracle from doing something silly with leaf blocks that are already completely full.)

In this case, then, we can guess that Oracle has simply removed 66 leaf blocks (blocks “-1” to 64) from the index structure and reconnected block number 65 as the starting block. In the previous case the “cleanup” redo disconnected a similar number of leaf blocks, but also rewrote all the blocks that were kept or emptied.

Dumping Redo

The information we have examined so far prompts us to ask two significant questions:

  • Moving from the “cleanup only” to the “cleanup” increased the redo by 345KB in the case of the sequential t1_id index, but by 2.7MB in the case of the “random arrival” t1_n1 index. In both cases the number of leaf blocks to be relinked is very similar, so the difference seems to be due to the work done in compacting roughly 128 leaf blocks down to 64 leaf blocks – how does that generate (2700KB – 345KB)/64 = 36KB per “new” block? We might also wonder why “just” relinking 66 blocks appears to take 345K/66 = 5K per block anyway – that seems a little surprising.
  • For the full cleanup operation does Oracle delete the entries from a few consecutive blocks, compact them and write them and then carry on with the next few blocks; or (less likely) does it walk the entire index deleting entries and then walk the index again to tidy up and compact (logically) adjacent blocks? If the latter then for a very large index (rather than my tiny test) that means we could see two consecutive full scans physically reading the whole index, and possibly generating more redo thanks to delayed block cleanout.
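The per-block figures in the first question can be reproduced from the redo size values reported in Test number 3. This is just the arithmetic, in bytes rather than KB:

```sql
select
        8930900 - 6209140                                       n1_extra_redo,           -- 2,721,760
        6547460 - 6202436                                       id_extra_redo,           --   345,024
        round(((8930900 - 6209140) - (6547460 - 6202436)) / 64) compact_bytes_per_block, --    37,137
        round((6547460 - 6202436) / 66)                         relink_bytes_per_block   --     5,228
from
        dual
;
```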

If we dump the redo log we can answer these questions by extracting some fairly simple details – although the resulting trace file is going to be rather large. For example we could code:

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter index t1_n1 coalesce cleanup;

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set tracefile_identifier='n1_index';
alter system dump redo scn min &start_scn scn max &end_scn /* layer 10 */;
alter session set tracefile_identifier='';


I’ve included a commented out “layer 10” in the dump command to show how you can be selective in what redo you dump. Layer 10 is the set of redo op codes for index-related change vectors. You will find other op codes (in particular 5.1) being dumped as well because Oracle dumps the whole of any redo record containing a change vector of the requested type.

When I dumped the redo for a single index cleanup the trace file was about 45MB – so not something you would want to read in detail – but you could start with a few simple searches, for example:

grep -n "OP:10"       or19_ora_13388_n1_index.trc >temp_n1_op10.txt
grep -n "REDO RECORD" or19_ora_13388_n1_index.trc >temp_n1_record.txt

grep "OP:"  or19_ora_13388_n1_index.trc | sed "s/^.*OP:/OP:/" | sed "s/ .*$//" | sort | uniq -c | sort -n

The first grep example simply extracts all the index-related op-codes (with line numbers from the trace file to make it easier to spot patterns). The second grep does the same for the start of each redo record because those lines also report the length of each record, which may make it possible to find out more about the surprising amount of redo generated by the compacting and relinking.

The third example is just a bit of showing off: it extracts all the op-code lines, cuts the actual OP:nn.nn bit out, then sorts and counts the number of appearances of each op-code. Here’s the result of that last command from one of my tests:

      1 OP:13.24
      1 OP:17.28
      1 OP:24.1
      2 OP:11.5
      3 OP:10.11
      3 OP:11.3
     15 OP:10.14
     18 OP:14.4
     18 OP:22.5
     28 OP:10.7
     29 OP:5.11
     36 OP:22.2
     64 OP:13.22
     68 OP:10.34
     82 OP:10.39
    138 OP:10.8
    205 OP:5.4
    219 OP:10.6
    259 OP:4.1
    324 OP:10.5
    347 OP:5.6
    689 OP:5.2
  27327 OP:10.4
  27833 OP:5.1
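
If you prefer a script to a shell pipeline, the third command can be approximated in Python. This is a sketch rather than a drop-in replacement: the function names are my own, and it assumes (as the dump format suggests) that each CHANGE line carries one op code.

```python
import re
from collections import Counter

# Matches the layer.code part of an op code such as "OP:10.34".
OP_CODE = re.compile(r"OP:(\d+\.\d+)")

def count_op_codes(lines):
    """Count op codes in an iterable of trace-file lines.

    Like the sed command, which matches greedily up to the last "OP:",
    only the last op code on a line is counted.
    """
    counts = Counter()
    for line in lines:
        matches = OP_CODE.findall(line)
        if matches:
            counts[matches[-1]] += 1
    return counts

def report(counts):
    """Format the counts in ascending order, similar to `sort | uniq -c | sort -n`."""
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [f"{n:7d} OP:{op}" for op, n in ordered]

# Three CHANGE lines in the format Oracle writes to the redo dump.
sample = [
    "CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000",
    "CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000695 OBJ:131045 SCN:0x0000000002165407 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000",
    "CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000694 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000",
]
sample_counts = count_op_codes(sample)
for row in report(sample_counts):
    print(row)
```

To run it against a real trace you would pass an open file instead of the sample list, for example `count_op_codes(open("or19_ora_13388_n1_index.trc"))`.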

It’s an obvious guess that the 27,000 OP:10.4 at the bottom are “delete index entry” and most of the OP:5.1 are their corresponding undo change vectors. The OP:10.4 count is a little high, but checking the session activity stats I found that they showed: rollback changes – undo records applied = 376, so some internal transaction rollback took place, and the session stats are a reasonable match for the “excess” OP:10.4, suggesting that at some point we saw a batch of deletes roll back and restart. (NOTE: to be investigated further; this suggests that the entire operation is executing as a number of relatively small transactions and could safely be interrupted – a hypothesis supported by the 205 OP:5.4 (commit/rollback change vectors)). The “excess” OP:10.4 are also closely matched by the OP:10.5 / OP:5.6 figures (restore leaf during rollback / mark user undo record applied).

Before chasing any other details let’s answer the question about whether the compacting takes place during the delete phase or starts after the deletes are all done. If we jump to the end of the “OP:10” file we can check to see if there are OP:10.4 all the way through the file with small patches of other layer 10 op codes scattered throughout, or if the file is continuously doing OP:10.4 and nothing else until a load of other layer 10 op codes appear in the last couple of thousand lines. The answer is that we get regular repeats of a pattern like the following:

OP:10.4 x ~400
OP:10.6 
OP:10.6 
OP:10.6 
OP:10.39 
OP:10.8 
OP:10.34 
OP:10.8 

For example (after lots of 10.4 on blocks 0x09000695 and 0x09000696) we see:

119184:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119210:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000695 OBJ:131045 SCN:0x0000000002165407 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119236:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000694 OBJ:131045 SCN:0x0000000002165405 SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000
119272:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000
119625:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000694 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
120499:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000695 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.34 ENC:0 RBL:0 FLG:0x0000
120862:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000

This translates to:

10.6:  lock block 0x09000696
10.6:  lock block 0x09000695
10.6:  lock block 0x09000694
10.39: update branch block 0x09000684
10.8:  new block 0x09000694
10.34: empty block 0x09000695
10.8:  new block 0x09000696

The pattern then repeats starting with deletes from block 0x09000697 and 0x09000698, and then producing an interesting detail:

DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.6 
DBA:0x09000697 OBJ:131045 SCN:0x000000000216540d SEQ:1 OP:10.6 
DBA:0x09000696 OBJ:131045 SCN:0x000000000216540b SEQ:1 OP:10.6 
DBA:0x09000684 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.39 
DBA:0x09000696 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.8 
DBA:0x09000697 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.34 
DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:2 OP:10.8 

Check line 5 above – it’s another OP:10.8 creating another version of block 0x09000696. And that’s one of the reasons why the compaction creates far more redo than expected. Oracle may recreate a block several times before the block gets to its final compacted state.
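
To see how often a block is recreated across the whole dump you could count the OP:10.8 change vectors per block address. The following is a minimal sketch with a function name of my own choosing; it uses the OP:10.8, OP:10.34 lines from the two cycles quoted above as input, but the same function could be fed the lines of the “OP:10” extract file.

```python
import re
from collections import Counter

# Picks the block address and op code out of a CHANGE line.
CHANGE = re.compile(r"DBA:(0x[0-9a-fA-F]+).*?OP:(\d+\.\d+)")

def rewrite_counts(lines, op="10.8"):
    """Count how many times each DBA is the target of the given op code."""
    counts = Counter()
    for line in lines:
        m = CHANGE.search(line)
        if m and m.group(2) == op:
            counts[m.group(1).lower()] += 1
    return counts

# The tail ends of the two cycles shown above.
lines = [
    "DBA:0x09000694 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.8",
    "DBA:0x09000695 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:10.34",
    "DBA:0x09000696 OBJ:131045 SCN:0x0000000002165409 SEQ:2 OP:10.8",
    "DBA:0x09000696 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.8",
    "DBA:0x09000697 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.34",
    "DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:2 OP:10.8",
]

counts = rewrite_counts(lines)
rewritten_more_than_once = sorted(dba for dba, n in counts.items() if n > 1)
print(rewritten_more_than_once)
```

On this fragment the only block reported is 0x09000696, the one written at the end of the first cycle and again at the start of the second.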

Looking at the detail of the trace file – in particular the pattern of deletes followed by “new block” – it looks as if Oracle deletes all the rows from just two adjacent blocks (perhaps to minimise block-level locking) and then does the best it can with the rows that are left and this may mean (as it does in our fragment) writing one full block and one partial block. For a sparsely populated index it might mean writing just a single partial block, possibly repeating the process for several cycles.

To show the total effect on redo generation I’ve extracted the redo records for a complete cycle of the pattern (excluding several hundred deletes, which generate 228 bytes each). I’ve used egrep to pick out 3 patterns “OP:”, “REDO RECORD” and (using hindsight) “new block has”:

132650:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e90.0134 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
132652:CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540e SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

132656:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e90.018c LEN: 0x0164 VLD: 0x01 CON_UID: 3792595
132658:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x00000000021653f8 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
132661:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132682:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132691:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e91.0100 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
132693:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:2 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132708:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000697 OBJ:131045 SCN:0x000000000216540d SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132717:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e91.01e4 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
132719:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:3 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132734:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x000000000216540b SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

132743:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e92.00d8 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
132745:CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x000000000216540a SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

132749:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e92.0130 LEN: 0x0128 VLD: 0x01 CON_UID: 3792595
132751:CHANGE #1 CON_ID:3 TYP:0 CLS:18 AFN:17 DBA:0x04406fdf OBJ:4294967295 SCN:0x000000000216540f SEQ:4 OP:5.1 ENC:0 RBL:0 FLG:0x0000
132770:CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000684 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000

132779:REDO RECORD - Thread:1 RBA: 0x0002c8.00046e93.0068 LEN: 0x4038 VLD: 0x01 CON_UID: 3792595
132781:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
132784:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe0 OBJ:4294967295 SCN:0x000000000216540f SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
133123:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000696 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.8 ENC:0 RBL:0 FLG:0x0000
133139:new block has 444 rows

133653:REDO RECORD - Thread:1 RBA: 0x0002c8.00046eb5.0010 LEN: 0x20e8 VLD: 0x05 CON_UID: 3792595
133656:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x000000000216540f SEQ:2 OP:5.2 ENC:0 RBL:0 FLG:0x0000
133659:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe1 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
133998:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000697 OBJ:131045 SCN:0x000000000216540f SEQ:1 OP:10.34 ENC:0 RBL:0 FLG:0x0000

134007:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ec6.0010 LEN: 0x0064 VLD: 0x01 CON_UID: 3792595
134009:CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09000680 OBJ:131045 SCN:0x0000000002165409 SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000

134017:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ec6.0074 LEN: 0x3a74 VLD: 0x01 CON_UID: 3792595
134019:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
134022:CHANGE #2 CON_ID:3 TYP:1 CLS:18 AFN:17 DBA:0x04406fe2 OBJ:4294967295 SCN:0x0000000002165410 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
134361:CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000698 OBJ:131045 SCN:0x000000000216540f SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000
134377:new block has 362 rows

134799:REDO RECORD - Thread:1 RBA: 0x0002c8.00046ee4.00c8 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
134801:CHANGE #1 CON_ID:3 TYP:0 CLS:17 AFN:17 DBA:0x044001c0 OBJ:4294967295 SCN:0x0000000002165410 SEQ:2 OP:5.4 ENC:0 RBL:0 FLG:0x0000

This shows a complete transaction (the previous redo record is the commit (OP:5.4) for the few hundred deletes). There are five lines in the output worth picking out: the three redo record headers that report a large value for their LEN, and the two lines that show you how many rows are in the “new” blocks generated by the op codes OP:10.8. (The line numbers below count from the top of the extract, blank lines included.)

  • Line 24 – LEN = 0x4038 = 16,440 bytes. That’s basically the size of two blocks: the OP:5.1 is the old image (8KB) of block 0x09000696, the OP:10.8 is the new image (also 8KB – since it’s full) of the block.
  • Line 30 – LEN = 0x20e8 = 8,424 bytes. That’s basically one block image. The OP:10.34 is “clear block 0x09000697” which requires very little redo, but the OP:5.1 is the old image (8KB) of the block.
  • Line 38 – LEN = 0x3a74 = 14,964 bytes. Again that’s basically the size of two blocks: the OP:5.1 is the old image (8KB) of block 0x09000698, the OP:10.8 is the new image of the block, but it has only 362 rows (of a final 444 rows) in it, so its image dump can be restricted to about two pieces totalling 6.5KB.

All the other bits add up to about 1,200 bytes of redo, so in each cycle the compacting activity generates about 40KB of redo in total. Since we end up with 68 filled blocks the total “extra” redo we got as we switched from “coalesce cleanup only” to “coalesce cleanup” should be around 68 * 40KB = 2.65 MB, which is pretty close to what we actually saw for the t1_n1 index.
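
As a cross-check on the per-cycle figure, here is a small Python sketch that converts the LEN values from the eleven REDO RECORD headers in the extract above and adds them up. The 8,192 byte threshold for “carries at least one block image” is my own shorthand for an 8KB block size, not something reported by the dump.

```python
# LEN values copied, in order, from the REDO RECORD headers of the cycle.
record_lengths_hex = [
    "0x0058", "0x0164", "0x00e4", "0x00e4", "0x0058", "0x0128",
    "0x4038", "0x20e8", "0x0064", "0x3a74", "0x0058",
]
lengths = [int(h, 16) for h in record_lengths_hex]

BLOCK_SIZE = 8192  # assumed 8KB block size

# Records big enough to hold at least one block image, and the rest.
large_total = sum(n for n in lengths if n >= BLOCK_SIZE)
small_total = sum(n for n in lengths if n < BLOCK_SIZE)
cycle_total = large_total + small_total

print(f"block-image records: {large_total:,} bytes")
print(f"all other records:   {small_total:,} bytes")
print(f"whole cycle:         {cycle_total:,} bytes ({cycle_total / 1024:.1f} KB)")
```

The three large records dominate: as listed, everything else in the cycle comes to a little under 1.5KB, so the total stays at roughly 40KB per cycle.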

Transactions and Locking

I commented on the presence of the session statistics reporting “rollback changes – undo records applied”, and mentioned the presence of the OP:5.4 records in the redo. These are all pointers to the coalesce command being operated as a series of smaller transactions rather than one large transaction, and even before I had started to dump and examine the redo I had added v$rollstat and v$enqueuestat to my usual snapshot of dynamic performance views. Here are the results from a typical “coalesce cleanup” test for the t1_n1 index.

---------------------------------
Rollback stats
---------------------------------
USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        2      0   0    0     0      0
   1   0      0      0      0      382184       85      0   0    0     0      0
   2   0      0      0      0      412930       91      0   0    0     0     61
   3   0      0      0      0      404260       90      0   0    0     0      0
   4   0      0      0      0      413412       91      0   0    0     0   -106
   5 -13  -1792      0      0      341294       99      0   2    0    24    -71
   6   0      0      0      0      355986       81      0   0    0     0      0
   7   0      0      0      0      444054       97      0   0    0     0      0
   8 -12  -1728      0      0      418474      114      1   2    0    24    -45
   9   0      0      0      0      422776       92      0   0    0     0     38
  10   0      0      0      0      426976       94      0   0    0     0      0


----------------------------------
System enqueues
----------------------------------
Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
KI             3           0           3           0           0 contention
CR           250          24         250           0          12 block range reuse ckpt
IS            25           0          25           0           0 contention
TM             8           0           8           0           0 contention
TA             2           0           2           0           0 contention
TX           198           0         198           0           0 contention
US             8           0           8           0           0 contention
HW             8           0           8           0           0 contention
TT             4           0           4           0           0 contention
SJ             2           0           2           0           0 Slave Task Cancel
CU             2           0           2           0           0 contention
OD             1           0           1           0           0 Serializing DDLs

-------------------------------------
System REDO stats
-------------------------------------
Name                                       Value
----                                       -----
redo entries                              27,977
redo size                              8,853,604
undo change vector size                3,967,100

I have 10 undo segments online, and each has received about 400KB of undo – which comes to a total of 4MB – which is a very good match to the “undo change vector size” reported for the session. (I didn’t get any system generated rollbacks in the run.)
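
The comparison is easy to reproduce from the two reports. This sketch simply copies the Writes column for undo segments 1 to 10 and the session’s “undo change vector size” into Python; the variable names are mine.

```python
# "Writes" column from the rollback stats report, keyed by USN.
undo_writes = {
    1: 382184, 2: 412930, 3: 404260, 4: 413412, 5: 341294,
    6: 355986, 7: 444054, 8: 418474, 9: 422776, 10: 426976,
}
undo_change_vector_size = 3967100  # from the session redo stats

total_writes = sum(undo_writes.values())
average_per_segment = total_writes / len(undo_writes)
difference_pct = 100 * (total_writes - undo_change_vector_size) / undo_change_vector_size

print(f"total undo written:  {total_writes:,} bytes")
print(f"average per segment: {average_per_segment:,.0f} bytes")
print(f"excess over undo change vector size: {difference_pct:.1f}%")
```

The two figures differ by less than 2%, which is as close as you could expect given that the rollback stats are system-wide.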

We can also see 198 TX enqueues – which is a very good match for the 196 OP:5.4 that I found in the session’s redo dump on this test.

As I commented in the section on redo generation, to handle the “coalesce cleanup” the session walks the index in leaf block order and (with variations dependent on situations like finding leaf blocks with nothing to delete, or finding leaf blocks that become empty after all the deletes have been done):

  • locks two logically adjacent leaf blocks
  • deletes the dropped rows
  • commits
  • locks the leaf blocks again – sometimes with a third
  • packs the outstanding rows downwards
  • relinks leaf blocks and adjusts branch blocks as necessary
  • commits
  • repeats until end of index

(Note – “cleanup only” seems to lock, delete rows and commit each block separately, not in pairs)

This strategy has two side-effects. First, though I don’t think it’s documented, it looks as if you could simply kill the process if it started putting too much stress on your system. All that could happen is that the current (small) transaction would be rolled back, but the rest of the work that had been done up to that moment would persist. I have actually tested this, managing to kill sessions while they were in the middle of deleting a batch of rows (though I’ve never managed to catch an example where a session was in the middle of relinking). After I repeated the coalesce command Oracle simply picked up where it had left off after the previous (small) rollback.

The second side effect is another possible overhead. Partitioned tables tend to be big, and each index clean-up is likely to be a fairly big job generating a lot of redo and undo. How much impact could the undo have on your system? This depends in part on what your undo retention looks like, how long the clean-up takes, and the risk of other sessions running into ORA-01555 errors. In particular there’s an odd problem of undo segment extension causing updates to restart – if an update statement (or delete, or merge with update clause) results in an undo segment extending then the statement will rollback and restart using a different locking strategy.

I don’t know if something of this sort would happen with the deletes from an index coalesce, and the deletes are very small so it might not matter anyway, but the coalesce executes a large number of transactions in a relatively short time period rotating around and gradually filling every available undo segment – which means other large updates are more likely to need to extend an undo segment. And if segment extension does start to become a problem it might happen many times to the coalesce because each individual delete phase is a new transaction that could (in theory) roll back and repeat. So you should be monitoring the session / system activity stats for unusually large numbers for “rollback changes – undo records applied”.

It’s also worth noting that overlapping jobs will sometimes need to do a lot of work to check read (and write) consistency, and to find “upper bound commit” SCNs: if you have a process executing a large number of transactions in a short period of time it can be very expensive for another process to find an “upper bound commit” and you may see it doing a lot of reads of undo segments, reporting a number of “transaction tables consistent read rollbacks” and a large number of “transaction tables consistent reads – undo records applied” in the session stats. (Worst case scenario – the number of transactions could be similar to the number of (used) leaf blocks in the index).

One final point to consider – the report from the (system level) Enqueue Stats showed a handful of TM locks, so I enabled lock tracing for the session to see if any of them came from my session and whether they were likely to be a concurrency threat.

alter session set events 'trace[ksq] disk=medium';

Most of the TM enqueues were from my session, but they appeared only after the coalesce was complete, and they were all related to sys-owned (dictionary) tables:

2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000140-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000061-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000049-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000004-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.903*:ksq.c@9175:ksqgtlctx(): *** TM-0000004B-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***
2022-08-05 19:49:27.903*:ksq.c@9175:ksqgtlctx(): *** TM-00000013-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***

In case you’re wondering, the first TM lock (object 0x140) is the index_orphaned_entry$ table, and Oracle had to lock it at the end of the “coalesce clean up” to delete the rows that associated the three dropped table partitions with the one global index that I had cleaned.
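
The first field after “TM-” is the object id in hexadecimal, so translating the trace lines into decimal object ids (which you can then look up in dba_objects) is a simple parsing job. Here is a sketch in Python, with a function name of my own, that assumes the line format shown above.

```python
import re

# Captures the first id field (the object id, in hex) and the lock mode.
TM_LOCK = re.compile(r"TM-([0-9A-Fa-f]{8})-\S+\s+mode=(\d+)")

def tm_object_ids(lines):
    """Return (decimal object id, lock mode) for each TM enqueue request found."""
    found = []
    for line in lines:
        m = TM_LOCK.search(line)
        if m:
            found.append((int(m.group(1), 16), int(m.group(2))))
    return found

trace = [
    "2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000140-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***",
    "2022-08-05 19:49:27.902*:ksq.c@9175:ksqgtlctx(): *** TM-00000061-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***",
    "2022-08-05 19:49:27.903*:ksq.c@9175:ksqgtlctx(): *** TM-0000004B-00000000-0039DED3-00000000 mode=3 flags=0x401 why=173 timeout=21474836 ***",
]

locks = tm_object_ids(trace)
for object_id, mode in locks:
    print(f"object_id {object_id:>4}  mode {mode}")
```

For the first line this gives object id 320 (0x140), the id I matched to index_orphaned_entry$ on my system.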

Summary

Deferred global index maintenance means you can drop partitions very quickly with virtually no interruption to service and then have Oracle clean up the related index entries at a later point in time. The drawback to deferring the cleanup is that Oracle will (as a minimum) use a row-by-row mechanism to mark all the index entries for the dropped data as deleted – at a cost of a full index scan and about 230 bytes of redo per dropped row per index – compared to a bulk-processing mechanism that can be applied if the index entries are dropped as part of the partition drop processing.

There are basically two mechanisms you can use to clean up the entries that are waiting cleaning:

  • rebuild the index (online, probably)
  • use one of the special coalesce calls on the index

You can best spread the workload and minimise interference with your normal work by micro-managing the clean up, explicitly issuing whichever “alter index coalesce” or “alter index rebuild” command you prefer for each index in turn that needs to be cleaned up.

None of the coalesce command variants returns space to the tablespace; nor do they even drop the highwater mark on the index segment. If you want to return space to the tablespace you will need to execute a “shrink space” on the index after the coalesce is complete. All the coalesce options generate a very large amount of redo (230 bytes per index entry for each delete plus – to compact the remaining rows into the smallest number of blocks – a volume that may be several times larger than the actual volume of data in the index).

The rebuild (online) will generate a much smaller volume of redo – but the penalties include the problem of journalling and applying the changes that took place as the rebuild was running; plus the cost of scanning the table and sorting the data to produce the index.

Footnote

My tests basically cover the worst case scenario (every leaf block has some entries to be deleted) and best case scenario (leaf blocks will either become completely empty on deletion, or will have no rows deleted).

If you do want to enable “instant” index maintenance (i.e. disable deferred maintenance) for a session you could execute:

alter session set "_fast_index_maintenance" = false;

The usual warning about not messing with hidden parameters until you’ve confirmed with Oracle support applies, of course.

July 26, 2022

Hinting

Filed under: Execution plans,Hints,Oracle — Jonathan Lewis @ 1:05 pm BST Jul 26,2022

This is just a lightweight note on the risks of hinting (which might also apply occasionally to SQL Plan Baselines). I’ve just rediscovered a little script I wrote (or possibly last tested/edited) in 2007 with a solution to the problem of how to structure a query to use an index fast full scan (index_ffs) followed by a “table access by rowid” – a path which is not available to the optimizer for select statements (even when hinted, though it became available for deletes and updates in 12c).

It’s possible that this method was something I designed for a client using 9i, but the code still behaves as expected in 11.1.0.7. Here’s the setup and query:

rem
rem     Script:         wildcard.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2007
rem
rem     Last tested
rem             11.1.0.7
rem

create table t1
as
select
        cast(dbms_random.string('a',8) as varchar2(8))  str,
        rpad('x',100)                                   padding
from
        all_objects
where
        rownum <= 10000
;

alter table t1 modify str not null;
create index t1_i1 on t1(str);

begin
        dbms_stats.gather_table_stats(
                user, 't1', 
                cascade => true,
                method_opt => 'for all columns size 1'
        );
end;
/

explain plan for
select  
        /*+ 
                qb_name(main) 
                unnest(@subq1)
                leading(@sel$2fc9d0fe t1@subq1 t1@main)
                index_ffs(@sel$2fc9d0fe t1@subq1(t1.str))
                use_nl(@sel$2fc9d0fe t1@main)
                rowid(@sel$2fc9d0fe t1@main)
        */
        * 
from    t1 
where   rowid in (
                select  /*+ qb_name(subq1) */
                        rowid 
                from    t1 
                where   upper(str) like '%CHD%'
)
;

select * from table(dbms_xplan.display(format=>'outline alias'));

As you can see, I’ve got an IN subquery (query block subq1) to generate a list of rowids from the table for the rows that match my predicate and then my main query (query block main) selects the rows identified by that list.

I’ve added hints to the main query block to unnest the subquery (which will result in a new query block appearing) then do a nested loop from the t1 referenced in subq1 (t1@subq1) to the t1 referenced in main (t1@main), starting with an index fast full scan of t1@subq1 and accessing t1@main by rowid.

The unnest hint was actually redundant – unnesting happened automatically and uncosted. You’ll notice all the other hints are directed at a query block called sel$2fc9d0fe which is the resulting query block name when subq1 is unnested into main.

Here’s the resulting execution plan showing, amongst other details in the Outline Data, that this really was running on 11.1.0.7

Plan hash value: 1953350015

-------------------------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   500 | 65500 |   509   (0)| 00:00:07 |
|   1 |  NESTED LOOPS               |       |   500 | 65500 |   509   (0)| 00:00:07 |
|*  2 |   INDEX FAST FULL SCAN      | T1_I1 |   500 | 10500 |     9   (0)| 00:00:01 |
|   3 |   TABLE ACCESS BY USER ROWID| T1    |     1 |   110 |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$2FC9D0FE
   2 - SEL$2FC9D0FE / T1@SUBQ1
   3 - SEL$2FC9D0FE / T1@MAIN

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      USE_NL(@"SEL$2FC9D0FE" "T1"@"MAIN")
      LEADING(@"SEL$2FC9D0FE" "T1"@"SUBQ1" "T1"@"MAIN")
      ROWID(@"SEL$2FC9D0FE" "T1"@"MAIN")
      INDEX_FFS(@"SEL$2FC9D0FE" "T1"@"SUBQ1" ("T1"."STR"))
      OUTLINE(@"SUBQ1")
      OUTLINE(@"MAIN")
      UNNEST(@"SUBQ1")
      OUTLINE_LEAF(@"SEL$2FC9D0FE")
      ALL_ROWS
      DB_VERSION('11.1.0.7')
      OPTIMIZER_FEATURES_ENABLE('11.1.0.7')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(UPPER("STR") LIKE '%CHD%')

Running the test under 19.11.0.0 (and adding the hint_report option to the dbms_xplan format) this is the resulting plan:

--------------------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   500 | 55000 |    47   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| T1   |   500 | 55000 |    47   (0)| 00:00:01 |
--------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$48592A03 / T1@MAIN

Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$48592A03" "T1"@"MAIN")
      OUTLINE(@"SUBQ1")
      OUTLINE(@"MAIN")
      ELIMINATE_SQ(@"SUBQ1")
      OUTLINE_LEAF(@"SEL$48592A03")
      ALL_ROWS
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(UPPER("T1"."STR") LIKE '%CHD%')

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 5 (U - Unused (1), N - Unresolved (4))
---------------------------------------------------------------------------
   0 -  SEL$2FC9D0FE
         N -  index_ffs(@sel$2fc9d0fe t1@subq1(t1.str))
         N -  leading(@sel$2fc9d0fe t1@subq1 t1@main)
         N -  rowid(@sel$2fc9d0fe t1@main)
         N -  use_nl(@sel$2fc9d0fe t1@main)

   0 -  SUBQ1
         U -  unnest(@subq1)

Clearly the plan has changed – but the hint report says that Oracle has NOT ignored my hints; instead it tells us that they cannot be resolved. If we check the Query Block Name / Object Alias list and the Outline Data we see why: there is no query block named @sel$2fc9d0fe and the reason it doesn’t exist is that the optimizer has applied a previously non-existent transformation ‘eliminate_sq’ (which appeared in 12c) to subq1.
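
The same check can be done mechanically: collect the query block names that the hints are aimed at and compare them with the query block names that appear in the Outline Data. The following Python sketch does that for the hints and the 19c outline shown above. The function name and the regular expressions are my own, and this is only a rough screen: as the Hint Report codes show, a hint can name a query block that exists and still end up unused.

```python
import re

# A query block named as the target of a hint, e.g. leading(@sel$2fc9d0fe ...).
HINT_TARGET = re.compile(r'\(\s*@"?([A-Za-z0-9_$#]+)"?')
# Query blocks listed in Outline Data as OUTLINE(@"...") or OUTLINE_LEAF(@"...").
OUTLINE_QB = re.compile(r'OUTLINE(?:_LEAF)?\(@"([^"]+)"\)')

def unresolved_targets(hint_text, outline_text):
    """Return hinted query block names that never appear in the outline."""
    hinted = {qb.upper() for qb in HINT_TARGET.findall(hint_text)}
    known = {qb.upper() for qb in OUTLINE_QB.findall(outline_text)}
    return sorted(hinted - known)

hints = """
    qb_name(main)
    unnest(@subq1)
    leading(@sel$2fc9d0fe t1@subq1 t1@main)
    index_ffs(@sel$2fc9d0fe t1@subq1(t1.str))
    use_nl(@sel$2fc9d0fe t1@main)
    rowid(@sel$2fc9d0fe t1@main)
"""

outline_19c = """
    FULL(@"SEL$48592A03" "T1"@"MAIN")
    OUTLINE(@"SUBQ1")
    OUTLINE(@"MAIN")
    ELIMINATE_SQ(@"SUBQ1")
    OUTLINE_LEAF(@"SEL$48592A03")
"""

missing = unresolved_targets(hints, outline_19c)
print(missing)
```

Run against the 11.1.0.7 outline, where OUTLINE_LEAF(@"SEL$2FC9D0FE") is present, the same function returns an empty list.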

So, on the upgrade from 11.1.0.7 to 19.11.0.0 an SQL Plan Baseline that forced the path we wanted would no longer work (though it might be reported as “applied”) because there is a new transformation that we had (necessarily) not been blocking.

The solution is easy: add the hint no_eliminate_sq(@subq1) to our query and try again.

We still get the full tablescan even though the hint report tells us that the optimizer used the new hint. Here’s the new Outline Data, and the Hint Report showing that the hint was used.

  Outline Data
-------------
  /*+
      BEGIN_OUTLINE_DATA
      FULL(@"SEL$8C456B9A" "T1"@"SUBQ1")
      OUTLINE(@"SUBQ1")
      OUTLINE(@"MAIN")
      UNNEST(@"SUBQ1")
      OUTLINE(@"SEL$2FC9D0FE")
      ELIMINATE_JOIN(@"SEL$2FC9D0FE" "T1"@"MAIN")
      OUTLINE_LEAF(@"SEL$8C456B9A")
      ALL_ROWS
      DB_VERSION('19.1.0')
      OPTIMIZER_FEATURES_ENABLE('19.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(UPPER("STR") LIKE '%CHD%')

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 7 (U - Unused (4))
---------------------------------------------------------------------------
   0 -  SUBQ1
           -  no_eliminate_sq(@subq1)
           -  qb_name(subq1)

   1 -  SEL$8C456B9A
         U -  leading(@sel$2fc9d0fe t1@subq1 t1@main)
           -  qb_name(main)

   1 -  SEL$8C456B9A / T1@MAIN
         U -  rowid(@sel$2fc9d0fe t1@main)
         U -  use_nl(@sel$2fc9d0fe t1@main)

   1 -  SEL$8C456B9A / T1@SUBQ1
         U -  index_ffs(@sel$2fc9d0fe t1@subq1(t1.str))

But now the Outline Data is showing us a new hint – eliminate_join(@sel$2fc9d0fe t1@main). So we’re not losing the subquery, but we’ve lost the join thanks to a transformation that was actually available in 10.2 but presumably couldn’t be applied to our code pattern until at least 12.1. So let’s try again adding in another blocking hint no_eliminate_join(@sel$2fc9d0fe t1@main).

We still get the full tablescan – and this time the Outline Data tells us that the problem hint is now eliminate_join(@sel$2fc9d0fe t1@subq1) – which we might have anticipated, and now address by adding no_eliminate_join(@sel$2fc9d0fe t1@subq1) to the query and having one more go. This finally gets us back to the path that we had previously seen in 11.1.0.7.

Summary

This note is just another simple demonstration that hints do not guarantee plan stability across upgrades, and that it can take a few experimental steps to discover what’s new in the optimizer that is making your previous set of hints ineffective.

Typically the problem will be the availability of new transformations (or enhancements to existing transformations) which manage to invalidate the old hints before the optimizer has had a chance to consider them.

July 22, 2022

Trim CPU

Filed under: Execution plans,Hash Join,Joins,Oracle,Performance,Problem Solving — Jonathan Lewis @ 6:56 am BST Jul 22,2022

Prompted by an unexpectedly high CPU usage on a hash join of two large data sets, Stefan Koehler posted a poll on twitter recently asking for opinions on the following code fragment:

FROM
    TAB1
INNER JOIN TAB2 ON
    TAB1.COL1 = TAB2.COL1
AND TRIM(TAB1.COL3) > TRIM(TAB2.COL3)

While I struggle to imagine a realistic business requirement for the second predicate and think it’s indicative of a bad data model, I think it is nevertheless quite instructive to use the example to show how a hash join can use a lot of CPU if the join includes a predicate that isn’t on equality.

Trivia

Before examining the potential for wasting CPU, I’ll just point out two problems with using the trim() function in this way – because (while I hope that col3 is a character string in both tables) I’ve seen code that uses “to_date(to_char(date_column))” instead of trunc(date_column):

Cut-n-paste from SQL*Plus:

SQL> select 1 from dual where trim(100) > trim(20);

no rows selected

==================================================================

SQL> alter session set nls_date_format = 'dd-mon-yyyy hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

20-jul-2022 15:24:46 19-aug-2022 15:26:44

1 row selected.

SQL> alter session set nls_date_format = 'yyyy-mm-dd hh24:mi:ss';

SQL> select d1, d2 from t2 where trim(d1) > trim(d2);

no rows selected

The trim() function converts numerics and dates to strings using the default format for the session before the comparison takes place, so not only can you get unexpected (i.e. wrong) results, two users can get contradictory results from the same data at the same time because they’ve specified different session defaults!
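As a minimal sketch of what is going on in the first example, the implicit conversion can be made explicit: to_char(100) and to_char(20) are compared as strings, and '1' sorts before '2', so the string test fails while the numeric test succeeds. The column aliases are just labels I've made up for illustration; the second statement assumes that d1 and d2 in the t2 table above are date columns.

```sql
-- Illustrative only: spell out the conversion that trim() does implicitly.
-- Expect string_compare = 'N' and number_compare = 'Y'.
select
        case when to_char(100) > to_char(20) then 'Y' else 'N' end  string_compare,
        case when 100 > 20                   then 'Y' else 'N' end  number_compare
from
        dual
;

-- Comparing the dates themselves (no trim, no conversion to string)
-- gives the same answer whatever nls_date_format the session has set.
select d1, d2 from t2 where d1 > d2;
```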

The CPU issue

The critical point that everyone should remember is this: hash joins can only operate on equality (though, to avoid ambiguity, one should point out that “equality” does also mean “not equals”, which is why hash anti-joins can be efficient).

This means that even though the clause “where tab1.col1 = tab2.col1 and tab1.col3 > tab2.col3” might specify the matching rows for an individual tab1 row with high precision and great efficiency for a nested loop join with the right index, a hash join has a completely different workload. Every tab1 row has to have its col3 compared with every tab2 row that matches on col1. The secondary tests multiply up to “n-squared”, and if any col1 value is highly repetitive then the work done on checking col3 becomes excessive.

It’s easier to see this in a worked example, so here’s some sample data:

rem
rem     Script:         trim_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem

create table tab1 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM' and rownum <= 200;

create table tab2 as select * from all_Objects where owner != 'PUBLIC' and object_type != 'SYNONYM';

On a new pdb in 19.11 and 21.3 the second statement gave me roughly 46,000 rows. Checking owners and row counts I got the following results:

SQL> select owner, count(*) from tab1 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                             128
SYSTEM                           65
OUTLN                             7

SQL> select owner, count(*) from tab2 group by owner;

OWNER                      COUNT(*)
------------------------ ----------
SYS                           40104
SYSTEM                          417
OUTLN                             7

... plus about 17 rows aggregating 6,000 rows

And here’s the query (indicating 4 variations) that I’m going to use to demonstrate the CPU issue, followed by its execution plan and rowsource_execution_statistics:

set serveroutput off
alter session set statistics_level = all;

select
        count(*)
from
        tab1
inner join 
        tab2 
on
        tab1.owner = tab2.owner
-- and  trim(tab1.object_name)  > trim(tab2.object_name)
-- and  rtrim(tab1.object_name) > rtrim(tab2.object_name)
-- and  ltrim(tab1.object_name) > ltrim(tab2.object_name)
and     tab1.object_name > tab2.object_name
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

SQL_ID  74m49y5av3mpg, child number 0
-------------------------------------
select  count(*) from  tab1 inner join  tab2 on  tab1.owner =
tab2.owner -- and trim(tab1.object_name)  > trim(tab2.object_name) -- and rtrim(tab1.object_name) > rtrim(tab2.object_name) 
-- and ltrim(tab1.object_name) > ltrim(tab2.object_name) and tab1.object_name > tab2.object_name

Plan hash value: 2043035240

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:00.39 |     942 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:00.39 |     942 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |    101K|    329K|00:00:00.39 |     942 |  1335K|  1335K|  814K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.01 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("TAB1"."OBJECT_NAME">"TAB2"."OBJECT_NAME")

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1; rowset=407)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - (rowset=256) "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Comparing the basic columns, the CPU time recorded at the Hash Join operation was 0.39 seconds, of which only a tiny amount was in the feeding tablescans. There are two things to note from the plan.

First is confirmation of my comments about the join having to be an equality and the inequality being applied later. You can see this in the Predicate Information in the way the user’s predicate list has been split at operation 2 into an access() predicate and a filter() predicate. The access predicate finds the right hash bucket and row(s) within bucket – the filter predicate is applied as a secondary test.

The second point to note is that the Column Projection Information shows us that the basic column values are passed up to the Hash Join, which tells us that the hash join operation has to do the trimming. The big question at that point is – how many times does the same value from the same incoming row get trimmed?

Remember that there are 128 rows in tab1 where owner = ‘SYS’, so when a ‘SYS’ row arrives from tab2 the hash join has to find the right bucket then walk through the rows in that bucket (which will probably be nothing but those SYS rows). So how many times does Oracle evaluate trim(SYS)? Arguably it needs to for each tab1 row in the bucket (though the hash table might have been built to include the trimmed value) but clearly it ought not to re-evaluate it 128 times for the column in the single tab2 row – and we’ll come back to that point later.

Let’s go back to the 3 variants on the first test. We were interested in comparing trim() with trim(), but since trim() is equivalent to ltrim(rtrim()) I wondered if ltrim (left trim) and rtrim (right trim) took different amounts of time, and whether the trim() time would be close to the sum of ltrim() time and rtrim() time.

Without showing the plans etc. here are the times reported in my 19.11.0.0 test at the hash join operation (the 21.3 times were very similar):

  • no trim – 0.39 seconds
  • ltrim() – 1.02 seconds
  • rtrim() – 2.66 seconds
  • trim() – 2.70 seconds.

Clearly that’s a lot of extra CPU on top of the base CPU cost. This is not entirely surprising since string operations tend to be expensive; nevertheless the differences are large enough to be more than random fluctuations and operational error.

Remember that this is just two tables of 200 and 46,000 rows respectively. It turned out that the rowsources that Stefan was using were in the order of 800K and 2M rows – with CPU time increasing from 1,100 seconds to 2,970 seconds because of the trim().

So how many times was the trim() function called in total?

Faking it.

If we assume that the trim() built-in SQL function behaves in the same way as a deterministic PL/SQL function we can at least count the number of calls that take place by writing a deterministic function to put into the SQL. Something like:

create or replace package p1 as
        n1 number;
        function f1(v1 in varchar2) return varchar2 deterministic;
end;
/

create or replace package body p1 as 

        function f1 (v1 in varchar2)
        return varchar2 
        deterministic
        is
        begin
                p1.n1 := p1.n1 + 1;
                return trim(v1);
        end;

end;
/

set serveroutput off
alter session set statistics_level = all;

exec p1.n1 := 0

select
        count(*)
from
    tab1
inner join tab2 on
    tab1.owner = tab2.owner
and     p1.f1(tab1.object_name) > p1.f1(tab2.object_name)
-- and  p1.f1(tab1.object_name) > trim(tab2.object_name)
-- and  trim(tab1.object_name)  > p1.f1(tab2.object_name)
;

select * from table(dbms_xplan.display_cursor(format=>'projection allstats last'));

set serveroutput on
execute dbms_output.put_line(p1.n1);

I’ve created a package with a public variable n1 so that I can set it and read it from “outside”, then I’ve created (and lied about) a function that increments that variable and returns its trimmed input, claiming that it’s deterministic. Once I’ve got the package in place I’ve:

  • set the variable to zero
  • run a query that does one of
    • use my function twice
    • use my function once – on the build table
    • use my function once – on the probe table
  • report the execution plan with stats
  • print the value of the variable

The timings are not really important, but here’s the execution plan when I used the function on both sides of the inequality:

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation           | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |      1 |        |      1 |00:00:21.15 |    1513 |       |       |          |
|   1 |  SORT AGGREGATE     |      |      1 |      1 |      1 |00:00:21.15 |    1513 |       |       |          |
|*  2 |   HASH JOIN         |      |      1 |  23007 |    329K|00:00:21.13 |    1513 |  1335K|  1335K|  860K (0)|
|   3 |    TABLE ACCESS FULL| TAB1 |      1 |    200 |    200 |00:00:00.01 |       5 |       |       |          |
|   4 |    TABLE ACCESS FULL| TAB2 |      1 |  46014 |  46014 |00:00:00.02 |     937 |       |       |          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("TAB1"."OWNER"="TAB2"."OWNER")
       filter("P1"."F1"("TAB1"."OBJECT_NAME")>"P1"."F1"("TAB2"."OBJECT_NAME"))

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=0) COUNT(*)[22]
   2 - (#keys=1)
   3 - (rowset=256) "TAB1"."OWNER"[VARCHAR2,128], "TAB1"."OBJECT_NAME"[VARCHAR2,128]
   4 - "TAB2"."OWNER"[VARCHAR2,128], "TAB2"."OBJECT_NAME"[VARCHAR2,128]

Apart from the change of function name the plan is the same – although it now takes over 21 CPU seconds to complete, of which most of the time is probably building and tearing down the PL/SQL stack. The important figure, though, is the number of function calls I saw recorded in p1.n1: it was a little over 10 million calls to generate the 329 thousand rows (A-Rows for the hash join).

When I ran the code with only one call to my deterministic function it was called 5 million times regardless of whether it was used for the build or probe table. Oracle did nothing to minimise the number of times the function was called.

Predictive Mode

Near the start of this note I showed you a little query to aggregate the rows of the two tables by owner; with a little enhancement I can reuse that code to show you exactly how many times the deterministic function was called:

select
        v1.owner, ct1, ct2, ct1 * ct2, sum(ct1 * ct2) over() tot_ct
from
        (select owner, count(object_name) ct1 from tab1 group by owner) v1,
        (select owner, count(object_name) ct2 from tab2 group by owner) v2
where
        v2.owner = v1.owner
/

OWNER                  CT1        CT2    CT1*CT2     TOT_CT
--------------- ---------- ---------- ---------- ----------
SYS                    128      40104    5133312    5160466
SYSTEM                  65        417      27105    5160466
OUTLN                    7          7         49    5160466

3 rows selected.

The number of comparisons done by the filter() predicate is 5,160,466: double it to get the number of function calls. For every single one of the 40,104 SYS rows in tab2 the function was called for every single one of the SYS rows in tab1, for both sides of the inequality.

It’s a shame that Oracle doesn’t calculate and project the “virtual columns” that will be used in the join predicates, because in my case that would have reduced the number of calls from 10 million to 40,232 – a factor of roughly 250. That would probably be worth a lot of CPU to Stefan.
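As a rough sketch of that arithmetic, the aggregate query can be extended to put the two call counts side by side. The column aliases are my own labels, and the last column is an assumption about what “once per row” would mean: it counts every joining row from all three owners, so it comes out slightly above the SYS-only figure of 40,232 (40,104 + 128).

```sql
-- Illustrative only: calls implied by the observed pattern (two calls per
-- filter comparison) against one call per joining row.
select
        sum(ct1 * ct2)          filter_comparisons,
        2 * sum(ct1 * ct2)      calls_as_observed,
        sum(ct1 + ct2)          calls_if_once_per_row
from
        (select owner, count(object_name) ct1 from tab1 group by owner) v1,
        (select owner, count(object_name) ct2 from tab2 group by owner) v2
where
        v2.owner = v1.owner
/
```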

Damage Limitation

For my silly little query that went from 0.39 seconds to 2.70 seconds you might decide there’s no point in trying to improve things – in fact many of the sites I’ve visited probably wouldn’t even notice the CPU wastage (on one call); but when the query runs for 2,970 seconds and a little fiddling around shows that it could run in 1,100 seconds you might be inclined to see if there’s something you could do to improve things.

Andrew Sayer suggested the possibility of rewriting the query with a pair of CTEs (“with” subqueries) that were forced to materialize the trim() in the CTE. The cost of physically creating the two large GTTs might well be much less than the CPU spent on the trim()ed join.
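A minimal, untested sketch of that idea against the demo tables might look like the following. The CTE names and the object_name_trim alias are invented for illustration, and the materialize hint is undocumented (though long established), so treat this as a pattern to experiment with rather than a guaranteed fix.

```sql
-- Sketch only: trim each value once while materializing the CTEs,
-- so the hash join filter compares pre-trimmed columns.
with
t1_trim as (
        select  /*+ materialize */
                owner, trim(object_name) object_name_trim
        from    tab1
),
t2_trim as (
        select  /*+ materialize */
                owner, trim(object_name) object_name_trim
        from    tab2
)
select
        count(*)
from
        t1_trim
inner join
        t2_trim
on
        t1_trim.owner = t2_trim.owner
and     t1_trim.object_name_trim > t2_trim.object_name_trim
;
```

The filter predicate still runs once for every pair of rows that match on owner, but it should then be a simple column comparison with no function call on either side.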

Alternatively – and dependent on the ownership and quality of the application – you could write a check constraint on each table to ensure that the column value was always equal to the trim() of the column value.
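A sketch of such a constraint on the demo table might look like this (the constraint name is hypothetical, and tab2 would need the same treatment). The extra “is not null” test is an assumption about what you would want: trim() of an all-spaces string returns null, and a check constraint rejects a row only when its condition evaluates to FALSE, so the simple equality test on its own would let such a string through.

```sql
-- Sketch only: guarantee that the stored value is already trimmed.
-- The statement will fail if existing rows violate the condition,
-- so the data may need to be cleaned first.
alter table tab1 add constraint tab1_ck_name_trimmed
        check (
                object_name is null
             or (trim(object_name) is not null and object_name = trim(object_name))
        )
;
```

With constraints like this in place on both tables the join predicate could compare the columns directly, with no trim() on either side.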

A similar option would be to add an (invisible) column to each table and use a trigger to populate the column with the trimmed value and then use the trimmed column in the query.
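An untested sketch of this approach on the demo table follows; the column name, the trigger name and the one-off update to backfill existing rows are all my own assumptions, and invisible columns need 12c or later. The same steps would be needed on tab2.

```sql
-- Sketch only: hold a pre-trimmed copy of the column, maintained by trigger.
alter table tab1 add (object_name_trim varchar2(128) invisible);

create or replace trigger tab1_trim_biu
before insert or update of object_name on tab1
for each row
begin
        :new.object_name_trim := trim(:new.object_name);
end;
/

-- One-off backfill for rows that existed before the trigger was created.
update tab1 set object_name_trim = trim(object_name);
commit;
```

The query would then reference tab1.object_name_trim and tab2.object_name_trim explicitly in the inequality, since invisible columns appear only when named.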

Conclusion

I don’t think that anything I’ve done or described in this note could be called rocket science (or telescope science as, perhaps, it should be in honour of Webb); but it has shown how much insight you can gain into what Oracle is doing and how you may be able to pin-point excess work using a few simple mechanisms that have been around for more than 15 years.
