Oracle Scratchpad

April 2, 2024

Index Usage – 3

Filed under: Indexing,Oracle,Troubleshooting,Tuning — Jonathan Lewis @ 4:34 pm BST Apr 2,2024

In the second part of this series I described some of the technicalities of Index Usage Tracking and showed an example of what I was doing to test the feature. In this episode I’ll describe some of the index access methods I’ve tested and report the results. I’ve listed the tests I’ve planned so far and will show results as I find time to run the tests – if you can think of more cases add them in the comments and I’ll extend the list. (If you think a test is a very good idea, “upvote” it in the comments and I’ll try to run it sooner.)

Before I’ve done all the tests I’ll add a section on Conclusions and Recommendations. This will be addressing the questions: “Is the feature worth using?” and “What’s the best way to use it”. This section of the note may develop over time as special cases or boundary conditions show up.

Setup (recap from pt.2)

I’ll be using three sessions to do my testing:

  • Session 1 logged in as the test user to execute test statements and query dba_index_usage.
  • Session 2 logged in as sys in the same pdb to query x$keiut and x$keiut_info/v$index_usage_info (real time summaries)
  • Session 3 logged in as the oracle s/w owner and connected as sys in the cdb$root to call keiutflush to make mmon flush x$keiut to disk. The session may have to “alter session set container = {the test pdb}” to flush the right x$keiut.

Each test then follows the same sequence of steps:

  • CDB SYS session
    • call keiutFlush twice to clear all active elements from x$keiut_info/v$index_usage_info (check contents)
  • PDB SYS session
    • check x$keiut is empty
  • User session:
    • check dba_index_usage for targeted index(es)
    • execute some statements forcing use of index and check execution plan
  • PDB sys session:
    • check contents of x$keiut and x$keiut_info/v$index_usage_info
  • CDB sys session
    • call keiutFlush
  • End user session
    • Check contents of dba_index_usage (report changes)

Presentation

For each test I’ll show (in collapsed form) the SQL I used to define the objects being tested, and describe the purpose and method of the test. Then I’ll simply show the critical changes in dba_index_usage and, in some cases, x$keiut_info/v$index_usage_info, x$keiut as a result of the test.

Initial list of tests

  • Gathering index stats
  • “Foreign Key” indexes and primary key indexes during referential integrity
  • Single table, single column index, multiple statements
    • Unique index (equality and range predicates)
    • non-unique index covering unique constraint
    • non-unique index generally
  • Inlist iterator and union all.
  • Nested loop joins
  • Single table, Multi-column index (access/filter, and skip scan)
  • Index-only query (range scan, full scan, fast full scan, index joins)
  • Bitmap access (with multiple bitmaps per key value)
    • Bitmap and / or / minus
  • IOT accessed by primary key
  • IOT accessed by secondary key – ensuring “misses on guesses”
  • DML – single table access
  • DML – with subquery / join methods embedded
  • Locally partitioned index (single table partition, multiple table partition)
  • Globally partitioned index

Test Data

I’ve used the following script to generate common data for many of the initial tests in the list above – I may extend this script as I complete more and more of the tests, though I may also introduce separate scripts for some specific cases:

rem
rem     Script:         iut_02.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2024
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem     To be tested
rem             23.3.0.0
rem             12.2.0.1
rem
rem     Notes:
rem     Create some test data to demonstrate index usage tracking
rem

create table t1 as
select
        rownum - 1                              id,
        mod(rownum-1,10000)                     n1,
        trunc((rownum - 1)/20)                  col1,
        trunc((rownum - 1)/10)                  col2,
        rownum - 1                              col3,
        round(100 * dbms_random.normal())       rand,
        cast(rpad(rownum,25) as varchar2(25))   v1,
        cast(rpad('x',80,'x') as varchar2(80))  padding
from
        all_objects
where
        rownum <= 50000
/

create table t2 as select * from t1;

-- single column primary key index (unique)

alter table t1 add constraint t1_pk primary key (id) 
        using index (
        create unique index t1_pk on t1(id)
        );

create index t1_col3 on t1(col3);
alter table t1 add constraint t1_uk unique (col3);

create index t1_3col on t1(col1, col2, col3) compress 2;
create index t1_rand on t1(rand);

-- multi-column primary key index (unique)

alter table t2 add constraint t2_pk primary key (col1, id) 
        using index (
        create unique index t2_pk on t2(col1, id) compress
        );

-- "foreign key index" (one to many)

alter table t2 add constraint t2_fk_t1 foreign key (col3) references t1(id);
create index t2_col3 on t2(col3);

Gathering Stats

The SQL used to gather index stats from any of the calls to dbms_stats.gather_xxx_stats() does a simple select statement that is hinted to access the index. In my case the indexes were all fairly small – smaller than the critical number of blocks that trigger sampling methods – so Oracle examined every block and row in the index, leading to (a check in) index usage stats looking like the following t1 primary key example:

OBJECT_ID                      : 209180
NAME                           : T1_PK
OWNER                          : TEST_USER
TOTAL_ACCESS_COUNT             : 1
TOTAL_EXEC_COUNT               : 1
TOTAL_ROWS_RETURNED            : 50000
BUCKET_1000_PLUS_ACCESS_COUNT  : 1
BUCKET_1000_PLUS_ROWS_RETURNED : 50000
LAST_USED                      : 01-apr-2024 13:48:51

So gathering stats does flag the index as used – but with the enhanced tracking it’s going to be relatively easy to spot cases where a small number of accesses account for a large number of rows – hinting that these accesses may be only for stats collection.

If you’ve read the linked article you’ll see how Oracle’s sampling strategy has changed in recent years, so a check that will be relevant to some of your indexes is whether or not the average rows returned is equivalent to roughly 1,140 leaf blocks. Other simple checks that might be appropriate are: “is the very large access happening once per day/week” in line with your expected stats collection strategy.
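As a rough illustration of that check, the sketch below compares the average rows returned per access with what a stats-gathering scan might be expected to return: every index entry for a small index, or roughly 1,140 leaf blocks' worth for a large one. The function name, the 10% tolerance, and the leaf block counts in the examples are all hypothetical; the access counts and rows returned would come from dba_index_usage and the num_rows / leaf_blocks figures from the index statistics.

```python
def looks_like_stats_gathering(access_count, rows_returned, num_rows, leaf_blocks,
                               sample_blocks=1140, tolerance=0.10):
    """Guess whether the accesses in the 1000+ bucket are stats-gathering scans.

    sample_blocks is the approximate leaf block count mentioned in the text;
    tolerance is an arbitrary (hypothetical) margin of error.
    """
    if access_count == 0 or leaf_blocks == 0:
        return False
    avg_rows = rows_returned / access_count
    rows_per_leaf = num_rows / leaf_blocks
    # A small index is read completely, a large one is sampled.
    expected = min(num_rows, sample_blocks * rows_per_leaf)
    return abs(avg_rows - expected) <= tolerance * expected


# t1_pk from the example above: 1 access returning 50,000 rows.
# The leaf block count (105) is a made-up figure for illustration.
small_index = looks_like_stats_gathering(1, 50000, 50000, 105)

# The same index with 400 accesses averaging 1,500 rows each (invented numbers).
busy_index = looks_like_stats_gathering(400, 600000, 50000, 105)

print("small index, one big access   :", small_index)
print("same index, many 1,500-row hits:", busy_index)
```

A check like this only narrows the field: a genuine full scan of a small index by application code would produce exactly the same numbers, so the timing of the accesses still has to be compared with the stats collection schedule.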

Of course, if large access counts are happening fairly frequently you can raise the question – does this look like an appropriate result for the table or index, or does it hint at an index that is used when it should not be, or an index that needs to be refined (e.g. by the addition of extra columns to allow better elimination before visiting the table)?

Referential Integrity (and DML)

With referential integrity in place Oracle will (in the most commonly used setup)

  • check that matching child rows do not exist when you try to delete a parent or modify its key value – will either operation flag a supporting “foreign key” index as used?
  • check that the parent row exists if you try to insert a child row or update a child row to associate it with a different parent value – will either operation flag the parent primary key index as used?

In my test script t1 is the parent and t2 is the child. The referential integrity is from t2.col3 to t1.id and the relevant indexes are t2_col3 (non-unique) and t1_pk (unique) respectively. Test statements are:

-- Delete parent when child exists
-- Delete parent when child deleted
-- --------------------------------

delete from t1 where id = 17000;
-- ORA-02292: integrity constraint (TEST_USER.T2_FK_T1) violated - child record found

delete from t2 where col3 = 17000;
delete from t1 where id   = 17000;

commit;

-- Insert into child without parent
-- Insert into child with    parent
-- --------------------------------

insert into t2 values(60000, 1,1,1,60000,1,'x','x');
-- ORA-02291: integrity constraint (TEST_USER.T2_FK_T1) violated - parent key not found

insert into t2 values (18000,1,1,1,15000,1,'x','x');

commit;

-- update parent to "abandon" child
-- --------------------------------

update t1 set id = 60000 where id = 25000;
-- ORA-02292: integrity constraint (TEST_USER.T2_FK_T1) violated - child record found

-- update child to change to non-existent parent
-- update child to change to pre-existing parent
-- ---------------------------------------------

update t2 set col3 = 60000 where id = 25000;
-- ORA-02291: integrity constraint (TEST_USER.T2_FK_T1) violated - parent key not found

update t2 set col3 = 30000 where id = 25000;

commit;

After calling keiutFlush and checking that there were no active elements in x$keiut_info/v$index_usage_info, and no rows in x$keiut, I executed all the statements above one after the other (some failed, of course, with their errors shown above). Nothing was captured in x$keiut.

Apart from the implications of “foreign key” indexes not being flagged as used during referential integrity checks, the tests above also show us that updates and deletes driven by index access do not show the driving index flagged as used: t1.id = constant, t2.col3 = constant, t2.id = constant (which used an index skip scan on t2_pk).

Single table, single column index

This set of tests is close to a repeat of the first demonstration in part 2. Here’s the list of statements aimed at index t1_pk. Note that t1_pk(id) is a unique index on a primary key constraint, t1_col3(col3) is a non-unique index covering a unique constraint (t1_uk), and t2_col3(col3) is a single column non-unique index.

--
--      Out of range, no rows
--
select v1 from t1 where id = -1;
select v1 from t1 where id = 60001;

--
--      Single row
--
select v1 from t1 where id = 1000;
select v1 from t1 where id = 2000;

--
--      multiple rows: 5, 50, 500, 1500
--
select max(v1) from t1 where id between 11 and 15;
select max(v1) from t1 where id between 101 and 150;
select max(v1) from t1 where id between 1001 and 1500;
select max(v1) from t1 where id between 10001 and 11500;

Here are the results after the sequence: call keiutFlush, execute test SQL, report x$keiut, call keiutFlush, report change in dba_index_usage:

SQL> select objname, num_starts, num_execs, rows_returned from x$keiut;

OBJNAME                          NUM_STARTS  NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_PK                           8          8          2057


SQL> host expand temp1.lst
OBJECT_ID                      : 209180
NAME                           : T1_PK
OWNER                          : TEST_USER
TOTAL_ACCESS_COUNT             : 8
TOTAL_EXEC_COUNT               : 8
TOTAL_ROWS_RETURNED            : 2057
BUCKET_0_ACCESS_COUNT          : 2
BUCKET_1_ACCESS_COUNT          : 2
BUCKET_2_10_ACCESS_COUNT       : 1
BUCKET_2_10_ROWS_RETURNED      : 5
BUCKET_11_100_ACCESS_COUNT     : 1
BUCKET_11_100_ROWS_RETURNED    : 50
BUCKET_101_1000_ACCESS_COUNT   : 1
BUCKET_101_1000_ROWS_RETURNED  : 500
BUCKET_1000_PLUS_ACCESS_COUNT  : 1
BUCKET_1000_PLUS_ROWS_RETURNED : 1500
LAST_USED                      : 02-apr-2024 15:29:06

You’ll note that I executed 8 select statements, and expected a total of 2,057 rows (index rowids) being passed to the table access operation, and the changes in stats shown in dba_index_usage are an exact match for the predictions above the listed SQL statements.
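The arithmetic behind that match can be sketched as a small model: take the number of rows each of the eight statements was expected to return, drop each access into the bucket named by the dba_index_usage columns, and total the results. This is only a model of the reported figures, not a description of Oracle's internal code, and the treatment of an access returning exactly 1,000 rows is an assumption.

```python
# Bucket names follow the dba_index_usage columns shown above. Treating an
# access that returns exactly 1,000 rows as part of the 101_1000 bucket is an
# assumption; the tests in this note do not probe that boundary.
BUCKET_LIMITS = [("0", 0), ("1", 1), ("2_10", 10), ("11_100", 100), ("101_1000", 1000)]


def bucket_for(rows):
    """Return the bucket name for a single access that returned `rows` rowids."""
    for name, upper in BUCKET_LIMITS:
        if rows <= upper:
            return name
    return "1000_PLUS"


def summarise(rows_per_access):
    """Accumulate access counts and rows returned per bucket."""
    access_count = {}
    rows_returned = {}
    for rows in rows_per_access:
        name = bucket_for(rows)
        access_count[name] = access_count.get(name, 0) + 1
        rows_returned[name] = rows_returned.get(name, 0) + rows
    return access_count, rows_returned


# Rows expected from the eight test statements, in the order they were run.
predicted = [0, 0, 1, 1, 5, 50, 500, 1500]
access_count, rows_returned = summarise(predicted)

print("TOTAL_ACCESS_COUNT :", len(predicted))
print("TOTAL_ROWS_RETURNED:", sum(predicted))
for name in access_count:
    print(f"BUCKET_{name}: accesses={access_count[name]} rows={rows_returned[name]}")
```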

Since t1.col3 is an exact match of t1.id, and since t2 is a duplicate of t1, it seems likely that tests that start by cloning the SQL and changing the column or table name as appropriate would give the matching results – and they do, so I won’t bother to print them all up.

There is one final test of a single column index before I move on to simple queries targeting a multi-column index. Here’s the statement I want to test to confirm an important point:

select v1 from t1 where id between 101 and 150 and mod(id,10) = 0;

This query will scan through 50 index entries, discarding all but 5 of them, returning 5 rows from the table. The key question is this – will dba_index_usage report 50 rows accessed or 5 rows accessed? Here’s what the changes in stats looked like after the test:

TOTAL_ACCESS_COUNT             : 1
TOTAL_EXEC_COUNT               : 1
TOTAL_ROWS_RETURNED            : 5
BUCKET_0_ACCESS_COUNT          : 0
BUCKET_1_ACCESS_COUNT          : 0
BUCKET_2_10_ACCESS_COUNT       : 1
BUCKET_2_10_ROWS_RETURNED      : 5
BUCKET_11_100_ACCESS_COUNT     : 0
BUCKET_11_100_ROWS_RETURNED    : 0
BUCKET_101_1000_ACCESS_COUNT   : 0
BUCKET_101_1000_ROWS_RETURNED  : 0
BUCKET_1000_PLUS_ACCESS_COUNT  : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0

This is a very important point: the stats in dba_index_usage do not tell us how many rows (index entries) we visited in the index, they tell us how many rowids (or, possibly, key values) survived to be passed to the parent operation (typically the table access). So you might look at some stats that say: “25,000 executions, all in the 2 to 10 range – good index” when you’ve got a badly designed index that does 90% of the total work of a query and discards 2,000 index entries for each rowid it uses to access a table.
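To make the distinction concrete, here is a toy model of the range scan in the test query, with a sorted Python list standing in for the index. It counts the entries visited separately from the entries that survive the filter and are passed to the parent operation. The function and variable names are invented for the illustration; only the second figure corresponds to what dba_index_usage reported.

```python
from bisect import bisect_left, bisect_right


def range_scan(sorted_keys, low, high, keep):
    """Toy model of an index range scan with a filter predicate.

    Returns (entries_visited, keys_passed_to_parent).
    """
    start = bisect_left(sorted_keys, low)     # first entry >= low
    stop = bisect_right(sorted_keys, high)    # one past the last entry <= high
    visited = sorted_keys[start:stop]
    passed_up = [key for key in visited if keep(key)]
    return len(visited), passed_up


# t1.id holds the values 0 .. 49,999 as created by the test script.
t1_pk = list(range(50000))

# id between 101 and 150 and mod(id,10) = 0
visited, passed_up = range_scan(t1_pk, 101, 150, lambda key: key % 10 == 0)

print("index entries visited     :", visited)
print("rowids passed to the table:", len(passed_up))
print("entries visited per rowid :", visited / len(passed_up))
```

The ratio in the last line is the figure you would need in order to judge the efficiency of the index, and it is exactly the figure that the usage statistics do not supply.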

Inlist Iterators, Union All and multiple references

Here’s a sample query (with its result set, and actual execution plan pulled from memory) accessing the t1 table through the index on column rand. If you try to repeat this example it probably won’t give you exactly the same results because I used Oracle’s random number generator to generate a normal distribution of integer values (with mean zero and standard deviation of 100), but there’s a reasonable chance that you’ll see similar numbers in your output as I’ve been careful to pick three values that should return significantly different numbers of rows:

select
        rand, count(*) 
from    t1 
where   rand in (100, 200, 300)
and     v1 is not null
group by 
        rand 
order by 
        count(*)
/


      RAND   COUNT(*)
---------- ----------
       300          2
       200         25
       100        114

3 rows selected.

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------
SQL_ID  fbwmnax39jqtq, child number 0
-------------------------------------
select         rand, count(*) from    t1 where   rand in (100, 200,
300) and     v1 is not null group by         rand order by
count(*)

Plan hash value: 874747030

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |         |      1 |        |      3 |00:00:00.01 |     143 |       |       |          |
|   1 |  SORT ORDER BY                 |         |      1 |      3 |      3 |00:00:00.01 |     143 |  2048 |  2048 | 2048  (0)|
|   2 |   SORT GROUP BY NOSORT         |         |      1 |      3 |      3 |00:00:00.01 |     143 |       |       |          |
|   3 |    INLIST ITERATOR             |         |      1 |        |    141 |00:00:00.01 |     143 |       |       |          |
|*  4 |     TABLE ACCESS BY INDEX ROWID| T1      |      3 |    220 |    141 |00:00:00.01 |     143 |       |       |          |
|*  5 |      INDEX RANGE SCAN          | T1_RAND |      3 |    220 |    141 |00:00:00.01 |       5 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter("V1" IS NOT NULL)
   5 - access(("RAND"=100 OR "RAND"=200 OR "RAND"=300))

This plan is probably the simplest demonstration of the difference between total_exec_count and total_access_count. Here’s the output from x$keiut after this test, followed by the details available from dba_index_usage after calling keiutFlush:

OBJNAME                          NUM_STARTS  NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_RAND                         3          1           141

Given that x$keiut has reported three “starts”, and we can see that the plan shows three “Starts” and “A-rows” is 141, and the output shows three rows totalling 141 we might hope that dba_index_usage is going to show us exactly what happened with this query. Here are the results:

SQL> execute print_table(q'(select * from dba_index_usage where name = 'T1_RAND')')
OBJECT_ID                      : 209738
NAME                           : T1_RAND
OWNER                          : TEST_USER
TOTAL_ACCESS_COUNT             : 3
TOTAL_EXEC_COUNT               : 1
TOTAL_ROWS_RETURNED            : 141
BUCKET_0_ACCESS_COUNT          : 0
BUCKET_1_ACCESS_COUNT          : 0
BUCKET_2_10_ACCESS_COUNT       : 0
BUCKET_2_10_ROWS_RETURNED      : 0
BUCKET_11_100_ACCESS_COUNT     : 3
BUCKET_11_100_ROWS_RETURNED    : 141
BUCKET_101_1000_ACCESS_COUNT   : 0
BUCKET_101_1000_ROWS_RETURNED  : 0
BUCKET_1000_PLUS_ACCESS_COUNT  : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0
LAST_USED                      : 02-apr-2024 19:02:03

The three “starts” from x$keiut show up in the total_access_count but the three separate accesses have been accumulated and averaged to appear as 3 accesses in the 11-100 range. (141 / 3 = 47). So there’s some loss of precision in the information. It’s better than just the old “yes/no” report, and in many cases it’s likely to give you numbers that are close enough to the right ballpark to be useful, but there are likely to be some cases where the more detailed information would be more beneficial. It’s worth mentioning that the total_access_count (hence the averaging) will include “Starts” where no rows were returned.
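The loss of precision is easy to see if the two possible ways of allocating the three starts are put side by side. The sketch below is a model of the arithmetic only: it assumes the average is truncated to an integer before the bucket is chosen, and the function names are invented for the example.

```python
BUCKET_LIMITS = [("0", 0), ("1", 1), ("2_10", 10), ("11_100", 100), ("101_1000", 1000)]


def bucket_for(rows):
    """Bucket name for a given number of rows (1,000 assumed to fall in 101_1000)."""
    for name, upper in BUCKET_LIMITS:
        if rows <= upper:
            return name
    return "1000_PLUS"


def averaged(rows_per_start):
    """All starts go to one bucket chosen by the (assumed truncated) average."""
    starts = len(rows_per_start)
    total = sum(rows_per_start)
    return {bucket_for(total // starts): (starts, total)}


def per_access(rows_per_start):
    """Each start goes to the bucket that matches its own row count."""
    result = {}
    for rows in rows_per_start:
        name = bucket_for(rows)
        count, total = result.get(name, (0, 0))
        result[name] = (count + 1, total + rows)
    return result


# Rows returned for rand = 100, 200 and 300 in the query above.
inlist_rows = [114, 25, 2]

print("averaged  :", averaged(inlist_rows))    # what dba_index_usage reported
print("per access:", per_access(inlist_rows))  # what actually happened

# A start that finds no rows still counts towards the average.
print("with misses:", averaged([141, 0, 0]))
```

Each value in the output is a pair of (access count, rows returned) for the named bucket.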

Once upon a time, of course, a query involving an inlist might have been rewritten as a query with union all, such as:

select
        rand, count(v1)
from    t1
where   rand = 100
group by
        rand
union all
select
        rand, count(v1)
from    t1
where   rand = 200
group by
        rand
union all
select
        rand, count(v1)
from    t1
where   rand = 300
group by
        rand
;

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |      1 |        |      3 |00:00:00.01 |     144 |
|   1 |  UNION-ALL                    |         |      1 |        |      3 |00:00:00.01 |     144 |
|   2 |   SORT GROUP BY NOSORT        |         |      1 |    128 |      1 |00:00:00.01 |     113 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T1      |      1 |    128 |    114 |00:00:00.01 |     113 |
|*  4 |     INDEX RANGE SCAN          | T1_RAND |      1 |    128 |    114 |00:00:00.01 |       2 |
|   5 |   SORT GROUP BY NOSORT        |         |      1 |     61 |      1 |00:00:00.01 |      27 |
|   6 |    TABLE ACCESS BY INDEX ROWID| T1      |      1 |     61 |     25 |00:00:00.01 |      27 |
|*  7 |     INDEX RANGE SCAN          | T1_RAND |      1 |     61 |     25 |00:00:00.01 |       2 |
|   8 |   SORT GROUP BY NOSORT        |         |      1 |     61 |      1 |00:00:00.01 |       4 |
|   9 |    TABLE ACCESS BY INDEX ROWID| T1      |      1 |     61 |      2 |00:00:00.01 |       4 |
|* 10 |     INDEX RANGE SCAN          | T1_RAND |      1 |     61 |      2 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("RAND"=100)
   7 - access("RAND"=200)
  10 - access("RAND"=300)

Does this make any difference to the way that Oracle reports the executions, starts, and rows returned? Here’s the change in dba_index_usage due to this query:

NAME                           : T1_RAND
TOTAL_ACCESS_COUNT             : 3
TOTAL_EXEC_COUNT               : 3
TOTAL_ROWS_RETURNED            : 141
BUCKET_0_ACCESS_COUNT          : 0
BUCKET_1_ACCESS_COUNT          : 0
BUCKET_2_10_ACCESS_COUNT       : 1
BUCKET_2_10_ROWS_RETURNED      : 2
BUCKET_11_100_ACCESS_COUNT     : 1
BUCKET_11_100_ROWS_RETURNED    : 25
BUCKET_101_1000_ACCESS_COUNT   : 1
BUCKET_101_1000_ROWS_RETURNED  : 114
BUCKET_1000_PLUS_ACCESS_COUNT  : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0

In this case three separate accesses through the t1_rand index have been recorded individually in the correct buckets – with one interesting anomaly: the total execution count is 3 rather than 1. This suggests that the “total_exec_count” is a little more complicated than I originally thought. For each execution of a single query the “total_exec_count” seems to be the “total_operation_count” and possibly, for “conditional SQL”, the count of operations that were actually called.

Nested Loops

After seeing how Oracle handled the inlist iterator, the obvious next step was to repeat the experiment but use a table to hold the values I’d selected for the inlist and force a nested loop join. Here’s the table creation followed by the query and its execution plan:

create table t3 (n1 number);

insert into t3 values(100);
insert into t3 values(200);
insert into t3 values(300);
commit;

execute dbms_stats.gather_table_stats(user,'t3')

alter table t3 add constraint t3_uk unique(n1);

select
        /*+
                leading(t3 t1)
                use_nl(t1)
                index(t1(rand))
        */
        t3.n1,
        max(t1.v1)
from
        t3, t1
where
        t1.rand = t3.n1
group by
        t3.n1
/


PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------
SQL_ID  7mxmx715yfb73, child number 0
-------------------------------------
select  /*+   leading(t3 t1)   use_nl(t1)   index(t1(rand))  */  t3.n1,
 max(t1.v1) from  t3, t1 where  t1.rand = t3.n1 group by  t3.n1

Plan hash value: 3159923631

---------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |         |      1 |        |      3 |00:00:00.01 |     147 |
|   1 |  SORT GROUP BY NOSORT         |         |      1 |      3 |      3 |00:00:00.01 |     147 |
|   2 |   NESTED LOOPS                |         |      1 |    220 |    141 |00:00:00.01 |     147 |
|   3 |    NESTED LOOPS               |         |      1 |    220 |    141 |00:00:00.01 |       9 |
|   4 |     INDEX FULL SCAN           | T3_UK   |      1 |      3 |      3 |00:00:00.01 |       2 |
|*  5 |     INDEX RANGE SCAN          | T1_RAND |      3 |     73 |    141 |00:00:00.01 |       7 |
|   6 |    TABLE ACCESS BY INDEX ROWID| T1      |    141 |     73 |    141 |00:00:00.01 |     138 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("T1"."RAND"="T3"."N1")

As you can see, the plan involves two indexes, and (after the usual start of calling keiutFlush) executing the query ends up with x$keiut reporting two rows:

OBJNAME                          NUM_STARTS  NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_RAND                         3          1           141
TEST_USER.T3_UK                           1          1             3

The num_starts and rows_returned happily agree with the Starts and A-rows values in the plan and after a second call to keiutFlush the changes in dba_index_usage are as follows:

NAME                           : T1_RAND
TOTAL_ACCESS_COUNT             : 3
TOTAL_EXEC_COUNT               : 1
TOTAL_ROWS_RETURNED            : 141
BUCKET_0_ACCESS_COUNT          : 0
BUCKET_1_ACCESS_COUNT          : 0
BUCKET_2_10_ACCESS_COUNT       : 0
BUCKET_2_10_ROWS_RETURNED      : 0
BUCKET_11_100_ACCESS_COUNT     : 3
BUCKET_11_100_ROWS_RETURNED    : 141
BUCKET_101_1000_ACCESS_COUNT   : 0
BUCKET_101_1000_ROWS_RETURNED  : 0
BUCKET_1000_PLUS_ACCESS_COUNT  : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0

NAME                           : T3_UK
TOTAL_ACCESS_COUNT             : 1
TOTAL_EXEC_COUNT               : 1
TOTAL_ROWS_RETURNED            : 3
BUCKET_0_ACCESS_COUNT          : 0
BUCKET_1_ACCESS_COUNT          : 0
BUCKET_2_10_ACCESS_COUNT       : 1
BUCKET_2_10_ROWS_RETURNED      : 3
BUCKET_11_100_ACCESS_COUNT     : 0
BUCKET_11_100_ROWS_RETURNED    : 0
BUCKET_101_1000_ACCESS_COUNT   : 0
BUCKET_101_1000_ROWS_RETURNED  : 0
BUCKET_1000_PLUS_ACCESS_COUNT  : 0
BUCKET_1000_PLUS_ROWS_RETURNED : 0

Again we see that the figures agree with the execution plan but, just as we saw with the inlist iterator, the level of detail reported for rows returned does not go down to the individual accesses; it simply averages (for the execution) the total rows returned across the total access count to assign the work done in index t1_rand to the 11 to 100 bucket.

Prompted by the reporting of the union all query in the previous section it’s worth looking at a slightly more complex join to clarify what an “exec” is. Using the same t1 and t2 tables, here’s a suitably silly query:

select
        count(t2.v1)
from
        t2,
        t1 t1a,
        t1 t1b
where
        t1a.id = t2.n1
and     t1b.id = 10 * t2.n1
/

SQL_ID  917pamcvnmtsk, child number 0
-------------------------------------
select  count(t2.v1) from  t2,  t1 t1a,  t1 t1b where  t1a.id = t2.n1
and t1b.id = 10 * t2.n1

Plan hash value: 1623429891

----------------------------------------------------------------------------------------
| Id  | Operation            | Name  | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |       |      1 |        |      1 |00:00:00.24 |    5306 |
|   1 |  SORT AGGREGATE      |       |      1 |      1 |      1 |00:00:00.24 |    5306 |
|   2 |   NESTED LOOPS       |       |      1 |  50000 |  25000 |00:00:00.24 |    5306 |
|   3 |    NESTED LOOPS      |       |      1 |  50000 |  50000 |00:00:00.13 |    1662 |
|   4 |     TABLE ACCESS FULL| T2    |      1 |  50000 |  50000 |00:00:00.02 |     960 |
|*  5 |     INDEX UNIQUE SCAN| T1_PK |  50000 |      1 |  50000 |00:00:00.07 |     702 |
|*  6 |    INDEX UNIQUE SCAN | T1_PK |  50000 |      1 |  25000 |00:00:00.08 |    3644 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("T1A"."ID"="T2"."N1")
   6 - access("T1B"."ID"=10*"T2"."N1")

You’ll notice that there are two “index unique scan” operations, both starting 50,000 times; and that the second operation returns a row in only 25,000 of those starts thanks to the definition of n1 and the “10 x n1” in my predicate. Here’s the corresponding report from x$keiut.

OBJNAME                          NUM_STARTS  NUM_EXECS ROWS_RETURNED
-------------------------------- ---------- ---------- -------------
TEST_USER.T1_PK                      100000          2         75000

We see num_execs = 2 – i.e. one for each time the index appears in an operation in the plan. Then, as expected, num_starts and rows_returned match the sum of the two Starts and A-Rows. But here’s the change in dba_index_usage – spot the oddity:

NAME                           : T1_PK
OWNER                          : TEST_USER
TOTAL_ACCESS_COUNT             : 100000
TOTAL_EXEC_COUNT               : 2
TOTAL_ROWS_RETURNED            : 75000
BUCKET_0_ACCESS_COUNT          : 50000
BUCKET_1_ACCESS_COUNT          : 50000

Oracle reports the “execute count” of 2; and the total rows returned as 75000 – but it seems that that 75,000 is made up of 50,000 accesses that returned 1 row (which we know happened), and 50,000 accesses that returned 0 rows (which is not the way things went).

It’s going to be very easy for me to fool myself that I know what Oracle is doing here – it’s going to take several more experiments to get a good idea of what information Oracle is recording and when and how it’s accumulating it (or not) between calls to keiutFlush.

In this simple case where we’ve executed one query with two operations it would be easy to assume that Oracle has identified and summarised the two operations separately, introducing some simple rounding errors.

  • The first operation starts 50,000 times and returns 50,000 rows so, on average, it should be counted in the “1 access” bucket 50,000 times.
  • The second operation starts 50,000 times and returns 25,000 rows so, on average and with rounding (down), that’s 0 rows per start. Hence a count of 50,000 in the “0 access” bucket.

It seems like a viable explanation – but we might want to fiddle with slightly different queries, and do some tests of multiple executions of the same query (while the data changes), and some tests with slightly different but similarly structured queries, to see if Oracle is accumulating on a “per query execution basis”, or doing something more complex, or something more naive.
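The hypothesis can be checked against one obvious alternative with a few lines of arithmetic. The sketch below allocates the starts once per plan operation and once per query, in both cases assuming that the average rows per start is truncated before the bucket is chosen. The function names are invented, and this models the reported numbers rather than anything known about Oracle's implementation.

```python
BUCKET_LIMITS = [("0", 0), ("1", 1), ("2_10", 10), ("11_100", 100), ("101_1000", 1000)]


def bucket_for(rows):
    """Bucket name for a given number of rows (1,000 assumed to fall in 101_1000)."""
    for name, upper in BUCKET_LIMITS:
        if rows <= upper:
            return name
    return "1000_PLUS"


def allocate_per_operation(operations):
    """Average each plan operation separately, then add its starts to one bucket."""
    buckets = {}
    for starts, rows in operations:
        if starts == 0:
            continue                      # operation never started
        name = bucket_for(rows // starts)
        buckets[name] = buckets.get(name, 0) + starts
    return buckets


def allocate_per_query(operations):
    """Average across every operation in the query that used the index."""
    starts = sum(s for s, _ in operations)
    rows = sum(r for _, r in operations)
    return {bucket_for(rows // starts): starts}


# (Starts, A-Rows) for operations 5 and 6 of the plan, both on T1_PK.
t1_pk_operations = [(50000, 50000), (50000, 25000)]

print("per operation:", allocate_per_operation(t1_pk_operations))
print("per query    :", allocate_per_query(t1_pk_operations))
```

Only the per-operation version reproduces the 50,000 / 50,000 split between the 0 bucket and the 1 bucket; averaging over the whole query would have put all 100,000 accesses in the 0 bucket. That is consistent with the hypothesis but, with a single test case, it does not prove it.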

The other thing to note about the way that dba_index_usage has captured the statistics of this report is that it looks as if it depends either on rowsource execution statistics or the SQL Monitor statistics being gathered on every execution – and that could lead to significant overheads (in CPU, most probably).

Conclusions and Suggestions

Calls to gather index stats execute queries that will be captured by index usage tracking, so you need a strategy to help you ignore “BUCKET_1000_PLUS_ROWS_RETURNED” accesses that are about stats collection, while taking action when they are indications of index usage that could be made more efficient.

Indexes used during referential integrity checks are not flagged as used. This is probably not much of a threat for the relevant (primary/unique key) indexes on the parent end since you are unlikely to drop indexes that enforce uniqueness even if they don’t appear to be used; however it does mean that there is some risk of dropping a “foreign key” index that is needed to avoid locking problems.

Indexes used to drive delete and update statements are not captured by the new mechanism – at least for simple “delete from single_table where …” and “update table where …”. This could result in indexes being dropped that have been created to enhance DML performance. (There are outstanding tests for DML using index-driven subqueries or join views to see if indexes used to access other tables in the statement do get flagged.)

The statistics about “rows returned” tell you how many index entries are passed to the parent operation, not about the number of index entries examined; so a frequently used index that consistently reports a very small number of “rows returned” is not necessarily an efficient, well-designed index.

The stats on inlist iterators are a little disappointing: while the number of iterations appears as the total_access_count, the accesses are all allocated to the same bucket on the basis of total_rows_returned / total_access_count and no allowance is made for different values in the list returning significantly different numbers of rows. The same is true of the index accesses for the second table of a nested loop join. On the other hand if an inlist query is rewritten as a union all query then the stats for each branch of the union all are captured separately – this means access counts and rows returned are reported against the correct bucket, but as a side effect the execution count is reported as the number of executed branches that used the index rather than the number of executions of the query.

Tests with slightly more complex queries show that the total_exec_count relates to the number of times an index is named in an execution plan operation, not just the number of times the query is executed; and the slightly more complex queries make it easy to see side effects of the rounding of “total_row_count / total_access_count” (particularly the “0 bucket” and “1 bucket” stats) that make the overall results look strangely inconsistent. In my (current) opinion, though, it’s probably the case that the 0/1 level where the inconsistency is most visible is where it’s also least important.
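As a rough aid to reading these figures, the following query is a minimal sketch that reports the average "rows per access" alongside the 0/1 buckets and the 1000+ bucket for the current user's indexes. It uses only columns of dba_index_usage; the filter on owner and the choice of buckets displayed are my assumptions about what you might want to see, not a recommended report.

```sql
select
        owner, name,
        total_exec_count,
        total_access_count,
        total_rows_returned,
        -- the ratio that appears to drive the bucket allocation
        round(total_rows_returned / nullif(total_access_count, 0), 2)  avg_rows_per_access,
        bucket_0_access_count,
        bucket_1_access_count,
        bucket_1000_plus_access_count,
        bucket_1000_plus_rows_returned,
        last_used
from
        dba_index_usage
where
        owner = user            -- assumption: only the test user's indexes
order by
        owner, name
/
```

Comparing total_exec_count with total_access_count for an index is a quick way to spot inlist iterators and nested loop joins, where a single execution can produce many accesses.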

June 13, 2023

Case Study

Filed under: Oracle,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 12:50 pm BST Jun 13,2023

Here’s a problem that appeared on the Oracle-L list server a little while ago. It features an Exadata system with a “very large” GTT (global temporary table) and a pair of queries that are repeated frequently, using a different literal value on each repetition of the pair.

The queries, with sample execution plans, have been reported using the SQL Monitor feature and are available on github, but I’ll repeat them here, individually, for convenience. The first query is a very simple single-table with aggregation:

SELECT FG.FID,   FG.DCD, FG.MDT,1.00 , SUM (FG.N_CD), SUM (FG.d_nsd), and several more summed columns
    FROM F_GTT FG
   WHERE FG.FID = {literal constant}
GROUP BY FG.FID, FG.DCD, FG.MDT
ORDER BY 1, 3


Global Information
------------------------------
 Status              :  DONE (ALL ROWS)           
 Instance ID         :  3                         
 SQL ID              :  gmm2v6dtvu0yb             
 SQL Execution ID    :  50331648                  
 Execution Started   :  06/01/2023 14:52:21       
 First Refresh Time  :  06/01/2023 14:52:25       
 Last Refresh Time   :  06/01/2023 14:53:11       
 Duration            :  50s                       
 Fetch Calls         :  1                         

Global Stats
==================================================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |  Other   | Fetch | Buffer | Read  | Read  | Uncompressed |  Offload   |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Waits(s) | Calls |  Gets  | Reqs  | Bytes |    Bytes     | Elig Bytes | Returned Bytes |
==================================================================================================================================================
|      50 |      48 |     0.68 |        0.01 |     0.00 |     0.72 |     1 |     3M | 45632 |  22GB |         22GB |       22GB |           22GB |
==================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=3209725901)
==============================================================================================================================================================================
| Id |          Operation           |      Name      |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  | Activity |      Activity Detail      |
|    |                              |                | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |   (%)    |        (# samples)        |
==============================================================================================================================================================================
|  0 | SELECT STATEMENT             |                |         |      |        47 |     +4 |     1 |       31 |       |       |     . |          |                           |
|  1 |   SORT GROUP BY              |                |     217 | 792K |        47 |     +4 |     1 |       31 |       |       | 14336 |          |                           |
|  2 |    TABLE ACCESS STORAGE FULL | F_GTT          |   63836 | 792K |        50 |     +1 |     1 |    37491 | 45632 |  22GB |  14MB |    97.96 | Cpu (46)                  |
|    |                              |                |         |      |           |        |       |          |       |       |       |          | cell smart table scan (2) |
==============================================================================================================================================================================

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - storage("FG"."FID"={literal value})
       filter("FG"."FID"={literal value})
	   

It’s not a difficult query, and it’s easy to see that almost all the time is spent in a tablescan that reads 22GB to identify 37,491 rows, with fractions of a second spent aggregating those rows. And of the time spent in the tablescan almost all of it is CPU time (rather than waits for I/O).

Details to note:

  • We’ve done 45,000 read requests for 22GB of data read, which suggests we’re doing reads of 500KB (rather than the 1MB that you might have expected).
  • We’ve done 3M buffer gets – which is a good match for 22GB of data read with an 8KB block size, so it looks like we’re doing a simple tablescan into buffer cache. (This is a little unexpected since we spend time on “cell smart tablescan” so it would be nice to see the session statistics for one execution so that we could see the underlying reason for these buffer gets).
  • We don’t seem to have saved any data traffic through storage indexes (but that might be because Oracle doesn’t create storage indexes for temporary tablespaces – that’s a detail I’d have to check).
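The arithmetic behind the first two points can be checked with a trivial query. This is only a sketch: it treats the reported 22GB as exact (it is a rounded figure) and assumes the 8KB block size mentioned above.

```sql
select
        round(22 * 1024 * 1024 / 45632)   kb_per_read_request,  -- roughly 500KB
        round(22 * 1024 * 1024 / 8)       blocks_of_8kb         -- a little under 3M
from
        dual
/
```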

It would be nice to know, of course, how many rows there were in the table, and how many distinct values of FID, and how variable the number of rows per FID was across the table. Based on the little we know, though, and assuming that this single execution represents a fairly typical value of FID, the basic question seems to be: would we benefit from defining an appropriate index on the table, though there are several other ways that you could choose to improve the performance if you had a completely free hand and could make whatever simple changes you liked in the surrounding code.

We can note, of course, that with a suitable index we might have done about 37,500 (single block) read requests compared to 45,000 multi-block read requests, and that looks like a small win on I/O but with a potentially big win on CPU. The obvious index is on (FID), defined with “compress 1” but if we created the index on (FID, MDT, DCD) we might get some extra benefit since Oracle could walk the index using a “SORT GROUP BY NOSORT” operation to aggregate and return the data in the right order.

Note: the SQL says: select fid, dcd, mdt … group by fid, dcd, mdt order by 1,3 – but Oracle has assumed that the order by clause can be ignored because of the group by (the optimizer’s allowed to do that, programmers aren’t). This means the optimizer has recognized that it could rewrite the query to group by fid, dcd, mdt order by 1,2 to produce the same result.

Since the initial 37,491 rows for the FID collapsed to 31 rows after aggregation we might find that we could even create the larger index with “compress 3”.
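As a sketch, the two candidate indexes might be declared as follows. The index names are hypothetical, and the compression levels are the ones suggested above rather than values derived from the real data. The statements would have to run while the GTT holds no data, since you can't create an index on a GTT after rows have been inserted.

```sql
-- Option 1: the "obvious" index (hypothetical name)
create index f_gtt_i1 on f_gtt(fid) compress 1;

-- Option 2: the larger index that might allow a SORT GROUP BY NOSORT
-- (hypothetical name; "compress 3" is only sensible if the
-- (fid, mdt, dcd) combination repeats heavily, as the 37,491 -> 31 drop suggests)
create index f_gtt_i2 on f_gtt(fid, mdt, dcd) compress 3;
```

You would normally pick one of the two rather than create both; checking the resulting execution plan for "SORT GROUP BY NOSORT" would tell you whether the larger index is giving the extra benefit.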

Here’s another thought that goes with the idea of creating the perfect index for walking the data and avoiding sorting for every execution of the query: this table is re-populated before the queries are executed so could we populate the table with pre-ordered data so that all the rows for any individual FID are packed into the smallest number of blocks?
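A minimal sketch of the pre-ordered load is shown below. The source name f_source is purely hypothetical (we don't know what feeds the GTT), and the choice of ordering columns simply follows the index suggested earlier.

```sql
insert into f_gtt
select  *
from    f_source                -- hypothetical: whatever currently populates the GTT
order by
        fid, mdt, dcd           -- pack each FID into as few blocks as possible
/
```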

Of course, changes like adding an index and pre-sorting data might have significant side effects in other parts of the code. What we’re trying to do here is find big wins using simple strategies that avoid complicated code changes – we need to be sure that we haven’t introduced side-effects that mean we’ve moved the problem rather than solving it.

We should keep our thoughts about the first query in mind as we move on to the second query. This one is a very simple join and aggregation (via distinct) which features the same very large tablescan:

SELECT  DISTINCT ft.fT_NM, FGTT.DCD, ft.COl5, ft.FT_CID, RGTT.mcg_ind,   C.D_CD
    FROM rt_gtt         RGTT,
         F_GTT          FGTT,
         RD_FTIN ft,
         RDC           c
   WHERE     FGTT.eid = RGTT.nid
         AND FGTT.etyp = RGTT.nety
         AND FGTT.FID = ft.ftid
         AND FGTT.FID = {literal value}
         AND FGTT.DCD = c.DCD
         AND ft.FT_CID = RGTT.mcid
ORDER BY ft.fT_NM, DCD

	   
Global Information
------------------------------
 Status              :  DONE (ALL ROWS)           
 Instance ID         :  3                         
 SQL ID              :  c3y3n73an8b69             
 SQL Execution ID    :  50331648                  
 Execution Started   :  06/01/2023 14:54:39       
 First Refresh Time  :  06/01/2023 14:54:43       
 Last Refresh Time   :  06/01/2023 14:55:17       
 Duration            :  38s                       
 Fetch Calls         :  1                         

Global Stats
==================================================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Cluster  |  Other   | Fetch | Buffer | Read  | Read  | Uncompressed |  Offload   |    Offload     |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Waits(s) | Calls |  Gets  | Reqs  | Bytes |    Bytes     | Elig Bytes | Returned Bytes |
==================================================================================================================================================
|      39 |      37 |     0.57 |        0.01 |     0.00 |     0.59 |     1 |     4M | 45640 |  22GB |         22GB |       22GB |           22GB |
==================================================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=891026795)
========================================================================================================================================================================================
| Id |             Operation              |             Name             |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read  | Read  |  Mem  | Activity | Activity Detail |
|    |                                    |                              | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes | (Max) |   (%)    |   (# samples)   |
========================================================================================================================================================================================
|  0 | SELECT STATEMENT                   |                              |         |      |         1 |    +38 |     1 |        1 |       |       |     . |          |                 |
|  1 |   SORT UNIQUE                      |                              |   19196 |   6M |         1 |    +38 |     1 |        1 |       |       |  2048 |          |                 |
|  2 |    NESTED LOOPS                    |                              |         |      |         1 |    +38 |     1 |      380 |       |       |     . |          |                 |
|  3 |     NESTED LOOPS                   |                              |   19196 |   6M |         1 |    +38 |     1 |      380 |       |       |     . |          |                 |
|  4 |      NESTED LOOPS                  |                              |      94 |   6M |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  5 |       NESTED LOOPS                 |                              |      1M |   4M |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  6 |        VIEW                        | VW_DTP_E7E675E0              |      3M | 815K |         1 |    +38 |     1 |        2 |       |       |     . |          |                 |
|  7 |         HASH UNIQUE                |                              |      3M | 815K |        35 |     +4 |     1 |        2 |       |       |  11MB |          |                 |
|  8 |          TABLE ACCESS STORAGE FULL | F_GTT                        |      3M | 792K |        39 |     +0 |     1 |    27239 | 45638 |  22GB |  14MB |   100.00 | Cpu (38)        |
|  9 |        TABLE ACCESS BY INDEX ROWID | rt_gtt                       |       1 |    2 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 10 |         INDEX RANGE SCAN           | rt_gtt_IX5                   |       1 |    1 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 11 |       TABLE ACCESS BY INDEX ROWID  | RD_FTIN                      |       1 |    3 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 12 |        INDEX RANGE SCAN            | RD_FTIN_IX1                  |       1 |    2 |         1 |    +38 |     2 |        2 |       |       |     . |          |                 |
| 13 |      INDEX RANGE SCAN              | RDC_IX1                      |     206 |    1 |         1 |    +38 |     2 |      380 |     2 | 16384 |     . |          |                 |
| 14 |     TABLE ACCESS BY INDEX ROWID    | RDC                          |     204 |  169 |         1 |    +38 |   380 |      380 |       |       |     . |          |                 |
========================================================================================================================================================================================

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - storage("FGTT"."FID"={literal value})
       filter("FGTT"."FID"={literal value})
   9 - filter("ITEM_3"="RGTT"."nety")
  10 - access("ITEM_4"="RGTT"."nid")
  11 - filter("FT"."FT_CID"="RGTT"."mcid")
  12 - access("ITEM_2"="FT"."ftid")
  13 - access("ITEM_1"="C"."DCD")

Again we have a query that selects all the rows for a given FID from the global temporary table, but this time it joins them to three other tables before eliminating duplicates. Unsurprisingly a very large fraction of the time is spent on the tablescan with the same 45,000 read requests pushing 22GB across the interconnect to the buffer cache, with most of the time recorded as CPU time not I/O wait time.

There’s a very interesting feature of this plan, though. The SQL statement says “join then make distinct”, but the plan reports “make distinct then join” – applying a hash unique operation to the GTT data reducing 27,239 rows to 2 rows before joining the other three tables. (I’ve written about the “group by placement” transformation in the past, this is the equivalent transformation for select distinct.)
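To make the transformation easier to visualise, here is a hand-written query with the same "make distinct then join" shape. It is my guess at an equivalent, not the optimizer's actual internal rewrite, and {literal value} is the same placeholder used in the original statement.

```sql
select  distinct
        ft.ft_nm, fgtt.dcd, ft.col5, ft.ft_cid, rgtt.mcg_ind, c.d_cd
from
        (
        -- reduce the GTT rows for this FID to their distinct join columns first
        select  distinct fid, eid, etyp, dcd
        from    f_gtt
        where   fid = {literal value}
        )               fgtt,
        rt_gtt          rgtt,
        rd_ftin         ft,
        rdc             c
where
        fgtt.eid    = rgtt.nid
and     fgtt.etyp   = rgtt.nety
and     fgtt.fid    = ft.ftid
and     fgtt.dcd    = c.dcd
and     ft.ft_cid   = rgtt.mcid
order by
        ft.ft_nm, fgtt.dcd
/
```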

Note: it’s a little odd that two queries running just one and a half minutes apart from each other should have estimates which are out by a factor of two in one case and 100 in the other for a simple “column = constant” predicate. This might be something to do with “out of range” values, or a side effect of a hybrid histogram with an unlucky sample size.

The only thing that really matters in this query is the tablescan of the GTT, and obviously we could again consider creating an index on (FID) as a possible way to reduce the workload, giving us a single extra index that might benefit both queries. In this case, though, there are (only) three other columns from the GTT appearing in the query, so we could create an index that included those columns so that Oracle didn’t have to visit the table at all. For example (FID, eid, etyp, dcd) – with some level of compression that we’d have to decide on by reviewing the data.

Of course that does leave us wanting two new indexes on the same large table that have to be maintained as the data is loaded (remember that you can’t create indexes on GTT’s after the data has been inserted). So we ought to look for alternative strategies.

Again the reduction from tens of thousands of rows to “just a few” rows prompts an idea. I’m guessing that Oracle aggregates (distinct) the GTT table down with the equivalent of an inline view: “select distinct fid, eid, etyp, dcd from … “ (the 10053 (CBO) trace might show this as one of the “unparsed” queries). If that’s the case, why not have a summary GTT that emulates a “refresh complete materialized view” on the main GTT and populate it with the result of that select distinct immediately after populating the main GTT; then reference the summary GTT in the join query instead of the main GTT. With a reduction by a factor of around 10,000 in the number of rows and a further reduction due to the column restriction, the tablescan might drop from 22GB to just a couple of (cached) megabytes.
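A sketch of the summary GTT might look like this. The table name f_gtt_distinct is hypothetical, and "on commit preserve rows" is an assumption (it should match whatever the main GTT uses). The "where 1 = 0" CTAS is just a convenient way to copy the column definitions without knowing the datatypes.

```sql
-- one-off DDL: create an empty summary GTT with the right column types
create global temporary table f_gtt_distinct
on commit preserve rows         -- assumption: match the main GTT
as
select  distinct fid, eid, etyp, dcd
from    f_gtt
where   1 = 0
/

-- run immediately after populating the main GTT
insert into f_gtt_distinct
select  distinct fid, eid, etyp, dcd
from    f_gtt
/
```

The join query would then reference f_gtt_distinct in place of F_GTT, keeping its "FID = {literal value}" predicate.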

With the “materialized view” concept in mind, we could revisit the first query. The big problem with the current coding strategy is that we populate a large table once, then keep executing a pair of queries that are interested in one FID at a time. Maybe we could put a wrapper around the whole thing to populate two pre-calculated, small, summary tables, and then tweak the two problem queries to reference their corresponding “materialized views”.
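For the first query the pre-calculated summary could be sketched as follows. The table name f_gtt_agg and the column aliases are hypothetical, "on commit preserve rows" is an assumption, and the "several more summed columns" of the original query would need to be added in the same way.

```sql
-- one-off DDL: empty summary GTT with the aggregate columns
create global temporary table f_gtt_agg
on commit preserve rows         -- assumption: match the main GTT
as
select  fid, dcd, mdt, sum(n_cd) sum_n_cd, sum(d_nsd) sum_d_nsd
from    f_gtt
where   1 = 0
group by
        fid, dcd, mdt
/

-- run once, after loading the main GTT and before the loop through the FIDs
insert into f_gtt_agg
select  fid, dcd, mdt, sum(n_cd), sum(d_nsd)
from    f_gtt
group by
        fid, dcd, mdt
/

-- the per-FID query then reads the small summary table
select  fid, dcd, mdt, 1.00, sum_n_cd, sum_d_nsd
from    f_gtt_agg
where   fid = {literal constant}
order by
        1, 3
/
```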

Summary

The system populates a very large GTT, then seems to be running a loop to execute two queries per “FID” value. Both queries spend most of their time doing a tablescan of the large GTT before aggregating a few thousand rows down to a handful of rows.

A basic strategy for improving performance would be to create two new indexes on this table (though one might be “good enough” for both queries in the short term); the indexing strategy could be made more effective if the code to load the GTT could be modified to pre-sort the data before loading it.

An alternative strategy that takes advantage of the massive drop in volume as the data is aggregated is to pre-load two (much smaller) “summary” GTTs after loading the main GTT and before starting the loop through the FIDs. This would require the two queries to be edited to reference their respective summary tables. (The summary tables could, of course, be indexed to add a little extra benefit.)

November 21, 2022

Row_number() sorts

Filed under: Oracle,Troubleshooting,Tuning,Upgrades — Jonathan Lewis @ 5:47 pm GMT Nov 21,2022

An email on the Oracle-L list server a few days ago described a performance problem that had appeared after an upgrade from 11.2.0.4 to 19c (19.15). A long running statement (insert as select, running parallel 16) that had run to completion in 11g using about 20GB of temporary space (with 50GB read and written) had failed after running for a couple of hours in 19c and consuming 2.5 TB of temporary space even when the 11g execution plan was recreated through an SQL Profile.

When I took a look at the SQL Monitor report for 19c it turned out that a large fraction of the work done was in an operation called WINDOW CHILD PUSHED RANK which was there to deal with a predicate:

row_number() over(partition by t.ds_no, t.c_nbr order by c.cpcl_nbr desc) = 1

Checking the successful 11g execution, this operation had taken an input rowsource of 7 Billion rows and produced an output rowsource of 70 Million rows.

Checking the SQL Monitor report for the failed executions in 19c, the “pure” 19c plan had reported 7 billion input rows, 6GB memory allocated and 1TB of temp space at the same point; the plan with the 11g profile had reported 10 billion input rows, but the operation had not yet reported any output rows despite reporting 9GB as the maximum memory allocation and 1TB as the maximum temp space usage. (Differences in row counts were probably due to the report being run for different dates.)

So, the question to the list server was: “is this a bug in 19c?”

Modelling

It’s a little unfortunate that I couldn’t model the problem in 19c at the time because my 19c VM kept crashing; but I built a very simple model to allow me to emulate the window sort and row_number() predicate in an 11g instance, then re-played the model in an instance of 21c.

For the model data I took 50 copies of the first 50,000 rows from view all_objects to produce a table of 2,500,000 rows covering 35,700 blocks and 279 MB, (55,000 blocks / 430 MB in 21c); then I ran the query below and reported its execution plan with a basic call to dbms_xplan.display_cursor():

select
        /*+ dynamic_sampling(0) */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

-------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       | 29491 (100)|          |
|   1 |  SORT GROUP BY             |      |     8 |   184 |       | 29491   (9)| 00:02:28 |
|   2 |   VIEW                     |      |  2500K|    54M|       | 28532   (6)| 00:02:23 |
|*  3 |    VIEW                    |      |  2500K|   112M|       | 28532   (6)| 00:02:23 |
|*  4 |     WINDOW SORT PUSHED RANK|      |  2500K|    95M|   124M| 28532   (6)| 00:02:23 |
|   5 |      TABLE ACCESS FULL     | T1   |  2500K|    95M|       |  4821   (8)| 00:00:25 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY
              INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Oracle 21c produced the same execution plan – though the row estimate for the VIEW operations (numbers 2 and 3) was a more realistic 46,236 (num_distinct recorded for object_name) compared to the unchanged 2,500,000 from 11g. (Of course it should have been operation 4 that showed the first drop in cardinality.)

With my first build, the timings weren’t what I expected: In 21c the query completed in 3.3 seconds, in 11g it took 11.7 seconds. Most of the difference was due to a large (55MB) spill to temp space that appeared in 11g but not in 21c. This would have been because my 11g wasn’t allowed a large enough PGA, so I set the workarea_size_policy to manual and the sort_area_size to 100M, which looks as if it should have been enough to cover the 11g requirement – it wasn’t and I had to grow the sort_area_size to 190 MB before the 11g operation completed in memory, allocating roughly 155MB. By comparison 21c reported an increase of only 19MB of PGA to run the query, claiming that it needed only 4.7MB to handle the critical operation.
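If you want to reproduce this type of measurement, the following queries are one possible way to see the workarea figures and the session's PGA. This is a sketch rather than the method used for the numbers above; the substitution variable m_sql_id is hypothetical and the session needs select privileges on the v$ views.

```sql
-- workarea statistics for the most recent execution of each operation
select
        operation_id, operation_type, policy,
        round(estimated_optimal_size / 1048576, 1)    optimal_mb,
        round(estimated_onepass_size / 1048576, 1)    onepass_mb,
        round(last_memory_used       / 1048576, 1)    last_mem_mb,
        last_execution,                               -- OPTIMAL, 1 PASS, etc.
        round(last_tempseg_size      / 1048576, 1)    last_temp_mb
from
        v$sql_workarea
where
        sql_id = '&m_sql_id'    -- hypothetical substitution variable
order by
        operation_id
/

-- current and maximum PGA for this session (compare before and after the test)
select
        sn.name, round(ms.value / 1048576, 1) mb
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in ('session pga memory', 'session pga memory max')
/
```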

For comparison purposes here are the two run-time execution plans, with rowsource execution stats (which messed the timing up a little) and the column projection information;

Results for 11g

-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:03.96 |   35513 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:03.96 |   35513 |  3072 |  3072 | 2048  (0)|
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:04.07 |   35513 |       |       |          |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:03.93 |   35513 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|   1454K|00:00:08.82 |   35513 |   189M|  4615K|  168M (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:10.85 |   35513 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) "OWNER"[VARCHAR2,30], MAX("OBJECT_NAME")[30]
   2 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30]
   3 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "ORANK"[NUMBER,22]
   4 - (#keys=2) "OBJECT_NAME"[VARCHAR2,30], INTERNAL_FUNCTION("OBJECT_TYPE")[19], "OWNER"[VARCHAR2,30], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "OBJECT_TYPE"[VARCHAR2,19]

It’s an interesting oddity, and possibly a clue about the excess memory and temp space, that the A-Rows column for the Window Sort operation reports 1,454K rows output when it surely ought to be the final 28,575 at that point. It’s possible to imagine a couple of strategies that Oracle might be following to do the window sort that would result in the excess volume appearing, and I’ll leave it to the readers to investigate that.
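As a cross-check on what the Window Sort "ought" to return: since the row_number() predicate keeps exactly one row per partition, the expected output is simply the number of distinct values of the partitioning column. A minimal sketch against the model table:

```sql
select
        count(distinct object_name)     expected_rows
from
        t1
/
```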

Results for 21c

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:02.98 |   54755 |  54750 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:02.98 |   54755 |  54750 |  5120 |  5120 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |  5297K|   950K| 4708K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.28 |   54755 |  54750 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1; rowset=256) "OWNER"[VARCHAR2,128], MAX("OBJECT_NAME")[128]
   2 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128]
   3 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "ORANK"[NUMBER,22]
   4 - (#keys=2; rowset=256) "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23], "OWNER"[VARCHAR2,128], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23]

In this case we see the A-rows from the Window Sort meeting our expectations – but that may be a beneficial side effect of the operation completing in memory.

Given the dramatically different demands for memory for a query that ought to do the same thing in both versions it looks as if 21c may be doing something clever that 11g doesn’t do, or maybe doesn’t do very well, or maybe tries to do but has a bug that isn’t dramatic enough to be obvious unless you’re looking closely.

Modelling

Here’s a script that I used to build the test data, with scope for a few variations in testing. You’ll notice that the “create table” includes an “order by” clause that is close to the sorting requirement of the over() clause that appears in the query. The results I’ve shown so far were for data that didn’t have this clause in place.

rem
rem     Script:         analytic_sort_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested
rem             21.3.0.0
rem             19.11.0.0
rem             12.2.0.1
rem             11.2.0.4
rem

create table t1 nologging 
as
select 
        ao.*
from
        (select * from all_objects where rownum <= 50000) ao,
        (select rownum from dual connect by rownum <= 50)
order by
        object_name, object_type -- desc
/

--
--      Stats collection to get histograms
--

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 254'
        );
end;
/

--
-- reconnect here to maximise visibility of PGA allocation
--

connect xxxxxxxx/xxxxxxxx

set linesize 180
set trimspool on
set tab off

-- alter session set workarea_size_policy = manual;
-- alter session set sort_area_size = 199229440;

alter session set events '10046 trace name context forever, level 8';
-- alter session set statistics_level = all;
-- alter session set "_rowsource_execution_statistics"= true;

spool analytic_sort_2

select
        /*  monitoring */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

select * from table(dbms_xplan.display_cursor(format=>'cost bytes allstats last projection'));

alter session set events '10046 trace name context off';
alter session set "_rowsource_execution_statistics"= false;
alter session set statistics_level = typical;
alter session set workarea_size_policy = auto;

spool off

The results I’m going to comment on now are the ones I got after running the script with the order by clause in place, then reconnecting and flushing the shared pool before repeating the second half of the script (i.e. without recreating the table).

In 11g, going back to the automatic workarea sizing the session used 37MB of memory and then spilled (only) 3MB to temp. The run time was approximately 3 seconds – which is a good match for the “unsorted” 21c run time. As with the original tests, the value reported in A-rows is larger than we would expect (in this case suspiciously close to twice the correct value – but that’s more likely to be a coincidence than a clue). Interestingly, when I switched to the manual workarea_size_policy and set the sort_area_size to 190MB Oracle said “that’s the optimum memory” and used nearly all of it to complete in memory – for any value less than that (even down to 5MB) Oracle spilled just 3 MB to disk in a one-pass operation. So it looks as if Oracle “knows” it doesn’t need to sort the whole data set, but still uses as much memory as is available to do something before it starts to get clever.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:01.76 |   35523 |   2145 |    331 |       |       |          |         |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:01.76 |   35523 |   2145 |    331 |  2048 |  2048 | 2048  (0)|         |
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:02.00 |   35523 |   2145 |    331 |       |       |          |         |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:01.83 |   35523 |   2145 |    331 |       |       |          |         |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|  57171 |00:00:02.10 |   35523 |   2145 |    331 |  2979K|   768K|   37M (1)|    3072 |
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:11.84 |   35513 |   1814 |      0 |       |       |          |         |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In 21c there’s essentially no difference between the sorted and unsorted tests, which suggests that with my data the session had been able to apply its optimisation strategy at the earliest possible moment rather than waiting until it had no alternative but to spill to disc.

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:00.98 |   54753 |  54748 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:00.98 |   54753 |  54748 |  4096 |  4096 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |  5155K|   940K| 4582K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.42 |   54753 |  54748 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Bug description: possibly

Given the way that 11g reports a very small spill to disc, which stays fairly constant in size no matter how large or small the available PGA allocation is, when the input data is sorted to help the over() clause, and given how large the spill to disc can become when the data is not sorted, I feel that Oracle has an optimisation that discards input rows early in the analytic window sort. But we also have some evidence of a flaw in the code in versions prior to 21c that means Oracle fails to re-use memory that becomes available from rows that have been discarded.

This means the OP’s problem may have been just bad luck in terms of available memory and (relatively) tiny variations in demands for space between the 11g and 19c instances perhaps due to differences in the quantity or distribution of data.

Although the impact was dramatic in this case, a query that is supposed to return 70 million rows (irrespective of how many it starts with) is an extreme case, and one that deserves a strong justification and a significant investment in time spent on finding cunning optimisation strategies.

So maybe this is a bug that doesn’t usually get noticed that will go away on an upgrade to 21c; and maybe there’s a backport and patch already available if you can find a bug number in the 21c patch release notes.

Strategy

I’ve said in the past that if you’re using analytic functions you ought to minimise the size of the data you’re processing before you apply the analytic part. Another step that can help is to make sure you’ve got the data into a (fairly well) sorted order before you reach the analytic part.

In the case of versions of Oracle prior to 21c, it also seems to make sense (if you can arrange it) to minimise the amount of memory the session is allowed to use for a sort operation, as this will reduce the CPU used by the session and avoid grabbing excess redundant memory that could be used more effectively by other sessions.
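As a rough sketch of what “minimising the memory” could look like at the session level, the settings below switch to manual workarea sizing with a small sort_area_size before the analytic query runs, then restore automatic sizing afterwards. The 5MB figure is simply the smallest value mentioned in the tests above; it is an illustration, not a recommendation, and the right value for any real system would need testing.

```sql
-- Sketch only: cap the sort memory for one session (pre-21c strategy).
-- sort_area_size is specified in bytes; 5242880 = 5MB, the smallest
-- value tried in the tests above, used here purely for illustration.
alter session set workarea_size_policy = manual;
alter session set sort_area_size = 5242880;

-- ... run the analytic query here ...

-- Restore automatic workarea sizing for the session.
alter session set workarea_size_policy = auto;
```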

Addendum

Just before publishing I found a way of keeping my 19.11.0.0 instance alive long enough to run the tests, then also ran them on an instance of 12.2.0.1. Both versions showed the same pattern of doing a large allocation of memory and large spill to disc when the data was not sorted, and a large allocation of memory but a small spill to disc when the data was sorted.

As a little sanity check I also exported the 19c data and imported it to 21c in case it was a simple variation in the data that allowed 21c to operate more efficiently than 19c. The change in data made no difference to the way in which 21c handled it: in both cases it called for a small allocation of memory with no spill to disc.

November 15, 2022

opt_estimate 4a

Filed under: CBO,Execution plans,Hints,Oracle,Tuning — Jonathan Lewis @ 11:21 am GMT Nov 15,2022

I wrote a batch of notes about the opt_estimate() hint a couple of years ago, including one where I explained the option for using the hint to specify the number of rows in a query block. I’ve just come across a particular special case for that strategy that others might find a use for. It’s something to do when using the “select from dual … connect by” trick for multiplying rows.

Here’s a little data to model the idea – I’ve used the all_objects view to generate some “well-known” data since I want to add a tiny bit of complexity to the query while still leaving it easy to understand the index. The results from this demonstration come from Oracle 21.3.0.0, and I’ve included the hint /*+ no_adaptive_plan */ to stop Oracle from getting too clever during optimisation.

rem
rem     Script:         opt_estimate_dual.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested 
rem             21.3.0.0
rem

create table tables_table as select * from all_objects where object_type = 'TABLE';
create table objects_table as select * from all_objects;

alter table objects_table add constraint ot_pk primary key(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'tables_table',
                method_opt => 'for columns size 60 owner'
        );
end;
/

set serveroutput off

with driver as (
        select  /*+ materialize */
                tt.owner, tt.object_id, v1.rn
        from    tables_table tt,
                (
                select
                        /*+  opt_estimate(query_block scale_rows=10) */
                        rownum rn
                from    dual
                connect by
                        level <= 10
                ) v1
        where
                tt.owner = 'OUTLN'
)
select  /*+ no_adaptive_plan */
        dr.rn, dr.owner, dr.object_id,
        ot.object_id, ot.owner, ot.object_type, ot.object_name
from
        driver dr,
        objects_table   ot
where
        ot.object_id = dr.object_id
/

select * from table(dbms_xplan.display_cursor(format => 'hint_report'));


In my system tables_table holds 727 rows and objects_table holds 58383 rows. Three rows in tables_table correspond to tables owned by user ‘OUTLN’ which means I expect the driver CTE (common table expression / “with” subquery) to generate 30 rows and, given the join on unique id, the query to return 30 rows.

I’ve used the /*+ materialize */ hint to force Oracle to create an in-memory temporary table for the driver CTE, the /*+ no_adaptive_plan */ hint to stop Oracle from getting too clever during optimisation, and the critical /*+ opt_estimate() */ hint to help the optimizer understand the effect of my “connect by” on dual. Here’s the execution plan I get if I don’t include that last hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    14 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6632_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |     3 |    78 |     9   (0)| 00:00:01 |
|   4 |     VIEW                                 |                            |     1 |    13 |     2   (0)| 00:00:01 |
|   5 |      COUNT                               |                            |       |       |            |          |
|   6 |       CONNECT BY WITHOUT FILTERING       |                            |       |       |            |          |
|   7 |        FAST DUAL                         |                            |     1 |       |     2   (0)| 00:00:01 |
|   8 |     BUFFER SORT                          |                            |     3 |    39 |     9   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS FULL                   | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |     3 |   276 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6632_31D19D4 |     3 |    78 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


I’ve highlighted operations 4 and 8 in the plan: operation 4 is the view of dual that has generated 10 rows – unfortunately the optimizer has only considered the stats of the dual table, and hasn’t factored in the effects of the “connect by with rownum”. Operation 8 shows us that the optimizer has (correctly, thanks to the histogram I requested) estimated 3 rows for the tablescan of tables_table (operation 9) that feeds it. The result of these two estimates is that operation 3 reports an estimate of 3 ( = 3 * 1 ) rows to be used in probing objects_table.
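A quick check (a minimal sketch, not part of the original script) shows the gap between the optimizer’s estimate of 1 row for the view and what the inline view really produces at run time:

```sql
-- The optimizer's estimate for this view is 1 row (from dual),
-- but "connect by level <= 10" generates 10 rows when it runs.
select  count(*)
from    (
        select  rownum rn
        from    dual
        connect by
                level <= 10
        )
;
```

The count comes back as 10, which is the figure the opt_estimate() hint has to supply on the optimizer’s behalf.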

This is the plan after enabling the /*+ opt_estimate() */ hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    45 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6633_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |    30 |   780 |    13   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |    10 |   130 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |    10 |   130 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |    30 |  2760 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6633_31D19D4 |    30 |   780 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


There are three things that stand out in this report.

  • I’ve highlighted operations 4 and 6: operation 4 is the tablescan of tables_table that correctly estimates 3 rows; operation 6 is the view operation that now correctly estimates 10 rows.
  • With the correct estimate for the view the estimate for the join to objects_table is now correct and the join order for the merge join cartesian at operation 3 has been reversed.
  • The Hint Report tells us that the opt_estimate() hint is not (always) an optimizer hint: the report still lists only 2 hints for the statement and makes no mention of opt_estimate(). This is a real pain because when the opt_estimate() hints you’ve tried to use don’t appear to work it’s not easy to work out what you’ve done wrong.

To make a point, I can take the demo a little further by changing the /*+ opt_estimate() */ hint to scale_rows=120. Here’s the body of the resulting plan:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |   369 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D663A_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |   360 |  9360 |    13   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |   120 |  1560 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |   120 |  1560 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   HASH JOIN                              |                            |   360 | 54360 |   356   (1)| 00:00:01 |
|  11 |    VIEW                                  |                            |   360 | 33120 |     2   (0)| 00:00:01 |
|  12 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D663A_31D19D4 |   360 |  9360 |     2   (0)| 00:00:01 |
|  13 |    TABLE ACCESS FULL                     | OBJECTS_TABLE              | 58383 |  3363K|   354   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

The earlier plans used a nested loop join into objects_table. In this plan we can see at operation 10 that the optimizer has selected a hash join because the larger row estimate for the CTE has increased the cost of the query beyond the inflection point between nested loop and hash joins.

Summary

If you need to use the “connect by” in an inline view then you may find that the optimizer gets a very bad estimate of the number of rows the view definition will generate and that an /*+ opt_estimate() */ hint in the view using the “scale_rows=nnn” option will produce better estimates of cardinality, hence a better plan.

Footnote

In this particular case where I’ve used the dual table by itself in an inline view I could have used the “rows=NNN” option to get the same effect.

In any case I could have added a /*+ qb_name() */ hint to the inline view, and included a qualifying “@{queryblock}” in the /*+ opt_estimate() */ hint.
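To make those two footnotes concrete, here is a sketch of how the inline view might look with both changes applied. The query block name dual_gen is invented for illustration, and since opt_estimate() is undocumented the exact syntax should be treated as an assumption to be checked against the resulting plan, not as a guaranteed form.

```sql
-- Sketch only: opt_estimate() is undocumented, so check the plan's
-- Rows column to confirm that the hint has taken effect.
-- "dual_gen" is a made-up query block name for this example.
select
        /*+
                qb_name(dual_gen)
                opt_estimate(@dual_gen query_block rows=10)
        */
        rownum rn
from    dual
connect by
        level <= 10
;
```

Naming the query block explicitly means the hint no longer depends on where it sits in the statement, which can make it easier to see which block a misbehaving hint was meant to address.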

Using hints is hard, especially when they’re not documented. There is a lot more to learn about this hint: for example, telling the optimizer about the size of a rowsource doesn’t help if it’s going to use its estimate of distinct values in the next steps of the plan – a correction you’ve managed to inject at one stage may disappear in the very next optimizer calculation.

This catalogue lists more articles on the opt_estimate() hint and its relatives.

June 17, 2022

Case Study

Filed under: Execution plans,Oracle,Parallel Execution,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 1:25 pm BST Jun 17,2022

The question “How do you trouble-shoot a performance problem” came up in an online session I did for the AIOUG (All-India OUG) today. It’s a very broad question and the only possible answers are either extremely generic, or very specific – so here’s a specific example that I talked about to give some idea of the processes I go through. It’s an example from the Oracle-l list server asking the (paraphrased) question:

I’ve got a parallel query that takes 24 seconds to return 2,500 rows for 2018. The same query for 2019 should return a similar amount of data but consumes a lot of TEMP space before failing; it takes 45 minutes to complete if I remove the parallel hint. The only difference between the two queries is a change to a single predicate: “opclf.year_number = to_number(‘YYYY’)” and the statistics are up to date – what’s going on?

The ease of diagnosing a problem is dependent on the ease of access to all the right information, and you don’t always know initially what the right information might be. In this case the OP had started by posting to github the SQL Monitor reports for the two queries, which were automatically generated since the queries were using parallel execution.

The first few sections of the 2019 output:
SQL Text
------------------------------
SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id AS doc_id ,'firm_id' AS criteria_key ,opf.ultimate_parent_firm_id AS series_id ,period_number FROM ( SELECT /*+ PARALLEL(8) */ DISTINCT rd.document_id ,rd.client_role_id ,( CASE WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12 THEN TRUNC(dc.date_value, 'YEAR') ELSE TRUNC(dc.date_value, 'MONTH') END ) period_number FROM REPORT_ENGINE.date_code dc ,REPORT_ENGINE.lit_fact_bd rd INNER JOIN
report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY') AND TO_DATE('04/28/2022', 'MM/DD/YYYY') AND rd.publication_date_id = dc.date_id AND rd.year_number = to_number('2019') AND (FIRM.ultimate_parent_firm_id IN (to_number('252094'))) ) rd INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id AND op.opposing_counsel_role_id
= opclf.client_role_id AND opclf.year_number = to_number('2019') INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id AND opf.firm_id >= 1000 WHERE 1 = 1

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  SYS (647:28741)
 SQL ID              :  3cjg20q2vw41f
 SQL Execution ID    :  16777216
 Execution Started   :  06/09/2022 05:08:24
 First Refresh Time  :  06/09/2022 05:08:25
 Last Refresh Time   :  06/09/2022 05:08:27
 Duration            :  221s
 Module/Action       :  sqlplus@c111dhw (TNS V1-V3)/-
 Service             :  SYS$USERS
 Program             :  sqlplus@c111dhw (TNS V1-V3)

Global Stats
====================================================================
| Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |
| Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |
====================================================================
|     222 |     179 |       28 |       15 |    49M | 11624 |   2GB |
====================================================================

Parallel Execution Details (DOP=8 , Servers Allocated=16)
=====================================================================================================================================
|      Name      | Type  | Server# | Elapsed |   Cpu   |    IO    |  Other   | Buffer | Write | Write |         Wait Events         |
|                |       |         | Time(s) | Time(s) | Waits(s) | Waits(s) |  Gets  | Reqs  | Bytes |         (sample #)          |
=====================================================================================================================================
| PX Coordinator | QC    |         |    0.26 |    0.26 |          |     0.00 |  12144 |       |     . |                             |
| p000           | Set 1 |       1 |         |         |          |          |        |       |     . |                             |
| p001           | Set 1 |       2 |         |         |          |          |        |       |     . |                             |
| p002           | Set 1 |       3 |         |         |          |          |        |       |     . |                             |
| p003           | Set 1 |       4 |         |         |          |          |        |       |     . |                             |
| p004           | Set 1 |       5 |         |         |          |          |        |       |     . |                             |
| p005           | Set 1 |       6 |         |         |          |          |        |       |     . |                             |
| p006           | Set 1 |       7 |         |         |          |          |        |       |     . |                             |
| p007           | Set 1 |       8 |         |         |          |          |        |       |     . |                             |
| p008           | Set 2 |       1 |     220 |     177 |       28 |       15 |    48M | 11624 |   2GB | direct path write temp (28) |
| p009           | Set 2 |       2 |         |         |          |          |        |       |     . |                             |
| p010           | Set 2 |       3 |         |         |          |          |        |       |     . |                             |
| p011           | Set 2 |       4 |    1.71 |    1.70 |          |     0.01 |   595K |       |     . |                             |
| p012           | Set 2 |       5 |         |         |          |          |        |       |     . |                             |
| p013           | Set 2 |       6 |         |         |          |          |        |       |     . |                             |
| p014           | Set 2 |       7 |         |         |          |          |        |       |     . |                             |
| p015           | Set 2 |       8 |         |         |          |          |        |       |     . |                             |
=====================================================================================================================================

You’ll note that I haven’t got as far as the body of the execution plan yet, and I’ve highlighted line 44 (the entry for p008) – a line in the middle of the summary of activity for the parallel execution processes. There are 8 servers in each of two sets (we’re running parallel 8) and the line I’ve highlighted is the first server of the second set. The reason I’ve highlighted it is that it’s the one server that’s doing almost all the work – effectively the query (at some point in the plan) is running serially.

So, a first thought: maybe we’ve just been unlucky and running parallel 7 (say) would result in a better distribution of data across parallel servers and allow each of 7 processes to do whatever they had to do to a much smaller amount of data. Maybe a change to the distribution method (pq_distribute() hint) would spread the workload more evenly. In either case “N” smaller workloads might still use a lot of TEMP, but possibly no individual process would fail, and the job would complete nearly N times as quickly.

Note: the 2018 Monitor report showed an equivalent skew in the data distribution, but the amount of work needed was much smaller with a read/write load of only 143MB compared to 2GB for the 2019 report. The OP did try running parallel 7, but with no change to the overall effect.
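For a parallel query that does run to completion (as the 2018 version does), one way to put numbers on this kind of skew is to query v$pq_tqstat from the same session immediately after the query finishes. This is a minimal sketch and was not part of the OP’s posting:

```sql
-- Rows and bytes passed through each table queue by each process.
-- Must be run in the session that executed the parallel query,
-- straight after the query completes.
select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;
```

A badly skewed distribution shows up as one consumer process reporting a num_rows figure far larger than its peers for the same tq_id.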

Let’s take a quick glance at the plan body:
SQL Plan Monitoring Details (Plan Hash Value=1862006233)
=========================================================================================================================================================================================================================
| Id    |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Write | Write | Mem | Temp | Activity |        Activity Detail       |
|       |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs  | Bytes |     |      |   (%)     |         (# samples)         |
=========================================================================================================================================================================================================================
|     0 | SELECT STATEMENT                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     1 |   PX COORDINATOR                                   |                           |         |      |           |        |     3 |          |       |       |     |      |   |                             |
|     2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     3 |     HASH UNIQUE                                    |                           |     781 |  153 |           |        |       |          |       |       |     |      |   |                             |
|     4 |      PX RECEIVE                                    |                           |     781 |  151 |           |        |       |          |       |       |     |      |   |                             |
|     5 |       PX SEND HASH                                 | :TQ10002                  |     781 |  151 |           |        |     2 |          |       |       |     |      |   |                             |
|  -> 6 |        BUFFER SORT                                 |                           |     781 |  153 |       219 |     +3 |     2 |        0 | 11624 |   2GB | 61M |   2G |    26.21 | Cpu (26)                     |
|       |                                                    |                           |         |      |           |        |       |          |       |       |     |      |   | direct path write temp (28) |
|  -> 7 |         NESTED LOOPS                               |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
|  -> 8 |          NESTED LOOPS                              |                           |     781 |  151 |       219 |     +3 |     2 |      39M |       |       |     |      |     0.49 | Cpu (1)                      |
|  -> 9 |           NESTED LOOPS                             |                           |     777 |   44 |       219 |     +3 |     2 |      39M |       |       |     |      |   |                             |
| -> 10 |            NESTED LOOPS                            |                           |      41 |   26 |       219 |     +3 |     2 |     6463 |       |       |     |      |   |                             |
| -> 11 |             HASH JOIN                              |                           |      41 |   21 |       219 |     +3 |     2 |     6463 |       |       |  6M |      |   |                             |
|    12 |              BUFFER SORT                           |                           |         |      |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    13 |               PX RECEIVE                           |                           |      87 |   19 |         1 |     +3 |     2 |    36855 |       |       |     |      |   |                             |
|    14 |                PX SEND HASH                        | :TQ10000                  |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    15 |                 NESTED LOOPS                       |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    16 |                  NESTED LOOPS                      |                           |      87 |   19 |         1 |     +3 |     1 |    38694 |       |       |     |      |   |                             |
|    17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +3 |     1 |       43 |       |       |     |      |   |                             |
|    19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +3 |    43 |    38694 |       |       |     |      |   |                             |
|    21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +3 |    43 |       49 |       |       |     |      |   |                             |
|    22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      63 |   19 |         3 |     +1 | 38694 |    38694 |       |       |     |      |     0.49 | Cpu (1)                      |
|    23 |              PX RECEIVE                            |                           |      20 |    2 |         1 |     +3 |     2 |        2 |       |       |     |      |   |                             |
|    24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
|    26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |           |        |       |          |       |       |     |      |   |                             |
| -> 27 |             TABLE ACCESS BY INDEX ROWID            | DATE_CODE                 |       1 |      |       219 |     +3 |  6465 |     6463 |       |       |     |      |   |                             |
| -> 28 |              INDEX UNIQUE SCAN                     | PK_DATE_CODE              |       1 |      |       219 |     +3 |  6465 |     6465 |       |       |     |      |   |                             |
| -> 29 |            PARTITION RANGE SINGLE                  |                           |      19 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |   |                             |
| -> 30 |             TABLE ACCESS BY LOCAL INDEX ROWID      | LIT_FACT_BD               |      19 |      |       220 |     +2 |  6465 |      39M |       |       |     |      |    35.92 | Cpu (74)                     |
| -> 31 |              INDEX RANGE SCAN                      | LIT_FACT_BD_IDX20         |       1 |      |       219 |     +3 |  6465 |      39M |       |       |     |      |     9.22 | Cpu (19)                     |
| -> 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    10.68 | Cpu (22)                     |
| -> 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |       219 |     +3 |   39M |      39M |       |       |     |      |    16.99 | Cpu (35)                     |
===================================================================================================================================================================

You can see from the “->” symbols at the left hand side of the plan that this report was generated while the plan was still running. The thing that leaps out as you glance down the page is the value in the “Rows (Actual)” column at operations 7-9 (which show the rowsources generated by some nested loop joins) and operations 29, 32 and 33 of the plan that tell us something about how those rowsources were generated.

Operation 29 has executed (Execs) 6,465 times so far, producing a total of 39M rows, and operations 32 and 33 have each executed 39M times, producing a total of 39M rows by index unique scan.

The plan for the 2018 data was similar though the join order for DATE_CODE, LIT_FACT_BD and FIRM was different (and it was the join to LIT_FACT_BD that increased the row count dramatically – so hinting it to be the last table in the join might help a bit), but the largest rowcount for the 2018 query was only 3M rows, not the 39M that had appeared after only 6,465 rows of a possible driving 39,855 in the 2019 query.

So it’s not surprising that the query could take much longer for 2019. It’s not the volume of output that matters, it’s the volume of input (or, more accurately, throughput or intermediate) data that matters.

Let’s think about that volume, though: the 2018 plan generated 3M rows and then crunched them down to 2,500 rows and the 2019 plan was supposed to produce a similar sized output (from 39M+ rows). Could we rewrite the query in some way that made it do some intermediate aggregation so that the volume of data to be aggregated was never enormous?

Let’s take a look at the plan from the 2018 Monitor report:
SQL Plan Monitoring Details (Plan Hash Value=472871521)
========================================================================================================================================================================================================================================
| Id |                     Operation                      |           Name            |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity |       Activity Detail       |
|    |                                                    |                           | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |         (# samples)         |
========================================================================================================================================================================================================================================
|  0 | SELECT STATEMENT                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |          |                             |
|  1 |   PX COORDINATOR                                   |                           |         |      |         1 |    +24 |    17 |     2613 |      |       |       |       |       |       |          |                             |
|  2 |    PX SEND QC (RANDOM)                             | :TQ10003                  |       1 |   39 |         1 |    +24 |     8 |     2613 |      |       |       |       |       |       |          |                             |
|  3 |     HASH UNIQUE                                    |                           |       1 |   39 |         9 |    +16 |     8 |     2613 |      |       |       |       |    9M |       |     6.90 | Cpu (2)                     |
|  4 |      PX RECEIVE                                    |                           |       1 |   38 |         9 |    +16 |     8 |       3M |      |       |       |       |       |       |          |                             |
|  5 |       PX SEND HASH                                 | :TQ10002                  |       1 |   38 |        12 |    +14 |     8 |       3M |      |       |       |       |       |       |     3.45 | Cpu (1)                     |
|  6 |        BUFFER SORT                                 |                           |       1 |   39 |        23 |     +2 |     8 |       3M | 4584 | 143MB |   703 | 143MB |  151M |  151M |    34.48 | Cpu (2)                     |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |          | direct path read temp (6)   |
|    |                                                    |                           |         |      |           |        |       |          |      |       |       |       |       |       |          | direct path write temp (2)  |
|  7 |         NESTED LOOPS                               |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |          |                             |
|  8 |          NESTED LOOPS                              |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |          |                             |
|  9 |           NESTED LOOPS                             |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |          |                             |
| 10 |            NESTED LOOPS                            |                           |       1 |   38 |        15 |     +2 |     8 |       3M |      |       |       |       |       |       |          |                             |
| 11 |             HASH JOIN                              |                           |      41 |   21 |        15 |     +2 |     8 |    19334 |      |       |       |       |    7M |       |          |                             |
| 12 |              BUFFER SORT                           |                           |         |      |        13 |     +2 |     8 |    19233 |      |       |       |       |    1M |       |          |                             |
| 13 |               PX RECEIVE                           |                           |      89 |   19 |        13 |     +2 |     8 |    19233 |      |       |       |       |       |       |          |                             |
| 14 |                PX SEND HASH                        | :TQ10000                  |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |          |                             |
| 15 |                 NESTED LOOPS                       |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |          |                             |
| 16 |                  NESTED LOOPS                      |                           |      89 |   19 |         1 |     +1 |     1 |    19233 |      |       |       |       |       |       |          |                             |
| 17 |                   TABLE ACCESS BY INDEX ROWID      | FIRM                      |       1 |    2 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |          |                             |
| 18 |                    INDEX RANGE SCAN                | FIRM_ULT_PARENT_FIRM_IDX1 |       1 |    1 |         1 |     +1 |     1 |       43 |      |       |       |       |       |       |          |                             |
| 19 |                   PARTITION RANGE SINGLE           |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |          |                             |
| 20 |                    BITMAP CONVERSION TO ROWIDS     |                           |         |      |         1 |     +1 |    43 |    19233 |      |       |       |       |       |       |          |                             |
| 21 |                     BITMAP INDEX SINGLE VALUE      | LIT_FACT_BD_IDX09         |         |      |         1 |     +1 |    43 |       51 |      |       |       |       |       |       |          |                             |
| 22 |                  TABLE ACCESS BY LOCAL INDEX ROWID | LIT_FACT_BD               |      64 |   19 |         1 |     +1 | 19233 |    19233 |      |       |       |       |       |       |          |                             |
| 23 |              PX RECEIVE                            |                           |      20 |    2 |        15 |     +2 |     8 |       20 |      |       |       |       |       |       |          |                             |
| 24 |               PX SEND HASH                         | :TQ10001                  |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |          |                             |
| 25 |                PX BLOCK ITERATOR                   |                           |      20 |    2 |         1 |    +14 |     8 |       20 |      |       |       |       |       |       |          |                             |
| 26 |                 TABLE ACCESS FULL                  | OPPOSING_COUNSEL          |      20 |    2 |         1 |    +14 |     3 |       20 |      |       |       |       |       |       |          |                             |
| 27 |             PARTITION RANGE SINGLE                 |                           |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |          |                             |
| 28 |              TABLE ACCESS BY LOCAL INDEX ROWID     | LIT_FACT_BD               |       1 |      |        16 |     +1 | 19334 |       3M |      |       |       |       |       |       |    17.24 | Cpu (5)                     |
| 29 |               INDEX RANGE SCAN                     | LIT_FACT_BD_IDX20         |       1 |      |        15 |     +2 | 19334 |       3M |      |       |       |       |       |       |          |                             |
| 30 |            TABLE ACCESS BY INDEX ROWID             | DATE_CODE                 |       1 |      |        15 |     +2 |    3M |       3M |      |       |       |       |       |       |    10.34 | Cpu (3)                     |
| 31 |             INDEX UNIQUE SCAN                      | PK_DATE_CODE              |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 32 |           INDEX UNIQUE SCAN                        | PK_FIRM                   |       1 |      |        23 |     +2 |    3M |       3M |      |       |       |       |       |       |     6.90 | Cpu (2)                     |
| 33 |          TABLE ACCESS BY INDEX ROWID               | FIRM                      |       1 |      |        16 |     +1 |    3M |       3M |      |       |       |       |       |       |    13.79 | Cpu (4)                     |
========================================================================================================================================================================================================================================

We see from operations 3 – 7 that the 3M rows generated from the nested loop joins pass up through a buffer sort operation before being crunched down to 2,613 rows. It’s probably the buffer sort (which is buffering but not sorting) that has mostly passed through a single server and spilled to disc in the 2019 report. We just don’t want those 39M+ rows ever to exist.

So how easy will it be to change the SQL?
SELECT
        /*+ PARALLEL(8) */
        DISTINCT rd.document_id AS doc_id
        ,'firm_id' AS criteria_key
        ,opf.ultimate_parent_firm_id AS series_id
        ,period_number
FROM (
        SELECT
                /*+ PARALLEL(8) */
                DISTINCT rd.document_id
                ,rd.client_role_id
                ,(
                        CASE 
                                WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12
                                        THEN TRUNC(dc.date_value, 'YEAR')
                                ELSE TRUNC(dc.date_value, 'MONTH')
                                END
                        ) period_number
        FROM REPORT_ENGINE.date_code dc
                ,REPORT_ENGINE.lit_fact_bd rd
        INNER JOIN report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id
        WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY')
                        AND TO_DATE('04/28/2022', 'MM/DD/YYYY')
                AND rd.publication_date_id = dc.date_id
                AND rd.year_number = to_number('2019')
                AND (FIRM.ultimate_parent_firm_id IN (to_number('252094')))
        ) rd
INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id
INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id
        AND op.opposing_counsel_role_id = opclf.client_role_id
        AND opclf.year_number = to_number('2019')
INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id
        AND opf.firm_id >= 1000
WHERE 1 = 1;

Lines 7-10 and 27 tell us we already have an inline view where we’re doing a “select distinct” and, unwinding the mix of “Oracle” and “ANSI” syntax, we can see that it joins DATE_CODE, LIT_FACT_BD and FIRM, and we know that one of those tables explodes the intermediate data size to something enormous. So it looks like the original author of this code had already worked out that the query needed to aggregate early.

Checking back to the original plans we note that there’s only one “hash unique” operation, and there’s no sign of a “view” operation, so maybe the performance problem is a result of the optimizer suddenly deciding it can do complex view merging with this inline view, and perhaps all we need to do is add the hint /*+ no_merge */ to the inline view and see what happens.
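As a sketch of what that change might look like, here is the original statement with the hint added to the inline view. The placement alongside the existing parallel hint is an assumption (the exact text used by the OP is not reproduced here); everything else is copied unchanged from the query above.

```sql
SELECT
        /*+ PARALLEL(8) */
        DISTINCT rd.document_id AS doc_id
        ,'firm_id' AS criteria_key
        ,opf.ultimate_parent_firm_id AS series_id
        ,period_number
FROM (
        SELECT
                /*+ PARALLEL(8) NO_MERGE */     -- assumed placement: blocks complex view merging of this inline view
                DISTINCT rd.document_id
                ,rd.client_role_id
                ,(
                        CASE 
                                WHEN MONTHS_BETWEEN(TO_DATE('04/28/2022', 'MM/DD/YYYY'), TO_DATE('04/01/2017', 'MM/DD/YYYY')) > 12
                                        THEN TRUNC(dc.date_value, 'YEAR')
                                ELSE TRUNC(dc.date_value, 'MONTH')
                                END
                        ) period_number
        FROM REPORT_ENGINE.date_code dc
                ,REPORT_ENGINE.lit_fact_bd rd
        INNER JOIN report_engine.firm FIRM ON rd.firm_id = FIRM.firm_id
        WHERE dc.date_value BETWEEN TO_DATE('04/01/2017', 'MM/DD/YYYY')
                        AND TO_DATE('04/28/2022', 'MM/DD/YYYY')
                AND rd.publication_date_id = dc.date_id
                AND rd.year_number = to_number('2019')
                AND (FIRM.ultimate_parent_firm_id IN (to_number('252094')))
        ) rd
INNER JOIN report_engine.opposing_counsel op ON rd.client_role_id = op.legal_role_id
INNER JOIN report_engine.lit_fact_bd opclf ON opclf.document_id = rd.document_id
        AND op.opposing_counsel_role_id = opclf.client_role_id
        AND opclf.year_number = to_number('2019')
INNER JOIN report_engine.firm opf ON opclf.firm_id = opf.firm_id
        AND opf.firm_id >= 1000
WHERE 1 = 1;
```

With no argument, no_merge applies to the query block in which it appears, so the inner “select distinct” should be evaluated as a separate step before the outer joins.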

Here’s the plan after adding the hint:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                            | Name                      | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                     |                           |       |       |   314 (100)|          |       |       |        |      |            |
|   1 |  PX COORDINATOR                                      |                           |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)                                | :TQ10005                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP |            |
|   4 |     PX RECEIVE                                       |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,05 | PCWP |            |
|   5 |      PX SEND HASH                                    | :TQ10004                  |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | P->P | HASH       |
|   6 |       HASH UNIQUE                                    |                           |    23 |  2599 |   314   (1)| 00:00:06 |       |       |  Q1,04 | PCWP |            |
|   7 |        NESTED LOOPS                                  |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP |            |
|   8 |         NESTED LOOPS                                 |                           |    23 |  2599 |   313   (1)| 00:00:06 |       |       |  Q1,04 | PCWP |            |
|   9 |          NESTED LOOPS                                |                           |    23 |  2323 |   310   (1)| 00:00:06 |       |       |  Q1,04 | PCWP |            |
|* 10 |           HASH JOIN                                  |                           |   388 | 21340 |   148   (1)| 00:00:03 |       |       |  Q1,04 | PCWP |            |
|  11 |            PX RECEIVE                                |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,04 | PCWP |            |
|  12 |             PX SEND BROADCAST                        | :TQ10002                  |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | P->P | BROADCAST  |
|  13 |              PX BLOCK ITERATOR                       |                           |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWC |            |
|* 14 |               TABLE ACCESS FULL                      | OPPOSING_COUNSEL          |    20 |   160 |     2   (0)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|  15 |            VIEW                                      |                           |   835 | 39245 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP |            |
|  16 |             HASH UNIQUE                              |                           |   835 | 63460 |   146   (1)| 00:00:03 |       |       |  Q1,04 | PCWP |            |
|  17 |              PX RECEIVE                              |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,04 | PCWP |            |
|  18 |               PX SEND HASH                           | :TQ10003                  |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | P->P | HASH       |
|* 19 |                HASH JOIN BUFFERED                    |                           |   835 | 63460 |   145   (0)| 00:00:03 |       |       |  Q1,03 | PCWP |            |
|  20 |                 BUFFER SORT                          |                           |       |       |            |          |       |       |  Q1,03 | PCWC |            |
|  21 |                  PX RECEIVE                          |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |  Q1,03 | PCWP |            |
|  22 |                   PX SEND HASH                       | :TQ10000                  |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        | S->P | HASH       |
|  23 |                    NESTED LOOPS                      |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      |            |
|  24 |                     NESTED LOOPS                     |                           |   835 | 52605 |   136   (0)| 00:00:03 |       |       |        |      |            |
|  25 |                      TABLE ACCESS BY INDEX ROWID     | FIRM                      |     1 |    12 |     2   (0)| 00:00:01 |       |       |        |      |            |
|* 26 |                       INDEX RANGE SCAN               | FIRM_ULT_PARENT_FIRM_IDX1 |     1 |       |     1   (0)| 00:00:01 |       |       |        |      |            |
|  27 |                      PARTITION RANGE SINGLE          |                           |       |       |            |          |    30 |    30 |        |      |            |
|  28 |                       BITMAP CONVERSION TO ROWIDS    |                           |       |       |            |          |       |       |        |      |            |
|* 29 |                        BITMAP INDEX SINGLE VALUE     | LIT_FACT_BD_IDX09         |       |       |            |          |    30 |    30 |        |      |            |
|* 30 |                     TABLE ACCESS BY LOCAL INDEX ROWID| LIT_FACT_BD               |   598 | 30498 |   136   (0)| 00:00:03 |    30 |    30 |        |      |            |
|  31 |                 PX RECEIVE                           |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,03 | PCWP |            |
|  32 |                  PX SEND HASH                        | :TQ10001                  |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | P->P | HASH       |
|  33 |                   PX BLOCK ITERATOR                  |                           |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWC |            |
|* 34 |                    TABLE ACCESS FULL                 | DATE_CODE                 |  1854 | 24102 |     9   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|  35 |           PARTITION RANGE SINGLE                     |                           |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP |            |
|* 36 |            TABLE ACCESS BY LOCAL INDEX ROWID         | LIT_FACT_BD               |     1 |    46 |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP |            |
|* 37 |             INDEX RANGE SCAN                         | LIT_FACT_BD_IDX20         |     1 |       |     0   (0)|          |    30 |    30 |  Q1,04 | PCWP |            |
|* 38 |          INDEX UNIQUE SCAN                           | PK_FIRM                   |     1 |       |     0   (0)|          |       |       |  Q1,04 | PCWP |            |
|  39 |         TABLE ACCESS BY INDEX ROWID                  | FIRM                      |     1 |    12 |     0   (0)|          |       |       |  Q1,04 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Note particularly that operations 15 and 16 tell us that we’ve forced the optimizer into treating the inline view as a separate query block and we now have two aggregation steps, one inside the view, and another after joining FIRM (again) and LIT_FACT_BD (again) to the inline view.

Unfortunately the plan shown here is pulled from memory using dbms_xplan.display_cursor() after execution, so it includes the various parallel execution columns (TQ, IN-OUT, PQ Distrib), but doesn’t have the rowsource execution stats enabled so we can’t see the actual workload and data volume, but in the words of the OP: “adding no_merge hint did the trick and now the SQL is just executing fine”.
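For reference, a minimal sketch of how the rowsource execution statistics could have been captured, assuming a SQL*Plus session that is allowed to change statistics_level. The choice of 'allstats' rather than 'allstats last' is my assumption for a parallel statement, where 'last' tends to report only the query coordinator's most recent execution.

```sql
set serveroutput off

-- Option 1: enable rowsource execution statistics for the whole session
alter session set statistics_level = all;

-- Option 2 (alternative): leave the session alone and add the hint
--      /*+ gather_plan_statistics */
-- to the statement being tested.

-- Run the statement under test here, then report the plan
-- with Starts, A-Rows, A-Time and Buffers for each operation.
select  *
from    table(dbms_xplan.display_cursor(null, null, 'allstats'))
;
```

With nulls for the sql_id and child number the call reports the last statement executed by the session, which is why serveroutput has to be off.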

Summary

The steps for solving the performance problems of a specific SQL statement are very fluid. For a long-running or parallel statement the SQL Monitor report will automatically be created (though there are limits on the size of the plan that may disable the feature) and this is the easiest source of useful information, though you might also need to pull the execution plan from v$sql_plan to get details about parallel execution and partitioning at the same time.
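A sketch of pulling both reports from SQL*Plus; the substitution variable m_sql_id is a placeholder of mine for the sql_id of the statement being investigated.

```sql
set long 1000000 longchunksize 1000000 linesize 255 pagesize 0 trimspool on

-- Text version of the SQL Monitor report (needs the diagnostic and tuning packs)
select
        dbms_sqltune.report_sql_monitor(
                sql_id       => '&m_sql_id',
                type         => 'TEXT',
                report_level => 'ALL'
        ) as report
from
        dual
;

-- Execution plan from v$sql_plan, including the partitioning and parallel columns
select  *
from    table(dbms_xplan.display_cursor('&m_sql_id', null, 'typical'))
;
```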

If you’re not licensed for the diagnostic and performance packs, though, enabling SQL Trace to get the plan and waits gets you a lot of information, and querying (g)v$pq_tqstat immediately after running the query can fill in the parallel traffic details.
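A minimal sketch of that approach, assuming the session has execute privilege on dbms_monitor. The v$pq_tqstat query has to be run from the same session, straight after the parallel statement, because the view only holds data for the session's most recent parallel execution.

```sql
-- Enable extended SQL trace (wait events, no bind values) for the current session
execute dbms_monitor.session_trace_enable(waits => true, binds => false)

-- Run the statement under test here.

-- Rows and bytes passed through each table queue by each producer and consumer;
-- a heavily skewed num_rows across the processes of one tq_id shows the imbalance.
select
        dfo_number, tq_id, server_type, instance, process, num_rows, bytes
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;

execute dbms_monitor.session_trace_disable
```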

In this example the SQL Monitor report showed a highly skewed distribution, which might have been fixable by changing the PQ distribution, or even by simply changing the degree of parallelism.

Further examination of the report showed that the query generated an enormous rowsource which it then crunched down to a small result set. Comparing the 2018 and 2019 plans (which were not quite identical but were sufficiently similar to justify comparison) the same skew and explosion of rowsource were visible in both, though the data size involved in the 2018 plan made it seem that the plan was a “good” one which really it wasn’t.

The obvious target for tuning was to see if the explosion in volume could be reduced or eliminated by writing the query with some intermediate non-mergeable view(s), and it turned out that the query had been written with that intent in its original form but without a hint to block complex view merging. After adding the hint the performance was acceptable.

May 5, 2022

Lag/Lead slow

Filed under: Bugs,Oracle,Performance,Problem Solving,Troubleshooting,Tuning — Jonathan Lewis @ 10:05 am BST May 5,2022

This note is about a surprising performance difference between the lead() and lag() analytic functions (which turns out to be due to the behaviour of the nth_value() function) when the option to “ignore nulls” is included in their use. The detail I’ll be writing about was highlighted in a thread on the Oracle developer forum about a requirement to add a number of analytic columns to a dataset of 156 million rows using a statement of the following shape:

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

You’ll notice that I’ve introduced a row_number(), and both a lead() and a lag() of a column called ed_ucc_dt. All three analytic columns use the same partitioning and ordering, though, so Oracle will only be doing one “window sort” in the execution plan. Part of the performance problem, of course, was that with 156M rows of a couple of dozen existing columns and adding a dozen new columns, the workload due to sorting was bound to be very large, so there were various suggestions of how to minimise that part of the workload.

However, Solomon Yakobson pointed out that the code was using the “ignore nulls” option and there was a bug in 11g that made lead() and lag() very slow when this option was used. He subsequently reported that this defect was still present in 19c, narrowed it down to just the lag() function, and cited a MOS document ID referencing the problem: LAG Function is slow when using ignore nulls (Doc ID 2811596.1). The implication of the MOS note is that we shouldn’t expect this to change.

A follow-up posting by User_H3J7U gave us a reason for the slowness of the lag() function by running a sample query through dbms_utility.expand_sql_text(). Oracle rewrites the query to use variants of the nth_value() function when you use “ignore nulls”, but rewrites it to use variants of first_value() when you aren’t using the “ignore nulls” option. This isn’t a complete explanation of why lag() should be slow while lead() is not – but it’s a significant pointer towards a possible implementation issue and is a good clue about working around the problem. So let’s build a model of the situation.
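For reference, a minimal sketch of how such an expansion can be produced with the packaged procedure dbms_utility.expand_sql_text (12c onwards). The sample statement is my own cut-down version of the forum query, and the table tb_source has to exist before the block is run, so this would be executed after building the model described next.

```sql
set serveroutput on

declare
        m_expanded      clob;
begin
        -- Ask Oracle for the internally expanded form of the query text
        dbms_utility.expand_sql_text(
                input_sql_text  => q'[
                        select
                                lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) prev_ed_ucc_dt,
                                lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) next_ed_ucc_dt
                        from    tb_source
                ]',
                output_sql_text => m_expanded
        );

        -- The expanded text for a query this small fits comfortably in one put_line call
        dbms_output.put_line(m_expanded);
end;
/
```

Removing the two “ignore nulls” options and rerunning the block gives the second form of the rewrite for comparison.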

The basic model

rem
rem     Script:         lag_ignore_nulls.sql
rem     Author:         Jonathan Lewis / Sam P
rem     Dated:          May 2022
rem     Purpose:     
rem
rem     Last tested
rem             19.11.0.0
rem

create table tb_source (
        pat_id,
        visit_date_dt,
        ed_ucc_dt
)
as
with generator as (
        select rownum id
        from    dual
        connect by
                level <= 1e4    --> comment to avoid wordpress format issue
)
select
        g1.id,
        to_date('01-Apr-2022') + dbms_random.value(0,100),
        to_date('01-Apr-2022') + dbms_random.value(5,105)
--      to_date(null)
from
        generator g1,
        generator g2
where
        g2.id <= 20     --> comment to avoid wordpress format issue
order by
        dbms_random.value
/


spool lag_ignore_nulls.lst

set serveroutput off
set timing on

prompt  ======================
prompt  Without "ignore nulls"
prompt  (My time 0.61 seconds)
prompt  ======================

create table tb_target_no_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

prompt  ======================
prompt  With "ignore nulls"
prompt  (My time 0.88 seconds)
prompt  ======================

create table tb_target_ignore
as
select
        pat_id,
        visit_date_dt,
        ed_ucc_dt,
        lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
        lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt,
        row_number() over (partition by pat_id order by visit_date_dt) as row_num
from
        tb_source
;

I’ve created a source table with 200,000 rows, consisting of 10,000 pat_id values, and 20 rows per pat_id. The 20 rows for a pat_id (probably) each have a different visit_date_dt and a different ed_ucc_dt.
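The shape of the data can be confirmed with a quick check along the following lines. This is only a sketch using the table and column names from the script above; the figures in the comments follow from the generator arithmetic (10,000 ids, each joined to 20 rows of the second copy) rather than from a captured run.

```sql
select
        count(*)                                rows_in_table,          -- 1e4 * 20 = 200,000
        count(distinct pat_id)                  pat_ids,                -- 10,000
        count(*) / count(distinct pat_id)       rows_per_pat_id,        -- 20
        count(ed_ucc_dt)                        non_null_ed_ucc_dt      -- 200,000 unless the to_date(null) variant is used
from
        tb_source
;
```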

After creating the data set I’ve created two more tables using the lead() and lag() functions to generate a previous (lag) and next (lead) ed_ucc_dt for each row, partitioning by pat_id, ordering by visit_date_dt. One statement includes the “ignore nulls” option, the other doesn’t and, as you can see, the time to create the “no ignore” table was 0.61 seconds while the time to create the “ignore nulls” table was 0.88 seconds.

The variation isn’t dramatic – but this is just 200,000 rows, in memory, with only a few columns and only two columns added through lead and lag.

After the baseline test I tweaked the statement that created the table with the “ignore nulls” option to get three more times.

  • With neither lead() nor lag() the time was 0.29 seconds
  • With just the lead() column the time was 0.46 seconds – an increase of 0.17 seconds
  • With just the lag() column the time was 0.71 seconds – an increase of 0.42 seconds

You might note that 0.29 + 0.17 + 0.42 = 0.88 (the time I got for adding both columns) – it’s a little lucky that it looks like a perfect match, but even matching within a couple of hundredths of a second would have been a nice detail. It certainly seems that lag() – with my test data – consumes more resources than lead() for a pair of operations that look as if they should produce the same workload.

Internal Rewrite

The next step was to check what the internal rewrite of the code looked like, so I passed the select part of the statements (the procedure won’t accept “create as select”) through dbms_utility.expand_sql_text() and reformatted the results. Here are the two rewritten statements – first without “ignore nulls”:

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 preceding and 1 preceding
                        ),
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 preceding and 1 preceding
                        ),
                        null
        ) prev_ed_ucc_dt,
        decode(
                count(*) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and 1 following
                        ),
                 1,     first_value(a1.ed_ucc_dt) over (
                                partition by a1.pat_id order by a1.visit_date_dt
                                rows between 1 following and 1 following
                        ),
                        null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1


The code looks a little long and messy, but that’s mainly because just about everything it does happens twice. The lag() function (prev column) turns into a first_value() function that looks at the row preceding the current row in the partition (rows between 1 preceding and 1 preceding). However it first has to count over the same clause to see if a row exists, and then either report its value or report a null – hence the structure decode(count(), 1, first_value(), null)

Note: the full lag() function call is: “lag(expression, offset, default)” where the offset (number of rows to lag) defaults to 1 and the default is the value you want reported when there is no matching row, and defaults to null.
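As a small illustration of the three parameters, here is a sketch against three rows of inline data (the CTE t and its columns id and v are made up for the example and are not part of the test script):

```sql
with t as (
        select 1 id, 10 v       from dual union all
        select 2,    20         from dual union all
        select 3,    30         from dual
)
select
        id, v,
        lag(v)        over (order by id)        lag_defaults,   -- offset 1, default null
        lag(v, 2)     over (order by id)        lag_offset_2,   -- look two rows back
        lag(v, 1, -1) over (order by id)        lag_dflt_m1     -- report -1 when there is no row to look at
from
        t
order by
        id
;
-- id = 1: lag_defaults null, lag_offset_2 null, lag_dflt_m1 -1
-- id = 2: lag_defaults 10,   lag_offset_2 null, lag_dflt_m1 10
-- id = 3: lag_defaults 20,   lag_offset_2 10,   lag_dflt_m1 20
```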

The call to lead() basically does the same thing, but uses (rows between 1 following and 1 following) to access the next row in the partition.

On the other hand this is the SQL that Oracle generates when we include the “ignore nulls” clause (which means Oracle can’t restrict the row range to just one preceding or following row):

select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                nth_value(a1.ed_ucc_dt, 1)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1

Both lag() and lead() turn into nth_value() with a second parameter of 1 (i.e. nth == 1st … which makes you wonder why Oracle isn’t using first_value()), and we can also see the “ignore nulls” still being used.

The lag() call now uses the range (rows between unbounded preceding and 1 preceding) i.e. everything from the start of partition to the previous row, while the lead() call uses the range (rows between 1 following and unbounded following) i.e. from the next row to the end of partition.

The other important detail to note is that the translation of the lag() call also includes the clause “from last” – in other words we want the first row when reading the partition in reverse order, and that might have something to do with the extra time it takes to operate the (translated) lag() function.
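To see what the translated lag() is doing when there really are nulls to ignore, here is a sketch on four rows of inline data (the CTE t and its columns are invented for the illustration). The second column expression is written in the same form as the nth_value() text that Oracle generated above:

```sql
with t as (
        select 1 id, 10 v               from dual union all
        select 2,    to_number(null)    from dual union all
        select 3,    30                 from dual union all
        select 4,    to_number(null)    from dual
)
select
        id, v,
        lag(v ignore nulls, 1) over (order by id)       lag_ignore,
        nth_value(v, 1) from last ignore nulls over (
                order by id
                rows between unbounded preceding and 1 preceding
        )                                               nth_from_last
from
        t
order by
        id
;
-- id = 1: both null (no preceding rows)
-- id = 2: both 10
-- id = 3: both 10   (the null at id = 2 is skipped)
-- id = 4: both 30
```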

Workaround

Oracle is using a generic nth_value() to translate a generic lead()/lag(), but we’re in the special case where we know n = 1, which means we (and Oracle) could use first_value()/last_value(). It’s perfectly reasonable for Oracle’s internal code to avoid special cases if it makes no difference to performance, of course, but maybe in this case we could imitate Oracle’s rewrite to get some benefit.

  • Step 1 – change nth_value() to the appropriate first/last.
  • Step 2 – get rid of the “from last” which won’t be needed with last_value()
  • Step 3 – move the “ignore nulls” to the spot that Oracle wants to see it with first/last

Here’s the resulting SQL – I’ve left the nvl(expression, null) in place, but if you wanted a null as the default return value for the original lead()/lag() calls you could simplify the code a little further.

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
        nvl(
                last_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt,
        nvl(
                first_value(a1.ed_ucc_dt ignore nulls) over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between 1 following and unbounded following
                ),
                null
        ) next_ed_ucc_dt,
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

  • Run-time: 0.61 seconds.
  • Run-time with just last_value() / preceding: 0.47 seconds
  • Run time with just first_value() / following: 0.43 seconds

There still seems to be a little discrepancy between accessing the preceding data and accessing the following data, but there’s a much better balance than before.

One more edit – taking out the nvl() construct because the original lead()/lag() calls didn’t have a non-null default supplied:

create table tb_target 
as
select
        a1.pat_id pat_id,
        a1.visit_date_dt visit_date_dt,
        a1.ed_ucc_dt ed_ucc_dt,
--
        last_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between unbounded preceding and 1 preceding
        ) prev_ed_ucc_dt,
--
        first_value(a1.ed_ucc_dt ignore nulls) over (
                partition by a1.pat_id order by a1.visit_date_dt
                rows between 1 following and unbounded following
        ) next_ed_ucc_dt,
--
        row_number() over (partition by a1.pat_id order by a1.visit_date_dt) row_num
from
        test_user.tb_source a1
/

With this simplification the time dropped by a further couple of hundredths of a second hovering between 0.57 and 0.58 seconds.

There was one last detail about the test code that should be mentioned – do the changes in code still produce the same results? As a quick and dirty check I ran the following query after each variant of creating the tb_target table (tb_target_ignore is the table created using the original lead()/lag() code with “ignore nulls”):

select  *
from   (
        select * from tb_target minus select * from tb_target_ignore
        union all
        select * from tb_target_ignore minus select * from tb_target
        )
/

The result was always zero rows.
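One reason for calling the check “quick and dirty” is that minus is a set operator and eliminates duplicates, so two tables holding the same distinct rows in different numbers would still produce zero rows. A complementary row-count comparison is a cheap extra safeguard; the following is only a sketch using the two table names from the test:

```sql
select
        (select count(*) from tb_target)                target_rows,
        (select count(*) from tb_target_ignore)         ignore_rows
from
        dual
;
```

If the two counts match and the minus query returns no rows, the remaining (small) risk is that duplicate rows appear with different multiplicities in the two tables.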

Hypothesis

I raised the idea that the difference in timing for the lead() and lag() functions might have something to do with the volume of data that Oracle could be processing to find the one row it needed.

My thinking was that the lead() rewrite – the call to nth_value(ed_ucc_dt,1) – would simply be looking at the next row in the partition (if it existed) because my data has no nulls that need to be ignored, while the rewrite of the lag() function with its “from last” requirement could be making Oracle re-read the entire preceding section of the partition before starting to process it backwards.

As a possible check to see if this was a viable hypothesis I ran one more test – visible in the initial declaration of tb_source – I created the data with ed_ucc_dt set to null in every single row, so that Oracle would be forced to process from the current position to whichever end of the partition was relevant regardless of whether it was calling lead() or lag().

With this change in place the timings for the lead()-only and lag()-only statements were nearly identical – which is weak support for the hypothesis.

And once I’d done that test the next obvious test was to see what happened if I increased the size of each partition (using non-null values for ed_ucc_dt) to see if larger partitions would increase the difference between the forward and backward tests. To do this I changed the script to create the tb_source table to produce 5,000 pat_id values with 40 rows per pat_id by changing the where clause to:

where
        g2.id <= 40     --> comment to avoid wordpress format issue
and     g1.id <= 5e3    --> comment to avoid wordpress format issue

With this change in place the timings for the original form of the lead()/lag() statement were:

  • With both lead() and lag() in place the time was 1.05 seconds
  • With neither lead() nor lag() the time was 0.25 seconds
  • With just the lead() column the time was 0.41 seconds – an increase of 0.16 seconds
  • With just the lag() column the time was 0.98 seconds – an increase of 0.73 seconds

So the lag() time (ballpark figures) nearly doubles as the partition size doubles but the lead() time stays pretty much the same.

The results of these two tests do tend to suggest that the generic nth_value() implementation can do some short-circuiting when working “forwards”, using a mechanism that isn’t available when the “from last” clause requires it to work “backwards”.

Writing the previous paragraph prompted me to do one last test – it wouldn’t produce the same results, of course, but I ought to check the performance when I moved the “from last” clause out of the “prev”/lag() column expression into the “next”/lead() column expression in Oracle’s original translation to confirm that the problem was associated with the “from last” and not with the choice of “preceding” or “following” in the row range section of the over() clause. (It was the “from last” that made the difference.)
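The hypothesis can be turned into a back-of-envelope row count. The sketch below is purely a model, not a measurement: it assumes that with no nulls in the data the forward case looks at one following row for each row that has one, while the “from last” case has to visit every preceding row in the partition. The column names are invented for the illustration.

```sql
select
        n_partitions,
        rows_per_partition,
        n_partitions * (rows_per_partition - 1)                                 forward_visits,
        n_partitions * rows_per_partition * (rows_per_partition - 1) / 2        backward_visits
from    (
        select 10000 n_partitions, 20 rows_per_partition from dual
        union all
        select  5000,              40                    from dual
        )
order by
        rows_per_partition
;
-- 10,000 partitions x 20 rows: forward 190,000   backward 1,900,000
--  5,000 partitions x 40 rows: forward 195,000   backward 3,900,000
```

Under those assumptions the forward workload barely changes while the backward workload roughly doubles, which is at least consistent with the pattern in the lead() and lag() timings above.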

tl;dr

If you’re using the lag() or lead() functions with “ignore nulls” on a very large dataset you may find that you can rewrite the code with first_value() or last_value() calls that use less CPU. The benefit is probably only significant on fairly large data sets, and may be particularly noticeable for cases where the same over() clause is used many times and the partition sizes are more than a couple of dozen rows each.

The potential for excess CPU usage comes from the effect of a generic internal rewrite using the nth_value() function with the “from last” clause even for the special case where your lead()/lag() have an offset (i.e. n) of 1 which would allow for the use of first_value()/last_value().

To get the correct rewrite you can use dbms_utility.expand_sql_text() to generate a suitable statement from which you can extract and edit the relevant pieces of text.
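For reference, the full name of the procedure is dbms_utility.expand_sql_text, and a minimal way of calling it from SQL*Plus might look like the following sketch. It assumes the tb_source table from the test script exists in the current schema; the output comes back as a single unformatted string with quoted, fully qualified names, so it needs tidying before it looks like the listings shown earlier.

```sql
set serveroutput on

declare
        v_expanded      clob;
begin
        dbms_utility.expand_sql_text(
                input_sql_text  => q'[
                        select
                                pat_id, visit_date_dt, ed_ucc_dt,
                                lag (ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as prev_ed_ucc_dt,
                                lead(ed_ucc_dt ignore nulls, 1) over (partition by pat_id order by visit_date_dt) as next_ed_ucc_dt
                        from    tb_source
                ]',
                output_sql_text => v_expanded
        );
        -- implicit clob to varchar2 conversion: fine for a short statement like this one
        dbms_output.put_line(v_expanded);
end;
/
```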

Footnote

A further point raised by Solomon Yakobson was that in the generic case (where you can’t use first_value() / last_value() to rewrite the SQL because you want to use an offset greater than one) it’s worth noting that the following two expressions are effectively identical:

        nvl(
                nth_value(a1.ed_ucc_dt, 2) from last ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt
                        rows between unbounded preceding and 1 preceding
                ),
                 null
        ) prev_ed_ucc_dt

        nvl(
                nth_value(a1.ed_ucc_dt, 2)           ignore nulls over (
                        partition by a1.pat_id order by a1.visit_date_dt desc
                        rows between 1 following and unbounded following
                ),
                 null
        ) prev_ed_ucc_dt
/

Note particularly that to avoid the issue of scanning the data “from last”, I’ve changed the ordering in the second expression to descending, and I’ve changed the “preceding” of the first expression’s row range to “following” in the second while reversing the positions of 1 and unbounded. (The same strategy can be applied for the lag() function in the original code).
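The equivalence is easy to check on a handful of rows. The sketch below uses inline data (the CTE t and its columns are invented for the illustration) with a unique ordering column; if the order by column had ties the two forms might not see the tied rows in mirror-image order, so uniqueness is an assumption of the example.

```sql
with t as (
        select 1 id, 10 v               from dual union all
        select 2,    to_number(null)    from dual union all
        select 3,    30                 from dual union all
        select 4,    40                 from dual union all
        select 5,    to_number(null)    from dual
)
select
        id, v,
        lag(v ignore nulls, 2) over (order by id)       lag_2,
        nth_value(v, 2) from last ignore nulls over (
                order by id
                rows between unbounded preceding and 1 preceding
        )                                               from_last_2,
        nth_value(v, 2) ignore nulls over (
                order by id desc
                rows between 1 following and unbounded following
        )                                               reversed_2
from
        t
order by
        id
;
-- id = 1, 2, 3: all three columns null (fewer than two non-null values precede the row)
-- id = 4:       all three columns 10   (non-null values looking backwards: 30, 10)
-- id = 5:       all three columns 30   (non-null values looking backwards: 40, 30, 10)
```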

In general it’s likely that lead() will be more efficient than lag() so this change could make a big difference in performance. It’s worth noting, however, that if you’ve been using both lead() and lag() as my example did then your execution plan will (truthfully) show two “window sort” operations and the second (i.e. lower number in the plan) window sort will also be sorting the columns added by the first window sort. Depending on the nature of your data the additional sort might increase the workload by more than the benefit you get from eliminating a lag().

In an example created by Solomon Yakobson the effect of using lag( ,5) was catastrophic and the change that introduced the extra sort to get a lead( ,5) made a huge difference (40 seconds down to 0.15 seconds – and I haven’t made a mistake in the decimal point there); in my original test case the improvement I got from applying the same strategy was small (0.88 seconds down to 0.86 seconds). The critical difference between the test cases was that one had a single partition of 20,000 rows, the other had (10,000) partitions of only 20 rows.

(Footnote to the footnote – it did cross my mind that if you had a performance catastrophe with the simple lag(,n) and with the extra sort from reversing the order to use a lead(,n) would Oracle do something clever if you found a way to rewrite the query with a cascade of inline-views that all used last_value() – but I decided I wasn’t going to look at that option until someone paid me to. More seriously I did wonder about the option for using match_recognize to bypass the lag() problem, and may get around to thinking about that more carefully at some point.)

January 19, 2022

Hash Aggregation – 2

Filed under: Execution plans,Infrastructure,Oracle,Performance,Tuning — Jonathan Lewis @ 12:03 pm GMT Jan 19,2022

In the note I wrote a couple of days ago about the way the optimizer switches from hash group by to sort group by if you add an order by X,Y clause to a query that has a group by X,Y I mentioned that this had been something I’d noticed about 15 years ago (shortly after Oracle introduced hash aggregation, in fact) but it was only the second of two details I’d noted when experimenting with this new operation. The first thing I’d noticed came from an even simpler example, and here’s a (cryptic) clue to what it was:


column operation_type format a24

select
        operation_type, count(*)
from
        V$sql_workarea
group by
        operation_type
order by
        operation_type
;

OPERATION_TYPE             COUNT(*)
------------------------ ----------
APPROXIMATE NDV                   1
BUFFER                          130
CONNECT-BY (SORT)                10
GROUP BY (HASH)                  12
GROUP BY (SORT)                 240
HASH-JOIN                       466
IDX MAINTENANCE (SORT)           39
LOAD WRITE BUFFERS               10
RANGE TQ COMPUTATION             13
RESULT CACHE                      4
SORT (v1)                        10
SORT (v2)                       147
WINDOW (SORT)                    35

The clue isn’t in the query, it’s in what’s missing from the result set, so here’s some code to create and query some data to make it easier to spot the anomaly:

rem
rem     Script:         hash_agg.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sept 2007
rem

create table t1
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        lpad(mod(rownum-1,1000),6)      small_vc_K,
        lpad(rownum-1,6)                small_vc_M
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6 -- > comment to avoid wordpress format issue
;

set serveroutput off

prompt  ===========================
prompt  1000 distinct values (hash)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ===========================
prompt  1000 distinct values (sort)
prompt  ===========================

select
        /*+ gather_plan_statistics 1000 */
        count(*)
from
        (
        select  /*+ no_merge no_use_hash_aggregation */
                distinct small_vc_K
        from
                t1
        )
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

I’ve added the /*+ gather_plan_statistics */ hint to the query so that I can check on the rowsource execution stats and (particularly) the memory and/or temporary space used; and I’ve blocked hash aggregation in one of the two queries, so I expect to see a “hash unique” operation in the first query and a “sort unique” operation in the second. Here’s what I get from 19.11.0.0:


===========================
1000 distinct values (hash)
===========================

  COUNT(*)
----------
      1000

SQL_ID  1baadqgv02r6b, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge */   distinct small_vc_K  from   t1  )

Plan hash value: 171694178

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |       |       |          |
|   3 |    HASH UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.07 |    2637 |   2632 |  1889K|  1889K| 1417K (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.04 |    2637 |   2632 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------

17 rows selected.

===========================
1000 distinct values (sort)
===========================

  COUNT(*)
----------
      1000

SQL_ID  a66rqhgw7a6dk, child number 0
-------------------------------------
select  /*+ gather_plan_statistics 1000 */  count(*) from  (  select
/*+ no_merge no_use_hash_aggregation */   distinct small_vc_K  from
t1  )

Plan hash value: 1750119335

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |      1 |        |   765 (100)|      1 |00:00:00.22 |    2637 |       |       |          |
|   1 |  SORT AGGREGATE      |      |      1 |      1 |            |      1 |00:00:00.22 |    2637 |       |       |          |
|   2 |   VIEW               |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 |       |       |          |
|   3 |    SORT UNIQUE       |      |      1 |   1000 |   765  (56)|   1000 |00:00:00.22 |    2637 | 48128 | 48128 |43008  (0)|
|   4 |     TABLE ACCESS FULL| T1   |      1 |   1000K|   405  (17)|   1000K|00:00:00.02 |    2637 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------

Look back at the summary of v$sql_workarea. Can you now spot something that might be missing? Since we’ve now got two execution plans with their sql_ids, let’s run a much more precise query against the view.

select
        sql_id, child_number, operation_id, operation_type ,
        total_executions, last_memory_used/1024 last_mem_kb
from
        v$sql_workarea
where
        sql_id in (
                '1baadqgv02r6b',
                'a66rqhgw7a6dk'
)
order by
        sql_id, child_number, operation_id
/

SQL_ID        CHILD_NUMBER OPERATION_ID OPERATION_TYPE       TOTAL_EXECUTIONS LAST_MEM_KB
------------- ------------ ------------ -------------------- ---------------- -----------
1baadqgv02r6b            0            3 GROUP BY (HASH)                     1       1,417
a66rqhgw7a6dk            0            3 GROUP BY (SORT)                     1          42

The first thing to note, of course, is that the sort aggregate at operation 1 doesn’t report a workarea at all; but this shouldn’t be a surprise, the operation is simply counting rows as they arrive, there’s no real sorting going on.

Operation 3 in both cases is the more interesting one. In the plan it’s reported as a “hash/sort unique” but the workarea operation has changed this to a “group by (hash/sort)”.  It’s a little detail that probably won’t matter to most people most of the time – but it’s the sort of thing that can waste a couple of hours of time when you’re trying to analyze a performance oddity.
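To see the two names side by side you can join v$sql_plan to v$sql_workarea on the plan line id. This is a sketch that uses the two sql_ids reported above; on another system the sql_id values will be different, and the session needs access to both views.

```sql
select
        p.sql_id, p.child_number, p.id,
        p.operation || ' ' || p.options         plan_operation,
        w.operation_type                        workarea_operation,
        round(w.last_memory_used/1024)          last_mem_kb
from
        v$sql_plan      p,
        v$sql_workarea  w
where
        p.sql_id in ('1baadqgv02r6b', 'a66rqhgw7a6dk')
and     w.sql_id        = p.sql_id
and     w.child_number  = p.child_number
and     w.operation_id  = p.id
order by
        p.sql_id, p.child_number, p.id
;
```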

And another thing …

You’ll have noticed, by the way, that the hash unique operation actually demanded far more memory than the sort unique operation; 1.4MB compared to 42KB. In a large production system this may be fairly irrelevant, especially since the discrepancy tends to disappear for higher volumes, and hardly matters if the operation spills to disc. However, if you have a large number of processes doing a lot of small queries using hash aggregation you may find that cumulatively they use up an unexpected fraction of your pga_aggregate_target.

There is some scope (with the usual “confirm with Oracle support” caveat) for modifying this behaviour with a fix_control:

16792882    QKSFM_COMPILATION_16792882  Cardinality threshold for hash aggregation(raised to power of 10)

alter session set "_fix_control"='16792882:3';

The default value is 0, the maximum value seems to be 3 and the description about raising to power of 10 seems to be the wrong way round, but I haven’t done any careful testing. When I set the value to 3 the hash unique with an estimated output of 1,000 switched to a sort unique (hence my suggestion about 10^N rather than N^10).
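If you want to confirm what the session is currently using before and after the alter session, the setting is visible in v$session_fix_control. A sketch, assuming the session has been granted access to the view:

```sql
select
        bugno, value, is_default, sql_feature, description
from
        v$session_fix_control
where
        session_id = to_number(sys_context('userenv', 'sid'))
and     bugno      = 16792882
;
```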

Footnote

In the previous note I pointed out that Oracle would use a single sort group by operation rather than a consecutive hash group by / sort order by pair of operations. This simpler example helps to explain why. If you check the CBO trace files for the two you can check the costs of the inline aggregate view.

From a 19.11.0.0 trace file here are two costing fragments for the hash group by strategy; the first bit is the cost of the tablescan that acquires the rows (operation 4) the second is the final cost of the inline view / query block:


  Best:: AccessPath: TableScan
         Cost: 404.639881  Degree: 1  Resp: 404.639881  Card: 1000000.000000  Bytes: 0.000000

Final cost for query block SEL$2 (#2) - All Rows Plan:
  Best join order: 1
  Cost: 764.843155  Degree: 1  Card: 1000000.000000  Bytes: 7000000.000000
  Resc: 764.843155  Resc_io: 337.000000  Resc_cpu: 1069607888
  Resp: 764.843155  Resp_io: 337.000000  Resc_cpu: 1069607888

I won’t show you the equivalent extracts for the example where I blocked hash aggregation because the numbers are identical. So there’s no way that Oracle will want to do hash group by followed by sort order by, when it can do just a single sort group by that costs exactly the same as the hash group by operation alone. This is a shame, and a little ironic because when Oracle Corp introduced hash aggregation they made a bit of a fuss about how much more efficient it was than sorting – but it looks like no-one told the optimizer about this.
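For anyone who wants to repeat the comparison, the usual way to get a CBO trace is the 10053 event. The following is a sketch rather than the script behind the extracts above; the tracefile identifier and the extra comment in the query are arbitrary choices made for the example. The statement has to be optimized (hard parsed) while the event is set, which is why the text is changed slightly to avoid reusing an existing cursor.

```sql
alter session set tracefile_identifier = 'hash_agg';
alter session set events '10053 trace name context forever, level 1';

select  /*+ gather_plan_statistics */ /* 10053 test */
        count(*)
from    (
        select  /*+ no_merge */
                distinct small_vc_K
        from
                t1
        )
;

alter session set events '10053 trace name context off';

select value from v$diag_info where name = 'Default Trace File';
```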

January 17, 2022

Hash Aggregation – 1

Filed under: CBO,Hints,Oracle,Performance,Tuning — Jonathan Lewis @ 1:15 pm GMT Jan 17,2022

Here’s an observation I made some time in 2007, but didn’t mention online until a (possibly) relevant question appeared on the Oracle database forum in 2017; and the topic reappeared in a nearly unrelated question a little while ago. The 2017 question was this:

I am running following SQL, which it taking almost 4-5 seconds and returning only 18 records.

SELECT
   SUM(TOTAL_AMOUNT) C1,
   a.LEVEL2_ENAME AS c2,
   c.CURR_YEARMO AS c3
FROM TERRITORY_PRESET_MSNP a,
   CV_RESTATED_MSNP b
   LEFT OUTER JOIN
   MONTH_D c
   ON b.YEARMO = c.CURR_YEARMO,
   PRODUCT_OFFERING d
WHERE   b.PO_ID = d.ROW_ID
    AND b.DATASOURCE_ID = 10
    AND b.YEARMO = 201704
    AND b.OWNER_TERR_ID = a.TERR_ID
    AND c.CURR_YEARMO = 201704
    AND a.YEARMO = 201706
GROUP BY c.CURR_YEARMO, a.LEVEL2_ENAME
ORDER BY C3, C2;

If I remove the ORDER BY clause it is returning results in 1 second.

Before saying anything else, I’ll just make a couple of points about the SQL:

  • It’s not a good idea to mix traditional Oracle syntax with “ANSI” syntax – it’s likely to make things harder for the next person to read the code and there’s just a slight possibility that the rewrite that Oracle applies to hide the ANSI syntax may block some of the possible execution paths.
  • The C3, C2 in the order by clause are the column aliases for the curr_yearmo, level2_ename columns used in the group by clause. Although Oracle allows you to use aliases in the order by (but not in the group by) doing so can only make the SQL a little harder to interpret (especially in a case like this when you have both clauses).
  • There’s a left outer join to month_d (aliased as c), but the where clause then includes the predicate c.CURR_YEARMO = 201704 which will eliminate any rows where curr_yearmo is null, thus converting (we hope – but the mix and match syntax might introduce a side-effect) the outer join to an inner join – so maybe that’s a design error in the SQL.
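Putting those three comments together, one way the statement might be tidied up is sketched below. It assumes that the outer join really was meant to be an inner join (which is the effect the predicate on c.curr_yearmo has anyway), and that total_amount is unambiguous as in the original; it is a restatement for readability, not a fix for the performance problem.

```sql
select
        sum(total_amount)       c1,
        a.level2_ename          c2,
        c.curr_yearmo           c3
from
        cv_restated_msnp        b
join    territory_preset_msnp   a  on  a.terr_id     = b.owner_terr_id
join    month_d                 c  on  c.curr_yearmo = b.yearmo
join    product_offering        d  on  d.row_id      = b.po_id
where
        b.datasource_id = 10
and     b.yearmo        = 201704
and     a.yearmo        = 201706
and     c.curr_yearmo   = 201704        -- implied by the join and b.yearmo, kept to match the original
group by
        c.curr_yearmo, a.level2_ename
order by
        c.curr_yearmo, a.level2_ename
;
```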

Addressing the question, though, the first thought (rapidly eliminated) is that perhaps this is the standard complaint of the GUI interface: “it’s fast until I add an order by clause”.

The commonest reason for this complaint is that the typical GUI interface shows you the first few rows and waits for you to page down, so your impression of the response time is “the time to see the first few rows” rather than “the time to get the complete result set” when it might take much more time to return the entire result set. When you add an order by clause it’s possible that Oracle will have to sort the entire result set before you see any of it. It’s often the difference between “first few rows” and “entire result set” that triggers the complaint.

In this case the “rapid elimination” of this thought is due to the OP saying the result set was only 18 rows, which is likely to produce the gut feeling that it shouldn’t take very long for Oracle to sort 18 rows if it had to find all of them before displaying them. On the other hand the thought might need a little follow-up, after all:

  • it’s possible that the GUI is only displaying 15 rows at a time and it takes a lot of time to find the extra 3 rows. Just think of a tablescan with a filter subquery when the rows you want are the first few in the table. Without an order by the rows can be displayed as they are found, with an order by Oracle will have to get to the end of the tablescan before the rows can be sorted and displayed.
  • the optimizer can produce terrible estimates and the order by clause might prompt it to say “if I start with a different table, driving through a specific index, and changing the join order then I won’t have to do any sorting for the order by clause”. The resulting path may be a very bad idea if the arithmetic produces the wrong results.

The OP hasn’t shown us the execution plan – and that’s what we really need to see; but there is an alternative  guess that we could make about what the optimizer is doing that would affect the performance this much.

The query is an aggregate query – we have a group by. Since 10g Oracle has been able to use “hash aggregation” – which shows up as the HASH GROUP BY operation in an execution plan. Here’s a little demo script, with a couple of sample queries:

rem
rem     Script:         sort_hash.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2012
rem 

create table t1
as
with generator as (
        select
                rownum id 
        from dual 
        connect by 
                level <= 1e4  -- > comment to avoid wordpress format issue
)
select
        trunc(dbms_random.value(0,262144))      n_256K,
        trunc(dbms_random.value(0,131072))      n_128K,
        trunc(dbms_random.value(0,8192))        n_8k
from
        generator       v1,
        generator       v2
where
        rownum <= 8 * 1048576  -- > comment to avoid wordpress format issue
;

set arraysize 1000
set timing on
set autotrace traceonly 

prompt  ===========
prompt  No Order by
prompt  ===========

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
;


prompt  =============
prompt  With Order by
prompt  =============

select
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by
        1
;


My table has 8M rows, and my queries target the column with 8K distinct values. I’ve enabled autotrace from SQL*Plus, set a large arraysize (to reduce time lost to SQL*Net round-trips), and set timing on so we can get an elapsed time for total execution. I’ve set autotrace to “traceonly” so that the SQL*Plus client will fetch the data but won’t waste resources formatting it, but I’m not actually interested in the handful of execution statistics that will be reported.

Here are the two sets of results from a test run on 19.11.0.0. Note, particularly, the figure for Elapsed:


===========
No Order by
===========

8192 rows selected.

Elapsed: 00:00:00.58

Execution Plan
----------------------------------------------------------
Plan hash value: 136660032

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

=============
With Order by
=============

8192 rows selected.

Elapsed: 00:00:03.03

Execution Plan
----------------------------------------------------------
Plan hash value: 3946799371

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
---------------------------------------------------------------------------

The time has jumped from slightly under 0.6 seconds to just over 3 seconds as the critical operation changes from a HASH GROUP BY to a SORT GROUP BY (even though the estimated cost, hence predicted run-time, of execution has not changed).

Your first thought at this point is probably along the lines of “surely it doesn’t take 2.4 seconds to sort 8,192 small rows, why doesn’t Oracle do a hash group by followed by a sort order by?” The answer seems to be “it just doesn’t”. So here’s one way to make it happen (with execution plan and elapsed time from 19.11.0.0 again):

select
        dist_8k, ct
from
        (
        select  /*+ no_merge */
                n_8K dist_8k, count(*) ct
        from
                t1
        group by
                n_8k
        )
order by 
        dist_8k
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 1705136228

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY       |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   2 |   VIEW               |      |  8192 |   208K|  6938  (61)| 00:00:01 |
|   3 |    HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
-----------------------------------------------------------------------------


Again the estimated cost of execution doesn’t (to the nearest whole number) change from the basic cost of the hash aggregation – but we have brought the time back down to just under 0.6 seconds.

It’s worth reminding you at this point that if you can re-engineer any SQL that’s performing badly and can see that the driving core of the query can be reduced to something much simpler and more efficient, then wrapping that core into an inline view with the /*+ no_merge */ hint (and possibly putting it up into a “with subquery” clause) might be the safest first step and most effective way of improving performance.

There is an option for avoiding the query rewrite here – hint the path you want to see:


select  /*+ use_hash_aggregation */
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by 
        1
;

8192 rows selected.

Elapsed: 00:00:00.59

Execution Plan
----------------------------------------------------------
Plan hash value: 2808104874

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   1 |  SORT ORDER BY      |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   2 |   HASH GROUP BY     |      |  8192 | 32768 |  6938  (61)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T1   |  8388K|    32M|  3460  (21)| 00:00:01 |
----------------------------------------------------------------------------

The nice thing about this, of course, is that you don’t actually have to edit the text; the hint could be attached to the query through an SQL Patch (or by abusing the SQL Profile or SQL Plan Baseline mechanisms).
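
As a rough sketch of the SQL Patch route: the block below assumes 12.2 or later, where dbms_sqldiag.create_sql_patch is a documented function (earlier versions needed an internal package). The patch name is invented for the example, and the sql_text has to correspond to the statement the application actually submits, so treat this as an outline to adapt rather than a tested recipe.

```sql
-- Sketch only: attach the hint to the unedited statement through a SQL Patch.
-- Assumes 12.2+; 'hash_agg_patch_demo' is a made-up name for illustration.
set serveroutput on

declare
        v_patch_name    varchar2(128);
begin
        v_patch_name := dbms_sqldiag.create_sql_patch(
                sql_text  => q'[select n_8K, count(*) ct from t1 group by n_8k order by 1]',
                hint_text => 'use_hash_aggregation(@sel$1)',
                name      => 'hash_agg_patch_demo'
        );
        dbms_output.put_line('Created patch: ' || v_patch_name);
end;
/

-- Remove the patch when it is no longer wanted
execute dbms_sqldiag.drop_sql_patch(name => 'hash_agg_patch_demo')
```

After creating a patch it is worth re-running the statement and checking the Note section of the plan, which should report that a SQL patch was used.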

The difficult part of hinting is finding the correct query block name for a more complex query. I simply added the hint /*+ use_hash_aggregation */ but the hint can be aimed at a query block so, in the absence of explicit query block names I could have used the hint /*+ use_hash_aggregation(@sel$1) */ using the default naming.
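
If you are allowed to edit the text, another way to avoid guessing at default names is to name the query block yourself. This is only a sketch; the name "main" is arbitrary, and if the optimizer transforms the query (e.g. by unnesting a subquery) the aggregation may end up in a query block with a generated name, in which case this hint would not reach it.

```sql
-- Name the query block explicitly, then aim the hint at that name
select  /*+ qb_name(main) use_hash_aggregation(@main) */
        n_8K, count(*) ct
from
        t1
group by
        n_8k
order by
        1
;
```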

In a more complex case you can find the appropriate query block name by using the ‘alias’ format option when generating the execution plan. Consider the following query (where t2 and t3 are created from view all_objects), with its initial execution plan:


explain plan for
select 
        t2.owner, count(*)
from 
        t2 
where 
        t2.object_id in (
                select  t3.object_id 
                from    t3 
                where   t3.object_type = 'TABLE'
        )
group by 
        t2.owner
order by 
        t2.owner
/


select * from table(dbms_xplan.display(format=>'alias'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 2646727453

------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  2 |   HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  3 |    TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   4 |    TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$5DA710D3
   3 - SEL$5DA710D3 / T3@SEL$2
   4 - SEL$5DA710D3 / T2@SEL$1

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T2"."OBJECT_ID"="T3"."OBJECT_ID")
   3 - filter("T3"."OBJECT_TYPE"='TABLE')

The Query Block Name / Object Alias information tells us that the query block holding the sort group by at operation 1 is named SEL$5DA710D3 so we can use that as the target query block in the hint: /*+ use_hash_aggregation(@SEL$5DA710D3) */ and the plan changes to:


-------------------------------------------------------------------------------
| Id  | Operation              | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |      |    23 |   575 |   319   (8)| 00:00:01 |
|   1 |  SORT ORDER BY         |      |    23 |   575 |   319   (8)| 00:00:01 |
|   2 |   HASH GROUP BY        |      |    23 |   575 |   319   (8)| 00:00:01 |
|*  3 |    HASH JOIN RIGHT SEMI|      |  2298 | 57450 |   317   (7)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL  | T3   |  2298 | 34470 |   158   (7)| 00:00:01 |
|   5 |     TABLE ACCESS FULL  | T2   | 57448 |   561K|   156   (6)| 00:00:01 |
-------------------------------------------------------------------------------

It’s possible that you’ll only notice a significant difference in performance (and, perhaps, PGA memory allocated) when you aggregate a large number of rows into a small result set. And it does seem that this is one case where the only way to get the plan you want without a significant code rewrite is through a hint.

tl;dr

If you have some code which does a “group by X, Y, Z order by X, Y, Z” (the order of the columns/expressions used doesn’t have to be the same for the two clauses) then Oracle will use a sort group by operation to handle the aggregation and ordering in a single step, even though we can find cases where hash aggregation followed by sort ordering is more efficient.

If you come across such a case then injecting the hint /*+ use_hash_aggregation(@query_block_name) */ may be the only way to change the execution plan if you’re not allowed to edit the SQL.

Footnote

In the second of the two links to the Oracle Developer Forum you’ll see that one poster pointed out that if the order by clause uses a suitable expression to substitute for one of the columns in the group by clause then you don’t need to hint the code, e.g.

group by
        numeric_column
order by
        numeric_column + 0

Very cute, but not a good idea.

Footnote 2

There is another part to the testing I started in 2007, and I’ll come back to that later in Hash Aggregation – 2.

April 5, 2021

Case Study

Filed under: lateral view,Oracle,Tuning,Upgrades — Jonathan Lewis @ 3:36 pm BST Apr 5,2021

A recent thread on the Oracle Developer forum posed an interesting question on “finding the closest match” to a numeric value. The OP supplied SQL to create two tables, first a set of “valid” values each with an Id, then a set of measures. The requirement was to find, for each measure, the closest valid value and report its id.

In this note I’m going to make a few comments on three topics:

  • how the question was posed,
  • general thoughts on modelling,
  • some ideas on what to look for when testing possible solutions

We’ll start with the data (almost) as supplied:

rem
rem     Script:         closest_match.sql
rem     Author:         Jonathan Lewis / user626688
rem     Dated:          Apr 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem             11.2.0.4         (with event 22829)
rem
 
create table nom_val_lkp(
        lkp_id  number       not null,
        nom_val number(3,2)  primary key
)
-- organization index
/

insert into nom_val_lkp values(1, 0.1);
insert into nom_val_lkp values(2, 0.2);
insert into nom_val_lkp values(3, 0.3);
insert into nom_val_lkp values(4, 0.4);
insert into nom_val_lkp values(5, 0.5);
insert into nom_val_lkp values(6, 0.6);
insert into nom_val_lkp values(7, 0.7);
insert into nom_val_lkp values(8, 0.8);
insert into nom_val_lkp values(9, 0.9);
commit;

create table measure_tbl(
        id              number      not null, 
        measure_val     number(3,2) not null
)
/

insert into measure_tbl values(1, 0.24);
insert into measure_tbl values(2, 0.5);
insert into measure_tbl values(3, 0.14);
insert into measure_tbl values(4, 0.68);
commit;

insert into measure_tbl values(5, 1.38);
insert into measure_tbl values(6, 0.05);
commit;


execute dbms_stats.gather_table_stats(null,'measure_tbl')
execute dbms_stats.gather_table_stats(null,'nom_val_lkp')

There are a couple of differences between the original and the SQL I’ve listed above. Given the nature of the requirement I’ve added not null constraints to both the lkp_id and nom_val columns of the “valid values” table. I think it’s also reasonable to assume that both columns ought to be (individually) unique and could both be candidate keys for the table although I’ve not bothered to add a uniqueness constraint to the lkp_id. I have made the nom_val (the interesting bit) the primary key because that’s potentially an important feature of a good solution. Obviously this is guesswork on my part, but I think they’re reasonable guesses of what the “real application” will look like and they’re details that ought to have been included in the original specification.

You’ll see that I’ve also included the option for making the table an index organized table – but that’s a generic implementation choice for small look-up tables not something that you could call an omission in the specification of requirements.

One thing to note about the nom_val_lkp table is that the nom_val is strictly constrained to be 3 digits with 2 decimal places, which means values from -9.99 to +9.99. It’s going to be a pretty small table – no more than 1,999 rows. (In “real life” it’s possible that the measures all have to be positive – and if so that’s another detail that could have gone into the specification – so the column could also have a check constraint to that effect.)
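
If the “positive values only” guess turned out to be correct, the constraints might look something like the following. This is a sketch under that assumption; the constraint names are invented, and whether the rule applies to one column or both is not something the original question tells us.

```sql
-- Hypothetical constraints: only appropriate if negative (or zero) values really are invalid
alter table nom_val_lkp
        add constraint nvl_ck_nom_val_positive check (nom_val > 0);

alter table measure_tbl
        add constraint mt_ck_measure_val_positive check (measure_val > 0);
```

With a constraint like this in place on nom_val the maximum size of the lookup table drops from 1,999 rows (-9.99 to +9.99 in steps of 0.01) to 999 rows (0.01 to 9.99).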

Looking at the measure_tbl (which is the “big data” table) I’ve added not null constraints to both columns; I’ve also added a couple of extra rows to the table to make sure that we can test boundary conditions when we write the final SQL statement. We’re looking for “the closest match” so we’ll be looking in the nom_val_lkp table for values above and below the measure value – so we ought to have a measure row where there is no “below” value and one with no “above” value. A common oversight in modelling is to forget about checking special cases, and simple boundary conditions are often overlooked (or inadequately covered).
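
A quick sanity check that the test data really does include both boundary cases could be sketched like this (it is not part of the original script):

```sql
-- List the measures that have no valid value at or below them,
-- or no valid value at or above them.
select
        mt.id, mt.measure_val
from
        measure_tbl     mt
where
        not exists (
                select  null
                from    nom_val_lkp nt
                where   nt.nom_val <= mt.measure_val
        )
or      not exists (
                select  null
                from    nom_val_lkp nt
                where   nt.nom_val >= mt.measure_val
        )
order by
        mt.id
;

-- With the six rows inserted above this should report
-- id 5 (1.38, nothing above) and id 6 (0.05, nothing below).
```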

Thinking about the “above / below / closest” requirement, an immediate follow-up question springs to mind. What if there is no exact match and the valid values either side are the same distance from the measure? If there’s a tie should the query return the lower value or the higher value, or does it not matter? The specification is not complete, and the most efficient solution may depend on this detail.

Interestingly the measure_val column is constrained in exactly the same way as the nom_val column – 3 digits with 2 d.p. Apparently the requirement isn’t something like “take a measurement to 6 decimal places then give me a value to 2 d.p.”; no matter how large the measure_tbl table gets the number of distinct values it records is pretty small – which means caching considerations could become important. With this thought in mind I added a few more lines (before gathering stats) to make multiple copies of the supplied measures data to model (approximately, and with a very large bias) a large table with a small number of distinct measures.

insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
insert into measure_tbl select * from measure_tbl;
commit;

update measure_tbl set id = rownum;
commit;

execute dbms_stats.gather_table_stats(null,'measure_tbl')

This doubling-up code resulted in a total of 6 * 1,024 = 6,144 rows in the table. I only increased the data volume after I’d checked that I had a query that produced the correct results, of course.
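
The arithmetic is easy to confirm with a simple count; this check is an addition for illustration rather than part of the original script:

```sql
-- Ten doublings of 6 rows gives 6 * 2^10 = 6,144 rows,
-- but still only the 6 original measure values.
select
        count(*)                        row_count,      -- 6144
        count(distinct measure_val)     distinct_vals,  -- 6
        count(distinct id)              distinct_ids    -- 6144 (after the update to rownum)
from
        measure_tbl
;
```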

A possible solution

By the time I saw the thread on the Oracle forum there were already three solutions on offer, but they all took the route of using analytic functions, including one that used keep(dense_rank …), and these all involved sorting the entire measures dataset; so I thought I’d try an approach that demonstrated a completely different method that was visibly following the strategy: “for each row do two high-precision lookups”. I implemented this by joining two lateral views of the lookup table to the measures table. Since I was sitting in front of a copy of 11.2.0.4 at the time I had to set the event 22829 to enable the feature – here’s the basic code with the plan produced by 11g:

select  /*+ qb_name(main) */
        mt.id,
        mt.measure_val,
        case
                when
                        nt_high.nom_val - mt.measure_val <=
                        mt.measure_val - nt_low.nom_val
                then    nvl(nt_high.lkp_id,nt_low.lkp_id)
                else    nvl(nt_low.lkp_id,nt_high.lkp_id)
        end     lkp_id,
        nt_low.nom_val  low_val,
        nt_low.lkp_id   low_lkp,
        nt_high.nom_val high_val,
        nt_high.lkp_id  high_lkp 
from
        measure_tbl     mt,
        lateral(
                select
                        /*+ qb_name(low) index_rs_desc(nt (nom_val)) */
                        nt.lkp_id, nt.nom_val
                from    nom_val_lkp nt
                where   nt.nom_val <= mt.measure_val
                and     rownum = 1
        )(+) nt_low,
        lateral(
                select
                        /*+ qb_name(high) index_rs_asc(nt (nom_val)) */
                        nt.lkp_id, nt.nom_val
                from    nom_val_lkp nt
                where   nt.nom_val >= mt.measure_val
                and     rownum = 1
        ) (+) nt_high
/

        ID MEASURE_VAL     LKP_ID    LOW_VAL    LOW_LKP   HIGH_VAL   HIGH_LKP
---------- ----------- ---------- ---------- ---------- ---------- ----------
         1         .24          2         .2          2         .3          3
         2          .5          5         .5          5         .5          5
         3         .14          1         .1          1         .2          2
         4         .68          7         .6          6         .7          7
         5        1.38          9         .9          9
         6         .05          1                               .1          1


6 rows selected.

You’ll notice that (for debugging purposes) I’ve included columns in my output for the lkp_id and nom_val just lower than (or matching) and just higher than (or matching) the measure_val. The blanks this produces in two of the rows conveniently highlight the cases where the measure is “out of bounds”.

With my tiny data set I had to include the index_rs_desc() hint. Of course I should really have included an “order by” clause in the two subqueries and used an extra layer of inline views to introduce the rownum = 1 predicate, viz:

        lateral(
                select  * 
                from    (
                        select  /*+ qb_name(low) */
                                nt.lkp_id, nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        order by
                                nom_val desc
                )
                where   rownum = 1
        )(+) nt_low,

There were two reasons I didn’t do this: first I wanted to keep the code short, secondly it wouldn’t have worked with 11g because it was only in 12c that a correlated subquery could correlate more than one level up – the predicate referencing mt.measure_val would have raised error “ORA-00904: invalid identifier”.

If you’re not familiar with lateral views, the idea is quite simple: as with any inline view in the from clause it’s just a query that returns a result set that looks like a table, but it has the special condition that the predicates in the query can reference columns from tables (or views) that have appeared further to the left in (or, equivalently, further up) the from clause. In this case both of my inline views query nom_val_lkp and both of them reference a column in measure_tbl which was the first table in the from clause.

There are two distinguishing details that are a consequence of the lateral view. First, the view effectively has a join to the driving table built into it so my main query doesn’t have any where clause predicates joining the views to the rest of the query. Secondly I want to do outer joins (to deal with the cases where there isn’t a nom_val higher/lower than the measure_val) so in the absence of a join predicate in the main query the necessary syntax simply adds Oracle’s traditional “(+)” to the lateral() operator itself. (If you want to go “full-ANSI” you would use outer apply() instead of lateral()(+) at this point – but 11g doesn’t support outer apply().)
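
For comparison, here is a sketch of the “low” lookup written with outer apply, which needs 12c or later. It is shown for the syntax only and has not been run against this data set; since it still depends on a correlated rownum = 1 predicate, the results should be checked against the lateral()(+) version on whatever version you use.

```sql
-- 12c+ only: outer apply replaces lateral(...)(+) for the "low" lookup
select  /*+ qb_name(main) */
        mt.id,
        mt.measure_val,
        nt_low.nom_val  low_val,
        nt_low.lkp_id   low_lkp
from
        measure_tbl     mt
        outer apply (
                select
                        /*+ qb_name(low) index_rs_desc(nt (nom_val)) */
                        nt.lkp_id, nt.nom_val
                from    nom_val_lkp nt
                where   nt.nom_val <= mt.measure_val
                and     rownum = 1
        ) nt_low
/
```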

Here’s the execution plan from 11g for this query – I’ve enabled rowsource execution stats and pulled the plan from memory using the ‘allstats last’ format option:

select * from table(dbms_xplan.display_cursor(format=>'allstats last'));

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |              |      1 |        |      6 |00:00:00.01 |      29 |
|   1 |  NESTED LOOPS OUTER              |              |      1 |      6 |      6 |00:00:00.01 |      29 |
|   2 |   NESTED LOOPS OUTER             |              |      1 |      6 |      6 |00:00:00.01 |      18 |
|   3 |    TABLE ACCESS FULL             | MEASURE_TBL  |      1 |      6 |      6 |00:00:00.01 |       7 |
|   4 |    VIEW                          |              |      6 |      1 |      5 |00:00:00.01 |      11 |
|*  5 |     COUNT STOPKEY                |              |      6 |        |      5 |00:00:00.01 |      11 |
|   6 |      TABLE ACCESS BY INDEX ROWID | NOM_VAL_LKP  |      6 |      2 |      5 |00:00:00.01 |      11 |
|*  7 |       INDEX RANGE SCAN DESCENDING| SYS_C0072287 |      6 |      6 |      5 |00:00:00.01 |       6 |
|   8 |   VIEW                           |              |      6 |      1 |      5 |00:00:00.01 |      11 |
|*  9 |    COUNT STOPKEY                 |              |      6 |        |      5 |00:00:00.01 |      11 |
|  10 |     TABLE ACCESS BY INDEX ROWID  | NOM_VAL_LKP  |      6 |      1 |      5 |00:00:00.01 |      11 |
|* 11 |      INDEX RANGE SCAN            | SYS_C0072287 |      6 |      4 |      5 |00:00:00.01 |       6 |
-----------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   5 - filter(ROWNUM=1)
   7 - access("NT"."NOM_VAL"<="MT"."MEASURE_VAL")
       filter("NT"."NOM_VAL"<="MT"."MEASURE_VAL")
   9 - filter(ROWNUM=1)
  11 - access("NT"."NOM_VAL">="MT"."MEASURE_VAL")


As you can see we’ve done a full tablescan of measure_tbl, then performed an outer join to each of two (unnamed) views for each row, and each time we’ve accessed a view we’ve done an index range scan (descending in one case) into nom_val_lkp, passing in (according to the Predicate Information) the measure_val from measure_tbl.

It’s a little oddity I hadn’t noticed before that the ascending and descending range scans behave slightly differently – the descending range scan says we’ve used the predicate as both an access and a filter predicate. I’ll have to check whether this is always the case or whether it’s version-dependent or whether it’s only true under some conditions.

The only other detail to mention is the expression I’ve used to report the closest match – which is a little messy to allow for “out of range” measures:

        case
                when
                        nt_high.nom_val - mt.measure_val <=
                        mt.measure_val - nt_low.nom_val
                then    nvl(nt_high.lkp_id,nt_low.lkp_id)
                else    nvl(nt_low.lkp_id,nt_high.lkp_id)
        end     lkp_id,

This case expression says that if the higher nom_val is closer to (or, to be precise, not further from) the measure_val than the lower nom_val then report the higher lkp_id, otherwise report the lower lkp_id. The ordering of the comparison means that when the differences are the same the higher value will always be reported; and the “cross-over” use of the nvl() function ensures that when the measure_val is out of range (which means one of the nom_val subqueries will have returned null, so the comparison evaluates to null and the else branch is taken) we see the lkp_id for the nom_val that’s at the end of the range rather than a null.
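
The behaviour of the expression can be checked in isolation by feeding it a few hand-built rows. The following is a stand-alone sketch: the demo_rows subquery and its values are made up to cover the ordinary case, a tie, and the two out-of-range cases.

```sql
with demo_rows as (
        select 0.24 measure_val, 0.2 low_val, 2 low_lkp, 0.3 high_val, 3 high_lkp from dual
        union all       -- tie: both neighbours are 0.05 away
        select 0.25, 0.2, 2, 0.3, 3 from dual
        union all       -- nothing above the measure
        select 1.38, 0.9, 9, to_number(null), to_number(null) from dual
        union all       -- nothing below the measure
        select 0.05, to_number(null), to_number(null), 0.1, 1 from dual
)
select
        measure_val,
        case
                when
                        high_val - measure_val <=
                        measure_val - low_val
                then    nvl(high_lkp, low_lkp)
                else    nvl(low_lkp, high_lkp)
        end     lkp_id
from
        demo_rows
;

-- Expected lkp_id values:
--      0.24 -> 2       (lower value is closer)
--      0.25 -> 3       (tie, higher value reported)
--      1.38 -> 9       (comparison is null, else branch, low_lkp)
--      0.05 -> 1       (comparison is null, else branch, nvl falls through to high_lkp)
```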

Some bad news

At first sight the lateral() view looks as if it might be a candidate for scalar subquery caching – so when I create multiple copies of the 6 rows in the measure_tbl and run my query against the expanded data set I might hope to get excellent performance because Oracle might only have to call each lateral view once and cache the subquery inputs and results from that point onwards. But here are the stats I get from the 11g plan after expanding the data to 6,144 rows:

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |              |      1 |        |   6144 |00:00:00.82 |   22953 |
|   1 |  NESTED LOOPS OUTER              |              |      1 |   6144 |   6144 |00:00:00.82 |   22953 |
|   2 |   NESTED LOOPS OUTER             |              |      1 |   6144 |   6144 |00:00:00.47 |   11689 |
|   3 |    TABLE ACCESS FULL             | MEASURE_TBL  |      1 |   6144 |   6144 |00:00:00.03 |     425 |
|   4 |    VIEW                          |              |   6144 |      1 |   5120 |00:00:00.28 |   11264 |
|*  5 |     COUNT STOPKEY                |              |   6144 |        |   5120 |00:00:00.20 |   11264 |
|   6 |      TABLE ACCESS BY INDEX ROWID | NOM_VAL_LKP  |   6144 |      2 |   5120 |00:00:00.12 |   11264 |
|*  7 |       INDEX RANGE SCAN DESCENDING| SYS_C0072291 |   6144 |      5 |   5120 |00:00:00.04 |    6144 |
|   8 |   VIEW                           |              |   6144 |      1 |   5120 |00:00:00.32 |   11264 |
|*  9 |    COUNT STOPKEY                 |              |   6144 |        |   5120 |00:00:00.19 |   11264 |
|  10 |     TABLE ACCESS BY INDEX ROWID  | NOM_VAL_LKP  |   6144 |      2 |   5120 |00:00:00.11 |   11264 |
|* 11 |      INDEX RANGE SCAN            | SYS_C0072291 |   6144 |      3 |   5120 |00:00:00.04 |    6144 |
-----------------------------------------------------------------------------------------------------------


Look at the Starts column: the two views were called once each for every single row in the expanded measure_tbl, there’s no scalar subquery caching going on.

Bug time (1)

Of course, this is 11g and I’ve enabled lateral views by setting an event; it’s not an officially supported feature so maybe if I upgrade to 12c (or 19c), where the feature is official, Oracle will do better.

Here are the results of the original query against the original data set in 12c and 19c:

        ID MEASURE_VAL     LKP_ID    LOW_VAL    LOW_LKP   HIGH_VAL   HIGH_LKP
---------- ----------- ---------- ---------- ---------- ---------- ----------
         6         .05          1                               .1          1
         3         .14          1         .1          1
         1         .24          1         .1          1
         2          .5          1         .1          1
         4         .68          1         .1          1
         5        1.38          1         .1          1

On the upgrade I’ve got the wrong results! So what does the execution plan look like?

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |                 |      1 |        |      6 |00:00:00.01 |      17 |       |       |          |
|   1 |  MERGE JOIN OUTER       |                 |      1 |      6 |      6 |00:00:00.01 |      17 |       |       |          |
|   2 |   SORT JOIN             |                 |      1 |      6 |      6 |00:00:00.01 |      12 |  2048 |  2048 | 2048  (0)|
|   3 |    MERGE JOIN OUTER     |                 |      1 |      6 |      6 |00:00:00.01 |      12 |       |       |          |
|   4 |     SORT JOIN           |                 |      1 |      6 |      6 |00:00:00.01 |       7 |  2048 |  2048 | 2048  (0)|
|   5 |      TABLE ACCESS FULL  | MEASURE_TBL     |      1 |      6 |      6 |00:00:00.01 |       7 |       |       |          |
|*  6 |     SORT JOIN           |                 |      6 |      1 |      5 |00:00:00.01 |       5 |  2048 |  2048 | 2048  (0)|
|   7 |      VIEW               | VW_DCL_A18161FF |      1 |      1 |      1 |00:00:00.01 |       5 |       |       |          |
|*  8 |       COUNT STOPKEY     |                 |      1 |        |      1 |00:00:00.01 |       5 |       |       |          |
|   9 |        TABLE ACCESS FULL| NOM_VAL_LKP     |      1 |      1 |      1 |00:00:00.01 |       5 |       |       |          |
|* 10 |   SORT JOIN             |                 |      6 |      1 |      1 |00:00:00.01 |       5 |  2048 |  2048 | 2048  (0)|
|  11 |    VIEW                 | VW_DCL_A18161FF |      1 |      1 |      1 |00:00:00.01 |       5 |       |       |          |
|* 12 |     COUNT STOPKEY       |                 |      1 |        |      1 |00:00:00.01 |       5 |       |       |          |
|  13 |      TABLE ACCESS FULL  | NOM_VAL_LKP     |      1 |      1 |      1 |00:00:00.01 |       5 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   6 - access(INTERNAL_FUNCTION("NOM_VAL")<=INTERNAL_FUNCTION("MT"."MEASURE_VAL"))
       filter(INTERNAL_FUNCTION("NOM_VAL")<=INTERNAL_FUNCTION("MT"."MEASURE_VAL"))
   8 - filter(ROWNUM=1)
  10 - access("NOM_VAL">="MT"."MEASURE_VAL")
       filter("NOM_VAL">="MT"."MEASURE_VAL")
  12 - filter(ROWNUM=1)


Check what’s appeared in the Name column for the view operations 7 and 11: VW_DCL_A18161FF (DCL ~ “decorrelate”); I was expecting to see names starting with VW_LAT (LAT ~ “lateral”). And then I remembered reading this article by Sayan Malakshinov – Oracle (12c+) can decorrelate lateral views but gets the wrong results with rownum. So let’s add in a couple of hints to avoid decorrelation /*+ no_decorrelate(@low) no_decorrelate(@high) */ and check the results and execution plan.

-------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                 |      1 |        |      6 |00:00:00.01 |      30 |       |       |          |
|   1 |  MERGE JOIN OUTER                        |                 |      1 |      6 |      6 |00:00:00.01 |      30 |       |       |          |
|   2 |   MERGE JOIN OUTER                       |                 |      1 |      6 |      6 |00:00:00.01 |      19 |       |       |          |
|   3 |    TABLE ACCESS FULL                     | MEASURE_TBL     |      1 |      6 |      6 |00:00:00.01 |       8 |       |       |          |
|   4 |    BUFFER SORT                           |                 |      6 |      1 |      5 |00:00:00.01 |      11 |  2048 |  2048 | 2048  (0)|
|   5 |     VIEW                                 | VW_LAT_D77DA787 |      6 |      1 |      5 |00:00:00.01 |      11 |       |       |          |
|*  6 |      COUNT STOPKEY                       |                 |      6 |        |      5 |00:00:00.01 |      11 |       |       |          |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| NOM_VAL_LKP     |      6 |      2 |      5 |00:00:00.01 |      11 |       |       |          |
|*  8 |        INDEX RANGE SCAN                  | SYS_C0055681    |      6 |      3 |      5 |00:00:00.01 |       6 |       |       |          |
|   9 |   BUFFER SORT                            |                 |      6 |      1 |      5 |00:00:00.01 |      11 |  2048 |  2048 | 2048  (0)|
|  10 |    VIEW                                  | VW_LAT_D77DA787 |      6 |      1 |      5 |00:00:00.01 |      11 |       |       |          |
|* 11 |     COUNT STOPKEY                        |                 |      6 |        |      5 |00:00:00.01 |      11 |       |       |          |
|  12 |      TABLE ACCESS BY INDEX ROWID BATCHED | NOM_VAL_LKP     |      6 |      2 |      5 |00:00:00.01 |      11 |       |       |          |
|* 13 |       INDEX RANGE SCAN DESCENDING        | SYS_C0055681    |      6 |      5 |      5 |00:00:00.01 |       6 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   6 - filter(ROWNUM=1)
   8 - access("NT"."NOM_VAL">="MT"."MEASURE_VAL")
  11 - filter(ROWNUM=1)
  13 - access("NT"."NOM_VAL"<="MT"."MEASURE_VAL")
       filter("NT"."NOM_VAL"<="MT"."MEASURE_VAL")

Blocking decorrelation was sufficient to get the correct result but there’s still a funny little glitch in the execution plan: why do we have merge join (outer) for operations 1 and 2?

It’s not quite the threat you might think; we’re not multiplying up rows catastrophically. For each row in measure_tbl Oracle does a Cartesian merge join to (at most) one row in each view – so there’s no accidental explosion in data volume, and there’s no real sorting. Nevertheless there may be unnecessary CPU usage, so let’s try to get nested loops by adding the following hints to the start of the query:

        /*+
                qb_name(main)
                leading(@main mt@main nt_high@main nt_low@main)
                use_nl(@main nt_high@main)
                use_nl(@main nt_low@main)
        */

I was a little surprised at the benefit – roughly a 30% saving on CPU for the same data set.
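The note doesn’t say how the CPU figures were captured. One simple way to make the comparison (an assumption, not necessarily the method used for the 30% figure) is to snapshot the session statistic “CPU used by this session” before and after each version of the query and compare the deltas. The sketch needs select privileges on v$mystat and v$statname:

```sql
-- Sketch only: run this before and after each test query in the
-- same session; the difference between the two values is the CPU
-- (in centiseconds) used by the session in between.

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name = 'CPU used by this session'
/
```

With a small data set the deltas will be tiny, so the comparison is only meaningful with a row count large enough to dominate the measurement noise.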

But there’s more to investigate – I didn’t like the index hints that I’d had to use in 11g, but 12c allows for the more complex “two layer” lateral subquery with its deeply correlated predicate – so what happens if I use the following corrected query (with minimal hinting) in 12c or 19c:

select
        /*+
                qb_name(main)
--              leading(@main mt@main nt_high@main nt_low@main)
--              use_nl(@main nt_high@main)
--              use_nl(@main nt_low@main)
        */
        mt.id,
        mt.measure_val,
        case
                when
                        nt_high.nom_val - mt.measure_val <=
                        mt.measure_val - nt_low.nom_val
                then    nvl(nt_high.lkp_id,nt_low.lkp_id)
                else    nvl(nt_low.lkp_id,nt_high.lkp_id)
        end     lkp_id,
        nt_low.nom_val  low_val,
        nt_low.lkp_id   low_lkp,
        nt_high.nom_val high_val,
        nt_high.lkp_id  high_lkp 
from
        measure_tbl     mt,
        lateral(
                select  *
                from    (
                        select  /*+ qb_name(low) */
                                nt.lkp_id, nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        order by
                                nom_val desc
                        )
                where   rownum = 1
        )(+) nt_low,
        lateral(
                select  *
                from    (
                        select  /*+ qb_name(high) */
                                nt.lkp_id, nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val >= mt.measure_val
                        order by
                                nom_val
                )
                where   rownum = 1
        )(+) nt_high
/

First – Oracle doesn’t use decorrelation so I get the right results; secondly Oracle uses the correct index descending without hinting, which is an important part of getting the right results. Unfortunately I still see merge joins unless I include the use_nl() hints (with the leading() hint as an extra safety barrier) to get that 30% reduction in CPU usage.

The sad news is that I still don’t see scalar subquery caching. If I have 6,144 rows in measure_tbl I still see 6,144 executions of both the lateral subqueries.

Since Oracle supports “outer apply” from 12c onwards it’s worth testing to see what happens if I replace my lateral()(+) operator with the outer apply() mechanism. It works, but behaves very much like the lateral approach (including the unexpected merge joins unless hinted), except it introduces another layer of lateral joins. Here’s the plan (12c and 19c) with 6,144 rows:

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |   6144 |00:00:00.14 |   22954 |       |       |          |
|   1 |  MERGE JOIN OUTER                   |                 |      1 |   6144 |   6144 |00:00:00.14 |   22954 |       |       |          |
|   2 |   MERGE JOIN OUTER                  |                 |      1 |   6144 |   6144 |00:00:00.08 |   11690 |       |       |          |
|   3 |    TABLE ACCESS FULL                | MEASURE_TBL     |      1 |   6144 |   6144 |00:00:00.01 |     426 |       |       |          |
|   4 |    BUFFER SORT                      |                 |   6144 |      1 |   5120 |00:00:00.06 |   11264 |  2048 |  2048 | 2048  (0)|
|   5 |     VIEW                            | VW_LAT_F8C248CF |   6144 |      1 |   5120 |00:00:00.04 |   11264 |       |       |          |
|   6 |      VIEW                           | VW_LAT_A18161FF |   6144 |      1 |   5120 |00:00:00.04 |   11264 |       |       |          |
|*  7 |       COUNT STOPKEY                 |                 |   6144 |        |   5120 |00:00:00.03 |   11264 |       |       |          |
|   8 |        VIEW                         |                 |   6144 |      2 |   5120 |00:00:00.03 |   11264 |       |       |          |
|   9 |         TABLE ACCESS BY INDEX ROWID | NOM_VAL_LKP     |   6144 |      6 |   5120 |00:00:00.02 |   11264 |       |       |          |
|* 10 |          INDEX RANGE SCAN DESCENDING| SYS_C0023500    |   6144 |      2 |   5120 |00:00:00.01 |    6144 |       |       |          |
|  11 |   BUFFER SORT                       |                 |   6144 |      1 |   5120 |00:00:00.06 |   11264 |  2048 |  2048 | 2048  (0)|
|  12 |    VIEW                             | VW_LAT_F8C248CF |   6144 |      1 |   5120 |00:00:00.04 |   11264 |       |       |          |
|  13 |     VIEW                            | VW_LAT_E88661A9 |   6144 |      1 |   5120 |00:00:00.04 |   11264 |       |       |          |
|* 14 |      COUNT STOPKEY                  |                 |   6144 |        |   5120 |00:00:00.03 |   11264 |       |       |          |
|  15 |       VIEW                          |                 |   6144 |      1 |   5120 |00:00:00.02 |   11264 |       |       |          |
|  16 |        TABLE ACCESS BY INDEX ROWID  | NOM_VAL_LKP     |   6144 |      1 |   5120 |00:00:00.02 |   11264 |       |       |          |
|* 17 |         INDEX RANGE SCAN            | SYS_C0023500    |   6144 |      4 |   5120 |00:00:00.01 |    6144 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):
---------------------------------------------------
   7 - filter(ROWNUM=1)
  10 - access("NT"."NOM_VAL"<="MT"."MEASURE_VAL")
  14 - filter(ROWNUM=1)
  17 - access("NT"."NOM_VAL">="MT"."MEASURE_VAL")

Note operations 5 and 6, then 12 and 13: the “ANSI” syntax outer apply seems to be another case of Oracle doing more work because it has to transform the query before optimising.
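The text of the outer apply version isn’t shown above, so here’s a sketch of what the rewrite might look like, derived from the lateral()(+) query earlier in the note. The select list has been cut back (no case expression) for brevity, and the exact text used to produce the plan above may have differed:

```sql
-- Sketch only: lateral(...)(+) replaced by outer apply (...).
-- The query block names (low, high) are carried over from the
-- lateral version of the query.

select
        mt.id,
        mt.measure_val,
        nt_low.nom_val  low_val,
        nt_low.lkp_id   low_lkp,
        nt_high.nom_val high_val,
        nt_high.lkp_id  high_lkp
from
        measure_tbl     mt
outer apply (
                select  *
                from    (
                        select  /*+ qb_name(low) */
                                nt.lkp_id, nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        order by
                                nom_val desc
                        )
                where   rownum = 1
        ) nt_low
outer apply (
                select  *
                from    (
                        select  /*+ qb_name(high) */
                                nt.lkp_id, nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val >= mt.measure_val
                        order by
                                nom_val
                        )
                where   rownum = 1
        ) nt_high
/
```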

A Traditional Solution

Having worked through a few of the newer mechanisms in Oracle, why not think back to how the same pattern of implementation could have been achieved in older versions of Oracle. What’s wrong, for example, with using scalar subqueries in the select list? If we can expect plenty of scalar subquery caching this might be a very effective way of writing the query.

The immediate problem, though, is that scalar subqueries in the select list only allow one column to be returned (unless you want to fake things through by playing nasty games with user-defined types). So our two lateral views will have to change to four scalar subqueries to get all the data we need.

Here’s a possible solution (for compactness I’ve stuck with the shorter, hinted, but bad practice “first row” mechanism) – with execution stats:

select
        id,
        measure_val,
        case
                when
                        nt_high_nom_val - measure_val <=
                        measure_val - nt_low_nom_val
                then    nvl(nt_high_lkp_id,nt_low_lkp_id)
                else    nvl(nt_low_lkp_id,nt_high_lkp_id)
        end     lkp_id,
        nt_low_nom_val,
        nt_low_lkp_id,
        nt_high_nom_val,
        nt_high_lkp_id
from    (
        select
                mt.id,
                mt.measure_val,
                (
                        select
                                /*+ index_rs_asc(nt (nom_val)) */
                                nt.lkp_id
                        from    nom_val_lkp nt
                        where   nt.nom_val >= mt.measure_val
                        and     rownum = 1
                ) nt_high_lkp_id,
                (
                        select
                                /*+ index_rs_asc(nt (nom_val)) */
                                nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val >= mt.measure_val
                        and     rownum = 1
                ) nt_high_nom_val,
                (
                        select
                                /*+ index_rs_desc(nt (nom_val)) */
                                nt.lkp_id
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        and     rownum = 1
                ) nt_low_lkp_id,
                (
                        select
                                /*+ index_rs_desc(nt (nom_val)) */
                                nt.nom_val
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        and     rownum = 1
                ) nt_low_nom_val
        from
                measure_tbl     mt
        )
/

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |              |      1 |        |   6144 |00:00:00.01 |     426 |
|*  1 |  COUNT STOPKEY                          |              |      6 |        |      5 |00:00:00.01 |       6 |
|*  2 |   INDEX RANGE SCAN                      | SYS_C0023507 |      6 |      1 |      5 |00:00:00.01 |       6 |
|*  3 |   COUNT STOPKEY                         |              |      6 |        |      5 |00:00:00.01 |       6 |
|*  4 |    INDEX RANGE SCAN DESCENDING          | SYS_C0023507 |      6 |      1 |      5 |00:00:00.01 |       6 |
|*  5 |    COUNT STOPKEY                        |              |      6 |        |      5 |00:00:00.01 |      11 |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED | NOM_VAL_LKP  |      6 |      1 |      5 |00:00:00.01 |      11 |
|*  7 |      INDEX RANGE SCAN                   | SYS_C0023507 |      6 |      1 |      5 |00:00:00.01 |       6 |
|*  8 |     COUNT STOPKEY                       |              |      6 |        |      5 |00:00:00.01 |      11 |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED| NOM_VAL_LKP  |      6 |      1 |      5 |00:00:00.01 |      11 |
|* 10 |       INDEX RANGE SCAN DESCENDING       | SYS_C0023507 |      6 |      1 |      5 |00:00:00.01 |       6 |
|  11 |  TABLE ACCESS FULL                      | MEASURE_TBL  |      1 |   6144 |   6144 |00:00:00.01 |     426 |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(ROWNUM=1)
   2 - access("NT"."NOM_VAL">=:B1)
   3 - filter(ROWNUM=1)
   4 - access("NT"."NOM_VAL"<=:B1)
       filter("NT"."NOM_VAL"<=:B1)
   5 - filter(ROWNUM=1)
   7 - access("NT"."NOM_VAL">=:B1)
   8 - filter(ROWNUM=1)
  10 - access("NT"."NOM_VAL"<=:B1)
       filter("NT"."NOM_VAL"<=:B1)

I’ve left the index hints in place in this example so that the code can run on 11g and earlier (without setting any special events, of course); but in 12c and 19c if you replace the subqueries with the double-layer subqueries (inline order by, then rownum = 1) as shown further up the page the hints (specifically the descending hints) are no longer necessary.
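As a sketch of that 12c/19c alternative, here is how two of the four scalar subqueries might look in the double-layer form with no index hints. The other two columns follow the same pattern; this is an illustration of the shape, not a tested replacement for the query above:

```sql
-- Sketch only (12c and later): inline view with order by, then
-- rownum = 1 in the outer layer. The correlation to mt.measure_val
-- is two levels deep, which is why this form needs 12c or later.

select
        mt.id,
        mt.measure_val,
        (
                select  lkp_id
                from    (
                        select  nt.lkp_id
                        from    nom_val_lkp nt
                        where   nt.nom_val >= mt.measure_val
                        order by
                                nt.nom_val
                        )
                where   rownum = 1
        ) nt_high_lkp_id,
        (
                select  lkp_id
                from    (
                        select  nt.lkp_id
                        from    nom_val_lkp nt
                        where   nt.nom_val <= mt.measure_val
                        order by
                                nt.nom_val desc
                        )
                where   rownum = 1
        ) nt_low_lkp_id
from
        measure_tbl     mt
/
```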

The key performance benefit of this approach is visible in the Starts column – although I now have 4 subqueries to run (which should mean doing more work) each one runs only 6 times for the 6,144 rows in measure_tbl thanks to an extremely “lucky” level of scalar subquery caching.

This, really, is where this note takes us back to the beginning. Will this be a fantastic solution for the end-user, or does the pattern of the data mean that it’s going to be a total disaster? It’s nice to see the SQL that defines the tables and supplies a bit of test data – but there’s no point in trying to provide a solution without a better idea of what the data really looks like and what the critical usage is in production.
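A first step towards that better idea of the data is a quick profile of the driving column. Scalar subquery caching is keyed on the values passed into the subquery, so the number of distinct values of measure_val relative to the number of rows gives a rough indication of how much caching is even possible. This is a minimal sketch that says nothing about the order in which values arrive or about collisions in the cache:

```sql
-- Sketch only: a large gap between row_count and distinct_vals
-- suggests there is scope for scalar subquery caching; if the two
-- numbers are close, every row is likely to execute the subqueries.

select
        count(*)                        row_count,
        count(distinct measure_val)     distinct_vals
from
        measure_tbl
/
```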

Bug time (2)

Nothing’s perfect, of course – and even though this last SQL statement is pretty simple and its execution plan is (for the right data pattern) very efficient, the shape of the plan is wrong – and in more complex plans you could be fooled into thinking that Oracle isn’t doing what you want it to do.

Operations 1,3,5,8 and 11 should all be at the same depth (you’ll find that they all have parent_id = 0 if you look at the underlying data in v$sql_plan): there’s a defect in Oracle’s calculation of the depth column of v$sql_plan (et al.) that introduces a pattern of indentation that shouldn’t be there.
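To see the mismatch for yourself, compare the parent_id and depth columns directly. This is a minimal sketch: the substitution variable m_sql_id is a hypothetical placeholder for the sql_id of the statement, and child cursor 0 is assumed:

```sql
-- Sketch only: list the raw plan rows for one child cursor.
-- &m_sql_id is a placeholder; supply the sql_id of your statement.
-- Rows with the same parent_id should share the same depth; the
-- scalar subquery operations are where the two columns disagree.

select
        id, parent_id, depth,
        operation, options, object_name
from
        v$sql_plan
where
        sql_id       = '&m_sql_id'
and     child_number = 0
order by
        id
/
```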

Summary

This has been a fairly informal ramble through the playing around that I did after I read the original post. It holds some comments about the way the question was asked, the test data as supplied and corrected, and the observations and tweaks as the testing progressed.

On the plus side, the OP has supplied code to create and populate a model, and described what they wanted to see as a result. However the requirement didn’t mention (and the model therefore didn’t cater for) a couple of special cases. There were also a few cases where unique and mandatory columns were likely to be appropriate but were not mentioned, even though they could affect the correctness or performance of any suggested solutions.

More importantly, although the model implied some fairly narrow restrictions on what the production data might look like this information wasn’t presented explicitly, and there were no comments about the ultimate scale and distribution patterns of the data that might give some clues about the most appropriate features of SQL to use.

October 9, 2020

Inline Hint

Filed under: 18c,CBO,Execution plans,Hints,Oracle,subqueries,Subquery Factoring,Tuning — Jonathan Lewis @ 12:46 pm BST Oct 9,2020

If you’ve ever used subquery factoring (“with” subqueries or common table expressions (CTEs) as they are often called) then you’re probably aware of the (undocumented) hints /*+ materialize */ , which forces Oracle to create a local temporary table to hold the result of the subquery for subsequent use, and /*+ inline */, which forces the optimizer to copy the text of the subquery into the body of the query before starting the optimisation phase.

There’s a small, but important, enhancement to these hints that appeared in Oracle 18. Like so many other hints in Oracle they can now have a query block name as a “parameter”, so you can use them at the top level of your query. Here’s some code to demonstrate:

rem
rem     Script:         inline_hint.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             18.3.0.0
rem             12.2.0.1  -- hints don't have any effect
rem

create table t1
as
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue
/

create index t1_i1 on t1(object_id);

create table t2
as
select  *
from    t1
/

create index t2_i1 on t2(object_id);

spool inline_hint.lst


explain plan for
with v1 as (
        select 
                /*+ 
                        qb_name(cte) 
                */ 
                object_id, object_type, object_name 
                from t2 gtt1
                where object_id is not null
)
select
        /*+
                qb_name(main)
                inline(@cte)
        */
        t1.object_id,
        t1.object_name
from
        t1
where
        exists (
                select
                        null
                from
                        v1      v1a
                where
                        v1a.object_id = t1.object_id
                and     v1a.object_type = 'TABLE'
        )
and     exists (
                select
                        null
                from
                        v1      v1b
                where
                        v1b.object_id = t1.object_id
                and     v1b.object_name like 'WRI%'
        )
and
        t1.object_id between 100 and 200
/

select * from table(dbms_xplan.display(format=>'alias'));

explain plan for
with v1 as (
        select 
                /*+ 
                        qb_name(cte) 
                */ 
                object_id, object_type, object_name 
                from t2 gtt1
                where object_id is not null
)
select
        /*+
                qb_name(main)
                materialize(@cte)
        */
        t1.object_id,
        t1.object_name
from
        t1
where
        exists (
                select
                        null
                from
                        v1      v1a
                where
                        v1a.object_id = t1.object_id
                and     v1a.object_type = 'TABLE'
        )
and
        t1.object_id between 100 and 200
/

select * from table(dbms_xplan.display(format=>'alias'));

The first of these two queries uses the factored subquery twice so, by default, it will create a “cursor duration memory” temporary table to hold the results of the subquery and then use that temporary table twice in the execution plan.

Conversely the second query uses the factored subquery just once, so the optimizer’s default action will be to copy the text into the body of the main query and optimize the whole thing as a single query block.

To reverse the default behaviour in versions of Oracle up to 12.2.0.1 (though later patch sets may include the 18c enhancements) you could add the /*+ inline */ or /*+ materialize */ hints respectively to the factored subqueries; but in my demonstration you can see that I’ve given the factored subquery a query block name and added the relevant hint to the main query block passing in the query block name of the factored subquery – hence /*+ inline(@cte) */ and /*+ materialize(@cte) */.
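For comparison, this is a sketch of the older style applied to the second query, with the hint sitting inside the factored subquery itself rather than in the main query block. It is the same statement as above with the query block names removed:

```sql
-- Sketch only: pre-18c placement of the hint, inside the
-- factored subquery that it applies to.

explain plan for
with v1 as (
        select
                /*+ materialize */
                object_id, object_type, object_name
        from    t2 gtt1
        where   object_id is not null
)
select
        t1.object_id,
        t1.object_name
from
        t1
where
        exists (
                select
                        null
                from
                        v1      v1a
                where
                        v1a.object_id = t1.object_id
                and     v1a.object_type = 'TABLE'
        )
and
        t1.object_id between 100 and 200
/

select * from table(dbms_xplan.display(format=>'alias'));
```

The drawback of this form is that you have to edit the text of the factored subquery, which is exactly what the query block parameter lets you avoid.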

Here – from 19.3 – are the resulting execution plans (with some cosmetic editing) – first the plan with the inline() hint.

------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |       |     1 |    63 |     9  (12)| 00:00:01 |
|   1 |  NESTED LOOPS SEMI                     |       |     1 |    63 |     9  (12)| 00:00:01 |
|   2 |   NESTED LOOPS                         |       |     1 |    50 |     7  (15)| 00:00:01 |
|   3 |    SORT UNIQUE                         |       |     1 |    25 |     4   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID BATCHED| T2    |     1 |    25 |     4   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | T2_I1 |    48 |       |     2   (0)| 00:00:01 |
|   6 |    TABLE ACCESS BY INDEX ROWID BATCHED | T1    |     1 |    25 |     2   (0)| 00:00:01 |
|*  7 |     INDEX RANGE SCAN                   | T1_I1 |     1 |       |     1   (0)| 00:00:01 |
|*  8 |   TABLE ACCESS BY INDEX ROWID BATCHED  | T2    |     1 |    13 |     2   (0)| 00:00:01 |
|*  9 |    INDEX RANGE SCAN                    | T2_I1 |     1 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$06B48120
   4 - SEL$06B48120 / GTT1@CTE
   5 - SEL$06B48120 / GTT1@CTE
   6 - SEL$06B48120 / T1@MAIN
   7 - SEL$06B48120 / T1@MAIN
   8 - SEL$06B48120 / GTT1@CTE
   9 - SEL$06B48120 / GTT1@CTE

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("OBJECT_NAME" LIKE 'WRI%')
   5 - access("OBJECT_ID">=100 AND "OBJECT_ID"<=200)
   7 - access("OBJECT_ID"="T1"."OBJECT_ID")
       filter("T1"."OBJECT_ID"<=200 AND "T1"."OBJECT_ID">=100)
   8 - filter("OBJECT_TYPE"='TABLE')
   9 - access("OBJECT_ID"="T1"."OBJECT_ID")
       filter("OBJECT_ID"<=200 AND "OBJECT_ID">=100)

As you can see, Oracle has copied the subquery text into the main body of the query and then optimized to produce a three-table join. One of the subqueries has been unnested into an aggregate view (operations 3,4,5), the other has been transformed into a semi-join.

In passing you’ll also notice that the optimizer has used transitive closure to add the range predicate on t1 to both occurrences of the t2 table.

And here’s the plan for the query with the single use of the subquery and materialize() hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |    48 |  2448 |    39   (8)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6611_F53A566 |       |       |            |          |
|   3 |    TABLE ACCESS FULL                     | T2                         | 10000 |   322K|    27   (8)| 00:00:01 |
|*  4 |   HASH JOIN SEMI                         |                            |    48 |  2448 |    13  (16)| 00:00:01 |
|   5 |    TABLE ACCESS BY INDEX ROWID BATCHED   | T1                         |    48 |  1200 |     4   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN                     | T1_I1                      |    48 |       |     2   (0)| 00:00:01 |
|*  7 |    VIEW                                  |                            | 10000 |   253K|     8  (13)| 00:00:01 |
|   8 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D6611_F53A566 | 10000 |   322K|     8  (13)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$A3F38ADC
   2 - CTE
   3 - CTE          / GTT1@CTE
   5 - SEL$A3F38ADC / T1@MAIN
   6 - SEL$A3F38ADC / T1@MAIN
   7 - SEL$AA28F105 / V1A@SEL$1
   8 - SEL$AA28F105 / T1@SEL$AA28F105

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("V1A"."OBJECT_ID"="T1"."OBJECT_ID")
   6 - access("T1"."OBJECT_ID">=100 AND "T1"."OBJECT_ID"<=200)
   7 - filter("V1A"."OBJECT_TYPE"='TABLE' AND "V1A"."OBJECT_ID">=100 AND "V1A"."OBJECT_ID"<=200)

In this plan the optimizer has created an in-memory temporary table and then used it in the existence subquery – which it has then transformed into a semi-join, so we have a query block with the name SEL$A3F38ADC; but we also see that the query block CTE still exists, labelling the operations that Oracle used to populate the temporary table.

It is an interesting (and irritating) detail that when we look at object aliases we see (operation 8) that Oracle has given the temporary table the alias of t1 – which is just a little confusing since I actually have a table called t1!

Next Steps

Being able to nominate a query block for the inline() and materialize() hints may be of great help in some cases (there’s a recent example on the Oracle Developer Forum (may need a MOS login) where it might make a huge difference to the performance of a particular query without requiring a rewrite of the SQL).

But there are a couple of details to investigate. First, I had a query block name built into my factored subquery – what happens if the author of the SQL didn’t include a query block name?

Before I’d added the inline() hint and query block names in the first example above this is what the plan looked like:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |    48 |  6240 |    48  (11)| 00:00:01 |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6612_F53A566 |       |       |            |          |
|   3 |    TABLE ACCESS FULL                     | T2                         | 10000 |   322K|    27   (8)| 00:00:01 |
|*  4 |   HASH JOIN SEMI                         |                            |    48 |  6240 |    21  (15)| 00:00:01 |
|*  5 |    HASH JOIN SEMI                        |                            |    48 |  4992 |    13  (16)| 00:00:01 |
|   6 |     TABLE ACCESS BY INDEX ROWID BATCHED  | T1                         |    48 |  1200 |     4   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN                    | T1_I1                      |    48 |       |     2   (0)| 00:00:01 |
|*  8 |     VIEW                                 |                            | 10000 |   771K|     8  (13)| 00:00:01 |
|   9 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6612_F53A566 | 10000 |   322K|     8  (13)| 00:00:01 |
|* 10 |    VIEW                                  |                            | 10000 |   253K|     8  (13)| 00:00:01 |
|  11 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D6612_F53A566 | 10000 |   322K|     8  (13)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$A317D234
   2 - SEL$1
   3 - SEL$1        / GTT1@SEL$1
   6 - SEL$A317D234 / T1@SEL$2
   7 - SEL$A317D234 / T1@SEL$2
   8 - SEL$D67CB2D2 / V1B@SEL$4
   9 - SEL$D67CB2D2 / T1@SEL$D67CB2D2
  10 - SEL$D67CB2D3 / V1A@SEL$3
  11 - SEL$D67CB2D3 / T1@SEL$D67CB2D3

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("V1A"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - access("V1B"."OBJECT_ID"="T1"."OBJECT_ID")
   7 - access("T1"."OBJECT_ID">=100 AND "T1"."OBJECT_ID"<=200)
   8 - filter("V1B"."OBJECT_NAME" LIKE 'WRI%' AND "V1B"."OBJECT_ID">=100 AND "V1B"."OBJECT_ID"<=200)
  10 - filter("V1A"."OBJECT_TYPE"='TABLE' AND "V1A"."OBJECT_ID">=100 AND "V1A"."OBJECT_ID"<=200)

As you can see, the factored subquery (operations 2 and 3) has the query block name of sel$1 and the main query (operations 6 and 7 where the real t1 is used) has the query block name sel$2. So without giving the subquery a name I could have used the hint /*+ inline(@sel$1) */ in the main query block.
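As a sketch, this is the first query with no qb_name() hints at all and the inline() hint addressed to the system-generated name. The name sel$1 is taken from the plan above; generated names depend on the position of each query block in the statement text, so check the alias section of the plan for your own statement before relying on one:

```sql
-- Sketch only: inline() aimed at the generated query block name
-- of the factored subquery (sel$1 in the plan above).

explain plan for
with v1 as (
        select
                object_id, object_type, object_name
        from    t2 gtt1
        where   object_id is not null
)
select
        /*+ inline(@sel$1) */
        t1.object_id,
        t1.object_name
from
        t1
where
        exists (
                select
                        null
                from
                        v1      v1a
                where
                        v1a.object_id = t1.object_id
                and     v1a.object_type = 'TABLE'
        )
and     exists (
                select
                        null
                from
                        v1      v1b
                where
                        v1b.object_id = t1.object_id
                and     v1b.object_name like 'WRI%'
        )
and
        t1.object_id between 100 and 200
/

select * from table(dbms_xplan.display(format=>'alias'));
```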

This takes us on to the second point that needs investigation. If you’ve looked at the example on the Oracle Developer Forum you will have seen that there’s an SQL statement that references a stored view and the factored subquery of interest is defined in the view. This means we might be able to edit the query that calls the view to include a hint referencing the query block inside the view – but then what do we do if we can’t edit the main query itself?

To be investigated (1) – would the inline() hint with nominated query block work if the factored subquery was inside a stored view that we were using in our query?

To be investigated(2) – if (1) works, could we achieve the same result by using an SQL Patch to attach the hint to the main query text without editing the main query?
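For anyone who wants to try the second experiment, this is roughly what the call to create an SQL Patch might look like, using the dbms_sqldiag.create_sql_patch() function available from 12.2. Whether the patched hint actually takes effect is exactly the open question, so treat this as an untested sketch: the substitution variable m_sql_id and the patch name inline_cte_patch are hypothetical, and the statement has to be in the library cache for the sql_id form of the call to find it.

```sql
-- Sketch only, untested: attach inline(@cte) to an existing
-- statement through an SQL Patch instead of editing its text.
-- &m_sql_id and the patch name are placeholders.

declare
        m_patch_name    varchar2(128);
begin
        m_patch_name := dbms_sqldiag.create_sql_patch(
                sql_id    => '&m_sql_id',
                hint_text => 'inline(@cte)',
                name      => 'inline_cte_patch'
        );
end;
/

-- To remove the patch after testing:
-- execute dbms_sqldiag.drop_sql_patch(name => 'inline_cte_patch')
```

If the patch is used, the Note section of the execution plan should report it by name.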

Update (Oct 2020)

It turns out that I discovered this enhancement a few months ago while doing some experimentation with recursive subquery factoring.

Update Nov 2020

A blog note from Nenad Noveljic warns of a surprising ORA-07445 if you get too trigger-happy with the inline() and materialize() hints.

October 8, 2020

Direct Path

Filed under: Infrastructure,Oracle,Performance,Troubleshooting,Tuning,Upgrades — Jonathan Lewis @ 12:29 pm BST Oct 8,2020

This is a little addendum to a note I wrote a couple of days ago about serial direct path reads and KO (fast object checkpoint) enqueue waits.

The original note was prompted by a problem where someone had set the hidden parameter “_serial_direct_read” to ‘always’ because they were running 11g and wanted some “insert as select” statements to use direct path reads on the select portion – and 11g wasn’t co-operating.

Serial direct path reads were introduced as a possibility in (at least) the 8.1.7.4 timeline, but the parameter was set to false until 11gR2 where it changed to auto. (Legal values are: true, false, always, never, auto)
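A rough sketch of how the current setting could be checked. Hidden parameters normally don't show up in v$parameter unless they have been set explicitly, so a common approach is to query the underlying x$ structures while connected as SYS. The query below is a widely used pattern and is not part of the original test script.

```sql
-- Assumption: connected as SYS (x$ structures are not visible to ordinary users).
select
        i.ksppinm       parameter_name,
        v.ksppstvl      session_value,
        v.ksppstdf      is_default
from
        x$ksppi         i,
        x$ksppcv        v
where
        v.indx = i.indx
and     i.ksppinm = '_serial_direct_read'
/
```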

In 11.2, though, even though a simple select statement could use serial direct path reads for segment scans, Oracle would not use the mechanism for “insert as select”.

This note is just a little piece of code to demonstrate the point.  Run it on 11g and (unless your buffer cache is large enough to make the test table “small”) Oracle will use direct path reads on the pure select, but scattered reads for the insert as select. Upgrade to 12.1 and Oracle will use direct path reads on both.

rem
rem     Script:         serial_fail.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem

create table t1
as
select
        ao.*
from
        all_objects     ao,
        (select rownum from dual connect by level <= 16) mult
/

create table t2
as
select  *
from    t1
where   rownum = 0
/

alter system flush buffer_cache;

prompt  =============
prompt  Simple Select
prompt  =============

execute snap_events.start_snap
select * from t1 where object_id = 98765;
execute snap_events.end_snap

prompt  ================
prompt  Insert as select
prompt  ================

execute snap_events.start_snap
insert into t2
select * from t1 where object_id = 98765;
execute snap_events.end_snap

prompt  =====================
prompt  Insert as select with
prompt  _serial_direct=always
prompt  =====================

alter session set "_serial_direct_read"=always;

execute snap_events.start_snap
insert /* serial direct */ into t2
select * from t1 where object_id = 98765;
execute snap_events.end_snap

alter session set "_serial_direct_read"=auto;

The calls to the snap_events package are there to produce the change in v$session_event for my session during the SQL.
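The snap_events package itself isn't shown here, so the following is only a minimal sketch of the kind of query it might be built on, not its actual implementation. Running it before and after the statement under test and subtracting the two sets of figures gives the change for the session.

```sql
-- Cumulative wait events for the current session.
-- Assumption: the user has select privilege on v$session_event.
select
        event,
        total_waits,
        total_timeouts,
        time_waited,            -- centiseconds
        average_wait,           -- centiseconds
        max_wait                -- centiseconds
from
        v$session_event
where
        sid = to_number(sys_context('userenv','sid'))
order by
        event
/
```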

You’ll notice I’ve included three main SQL statements rather than two – the third statement (2nd execution of the insert) is to demonstrate that it is possible to get direct path reads on the insert by setting the hidden parameter to ‘always’.

One detail to remember when testing this particular feature (and the same guideline applies to some other features): the “direct / not direct” choice becomes an attribute of the cursor, it’s not an attribute of the execution plan. This is why I’ve added a comment to the 2nd insert; if I hadn’t done so Oracle would have reused the (identical text) cursor from the first insert, which would have resulted in scattered reads being used instead of direct path reads. This distinction between cursor attributes and plan attributes explains why there is no hint that will allow you to force direct path reads for a specific query (not even the infamous opt_param() hint). [Ed: in response to a recent retweet for this note Roger MacNicol pointed out that any cursor attribute is treated as a hint unless forced (I presume that’s “natural language” hint rather than the CBO meaning of hint) so the actual mechanism could still have been decided at the next runtime.]
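One way to confirm that the two inserts really did get separate cursors is to look for them in v$sql after the test. This is a sketch only: the like pattern is an assumption based on the statement text in the script, and it may need adjusting if the text was typed differently.

```sql
-- Expect two different sql_id values: one for the plain insert and one for
-- the insert carrying the /* serial direct */ comment.
select
        sql_id, child_number, executions, sql_text
from
        v$sql
where
        sql_text like 'insert%into t2%'
order by
        sql_id, child_number
/
```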

Here are the three sets of results from a system running 11.2.0.4:

=============
Simple Select
=============

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               1           0           0.10        .100           4
direct path read                                    114           0          20.86        .183           6
SQL*Net message to client                             4           0           0.00        .000           0
SQL*Net message from client                           4           0           0.11        .028     174,435

================
Insert as select
================

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                              22           0           0.60        .027           4
db file scattered read                              130           0          35.97        .277           5
SQL*Net message to client                             4           0           0.01        .002           0
SQL*Net message from client                           4           0           0.10        .025     174,435

=====================
Insert as select with
_serial_direct=always
=====================

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
direct path read                                    108           0          17.44        .161           6
SQL*Net message to client                             4           0           0.00        .000           0
SQL*Net message from client                           4           0           0.09        .022     174,435


Note the db file scattered read waits in the middle test. If you re-run the test on 12.1.0.x (or later) you’ll find that the middle set of results will change to direct path read waits.

For reference, this limitation is covered by MOS note 13250070.8: Bug 13250070 – Enh: Serial direct reads not working in DML. The actual bug note is not public.

Footnote (a couple of hours later):

A response from Roger MacNicol to my publication tweet has told us that the bug note says only that direct path reads had been restricted unnecessarily and the restriction has been removed.

October 5, 2020

Direct Path

Filed under: Oracle,Performance,Problem Solving,RAC,Troubleshooting,Tuning — Jonathan Lewis @ 11:29 am BST Oct 5,2020

Here’s a note that I might have written once already – but I can’t find it and I’ve just been reminded about what it (might have) said by a posting that came up on the Oracle database forum in the last few days.

The posting in question is asking why, after setting the hidden parameter _serial_direct_read to ‘always’ a particular query is now taking hours to complete when it used to complete in a minute or so.

The answer is partly “because you’ve forced direct path serial reads”, partly “because you’re running on RAC” and (most directly) because the optimizer is using a really bad execution plan for that query and the direct path reads have had a massive impact as a consequence. (It turns out, after modelling, that the answer might also include “because you’re running 11.2.0.4”)

I’m going to demonstrate the issue by forcing a very simple query to take a very bad execution plan.

rem
rem     Script:         ko.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem 

create table t1
as
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue
;

create table t2
as
select  *
from    all_objects
where   rownum <= 10000  -- > comment to avoid wordpress format issue
;

select  table_name, blocks 
from    user_tables
where   table_name in ('T1','T2')
;

alter system flush buffer_cache;

execute snap_events.start_snap
execute snap_my_stats.start_snap
alter session set "_serial_direct_read"=always;


select
        /*+ 
                leading(t1 t2)
                use_nl(t2)
                full(t2)
        */
        t1.object_type,
        t2.object_type
from
        t1, t2
where
        t2.object_id = t1.object_id + 0.5
;

execute snap_my_stats.end_snap
execute snap_events.end_snap

alter session set "_serial_direct_read"=auto;


My query very carefully ensures that it’s not going to return any rows; but it’s going to do a lot of work finding no data because I’ve forced Oracle into doing a tablescan of t2 for every row in t1 – so 10,000 scans of a table of 140 – 190 (depending on version) data blocks.

The snap_my_stats and snap_events packages are a couple of my simple diagnostic packages that allow me to find the change in some v$ content between the start and end snapshots. In this case it’s v$mystat and v$session_event for the session.

On a test using 11.2.0.4 the query ran for about 41 seconds with the following wait events reported:

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.11        .054           1
direct path read                                  6,527           0         368.30        .056           6
SQL*Net message to client                            13           0           0.00        .000           0
SQL*Net message from client                          13           0      10,689.26     822.251      10,689

The 6,500 “direct path read” waits corresponded to 1.33M “physical reads direct” reported in the session activity stats. Although the t2 table was fairly small Oracle was forced to use direct path reads for every single cycle through the nested loop. As a quick comparison, here are the figures if I don’t force direct path serial scans.
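For readers without the snap_my_stats package, the session activity figures can be approximated by querying v$mystat directly before and after the test and taking the difference. This is a minimal sketch: the choice of the two extra statistics alongside “physical reads direct” is my own, made only to give some context.

```sql
-- Session activity stats relevant to direct path segment scans.
-- The values are cumulative for the session, so snapshot before and after.
select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in (
                'physical reads direct',
                'physical reads cache',
                'table scans (direct read)'
        )
order by
        sn.name
/
```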

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
db file sequential read                               2           0           0.09        .047           2
db file scattered read                               34           0           2.45        .072           0
SQL*Net message to client                            11           0           0.01        .001           0
SQL*Net message from client                          11           0         174.36      15.851      82,849



We’ve waited for only 34 “db file scattered reads” and 2.45 centiseconds as we read the t2 (and the t1) tables into the cache for the first time, but then we’ve been able to revisit the blocks in the cache. We also saw a reduction in CPU usage and the total run time dropped from 41 seconds to about 22 seconds.

In this tiny example it hasn’t made a staggering difference to the overall run time, but the OP wasn’t that lucky with his “couple of minutes” to “hours”.

If you look at the fragment of the SQL Monitor report supplied by the OP you’ll see that they have an operation which shows:

=====================================================================================================================================================================================================================================
| Id    |                  Operation                    |            Name              | Rows    | Cost |  Time     |  Start | Execs |    Rows  | Read | Read  | Cell   | Mem | Activity |            Activity Detail               |
|       |                                               |                              | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Offload |    |    (%)   |              (# samples)                 |
=====================================================================================================================================================================================================================================
| -> 25 |       INDEX STORAGE FAST FULL SCAN            | TMRC_IX1                     |      1  |      |     10745 |     +0 |  268K |     112K | 267K |   2GB |  96.53% | 1M |    99.74 | enq: KO - fast object checkpoint (4783)  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | Cpu (753)                                |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | gcs drm freeze in enter server mode (25) |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | latch free (1)                           |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | reliable message (3505)                  |
|       |                                               |                              |         |      |           |        |       |          |      |       |         |    |          | cell smart index scan (1635)             |
=====================================================================================================================================================================================================================================

The “index fast full scan” is an example of a “segment scan” and will be subject to direct path reads, just like a tablescan. We don’t really know how big this index is – but we can see that we have read it 268K times (Execs column) reading approximately 2GB after 267K read requests. This suggests the index is at most 1MB in size since it can be read in a single read request, and may consist of only one populated block (2,000,000,000/267,000 = 7,490 bytes). Despite this small size the total time sampled for all these scans is more than 10,600 seconds – roughly 39 milliseconds per scan. That’s not very good.
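The arithmetic behind those two estimates can be reproduced with a trivial query. The inputs are the rounded figures from the SQL Monitor report (2GB, 267K read requests, 268K executions, 10,600 seconds sampled), so the results are only approximations: about 7,490 bytes per read request and a little under 40 milliseconds per scan.

```sql
select
        trunc(2000000000 / 267000)              bytes_per_read_request,
        round(1000 * 10600 / 268000, 1)         ms_per_scan
from
        dual
/
```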

Looking more closely at the sampled time we notice 3 key features:

============================================
|            Activity Detail               |
|              (# samples)                 |
============================================
| enq: KO - fast object checkpoint (4783)  |
| Cpu (753)                                |
| gcs drm freeze in enter server mode (25) |
| latch free (1)                           |
| reliable message (3505)                  |
| cell smart index scan (1635)             |
============================================
  • A massive fraction of the time was spent on “enq: KO – fast object checkpoint”
  • A large fraction of the time was spent on “reliable message”
  • There was some time (relatively small, but large in absolute terms) for “gcs drm freeze …”

The last of these three is about “global cache services, distributed resource manager” and is a clear indication that the system is running RAC and we have some hefty competition for “object mastering” between instances. But the actual time lost there is relatively small – though associated chatter between instances could be significant.

But what’s the “KO enqueue”? Every time an instance starts a direct path segment scan it has to get a message to the database writer (possibly via the checkpoint process) – hence the “reliable message” waits – to copy every dirty block for that segment from the buffer cache down to disc and it has to wait for the write to complete. This is necessary to ensure that the tablescan doesn’t miss any changes that have been made in memory without yet being written to disc.

The KO enqueue synchronises this activity – I haven’t worked out the complete chain of events, but the enqueue is negotiated between the session and the checkpoint process – and if you’re running RAC every instance has to write any dirty blocks it is holding for the segment, so you have to have a degree of cross-instance chatter to make this happen.

Because of the enforced serial direct reads the OP’s plan – which, surely, shouldn’t expect to do 267K index fast full scans – carries a massive overhead from the need for repeated object checkpoints.

You may ask, at this point, why I didn’t see any KO enqueue waits in my test results – the answer is simple, I’d flushed the buffer cache before I started the test, so there were no dirty blocks for the session to worry about. Let’s see what happens if I introduce a little activity to keep dirtying a few blocks in the t2 table. Here’s a little loop that will update a few rows once per second:

begin
        for i in 1..1 loop
                update t2 set data_object_id = 0 where mod(object_id,1000) = i;
                dbms_output.put_line(sql%rowcount);
                commit;
                dbms_lock.sleep(1);
        end loop;
end;
/

You’ll notice the code runs through the loop just once – I started with a loop count of 30, and discovered it wasn’t necessary, but the option remains for further testing.

If I execute this anonymous block (which updates about 10 rows each time through the loop) from another session just after SQL*Plus reports my “alter session”, so that it runs just after the query starts, this is what the session event report looks like if I run the test against 11.2.0.4 (the version reported by the OP):

Event                                             Waits   Time_outs           Csec    Avg Csec    Max Csec
-----                                             -----   ---------           ----    --------    --------
enq: KO - fast object checkpoint                  7,645           0         290.34        .038           6
db file sequential read                               2           0           0.13        .066           2
direct path read                                 10,714           0         675.50        .063           6
SQL*Net message to client                            14           0           0.00        .000           0
SQL*Net message from client                          14           0       1,556.23     111.160     101,653
events in waitclass Other                         5,607           0         218.04        .039           8


Suddenly we see a lot of time spent on the KO enqueue waits and the “events in waitclass Other” (which turn out to be “reliable message” waits). Apparently the session keeps finding dirty t2 blocks in the cache and telling the database writer they need to be written to disc before the next tablescan of t2 can take place.

There’s something odd here, though, and I’ll introduce it with this comment: when I repeated the test on 19.3 (even with the constant trickle of updates once per second), I saw only a tiny number of KO enqueues and reliable message waits – the fact that we see a huge number of them in 11g is a defect in the design of the code.

Think about what’s happening with the KO enqueue: when you start the first tablescan of t2 you force every dirty block for that segment to be copied from the cache to the disc.

As the tablescan proceeds you may have to apply some undo change vectors to the blocks you’re reading to take them back to the SCN as at the start of query execution, but you know that any data that had been committed before the query started is on disc (even if it has been over-written by committed changes made after the query started, or by uncommitted changes made before the query started). What’s on the disc right now will be usable to get the correct read-consistent version of the data for the duration of the query run, no matter how many newer changes are made, whether or not they over-write the disc blocks before the query ends. There is no need to force write any dirty blocks as the tablescan is repeated and, it seems, by 19.3 the code has been adjusted to accommodate that fact.

Footnote

After I had posted this comment on the forum, the OP raised the question of whether or not the fix might apply to 12c as well – so I ran up a VM of 12.1.0.2 and 12.2.0.1 and re-ran the tests. The results were initially promising – neither version reported an extreme number of KO enqueue waits or reliable message waits.

However when I changed the loop counter from 1 back to 30 I found that the waits re-appeared – though the numbers were significantly less than those from 11g – so perhaps there’s a timing element involved that might need further investigation and stress testing even for 19.3.

Footnote 2

Another snapshot I took in testing was from v$enqueue_stat – which showed that (approximately) for every KO enqueue wait my session reported, the instance reported about 10 – 12 KO enqueue requests.

When checking v$enqueue_stat it’s important to remember that session activity stats (v$sesstat) report both “enqueue requests” and “enqueue conversions”. In v$enqueue_stat the conversions aren’t reported separately; the view simply adds the two figures together under “requests”.

In the 11g test the session reported 7,645 KO enqueue waits, but the session activity stats reported 19,399 enqueue requests and 38,796 enqueue conversions; while v$enqueue_stat reported 96,990 KO enqueue requests. The remaining 38,796 KO enqueue requests were made by the checkpoint process (CKPT) – and it was only my session that reported any waits for KO enqueue requests.
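A sketch of the two queries that could be used to collect figures like these. The first is instance-wide and specific to the KO enqueue; the second is for the current session only and covers every enqueue type, so the two sets of numbers are only comparable when the session is doing little else with enqueues. Both are cumulative, so they need to be snapshotted before and after the test.

```sql
-- Instance-level figures for the KO enqueue (requests include conversions).
select
        eq_type, total_req#, total_wait#, succ_req#, failed_req#,
        cum_wait_time           -- milliseconds
from
        v$enqueue_stat
where
        eq_type = 'KO'
/

-- Session-level figures: NOT broken down by enqueue type.
select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name in ('enqueue requests', 'enqueue conversions', 'enqueue waits')
order by
        sn.name
/
```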

Without further low-level investigation this is what leads me to believe that the session sends CKPT a message that it wants an object-level checkpoint performed and waits for the message to be acknowledged (reliable message) before trying to convert a low-level KO enqueue to an exclusive one. But CKPT has acquired and converted the same KO enqueue before acknowledging the message from the session and will only release the enqueue when the database writer acknowledges that the checkpoint request has been completed. So this mechanism ensures that the session will have to wait until the checkpoint is complete and CKPT has released the enqueue before it can continue processing.

(P.S. If you enable event 10704 you will find that there seem to be two KO enqueues involved – one with id2 = 1, one with id2 = 2.)
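For anyone wanting to repeat that check, the following is a minimal sketch of enabling the enqueue trace at the session level. The level of 10 is an assumption on my part (it is a commonly quoted value for this event), and the trace should only be enabled on a test system.

```sql
-- Assumption: level 10; the level used for the original observation isn't stated.
alter session set events '10704 trace name context forever, level 10';

-- ... run the test query here ...

alter session set events '10704 trace name context off';

-- Locate the trace file written by this session (11g and later).
select  value
from    v$diag_info
where   name = 'Default Trace File'
/
```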

September 9, 2020

Bloom Upgrade

Filed under: 18c,CBO,Joins,Oracle,Tuning,Upgrades — Jonathan Lewis @ 5:18 pm BST Sep 9,2020

It’s a common pattern of Oracle features that they start with various restrictions or limitations that disappear over time. This note is about an enhancement to Bloom filter processing that appeared in the 18.1 optimizer and, for some people, may be a good enough reason for upgrading to a newer version of Oracle. This enhancement came to my attention by way of the Oracle Developer forum in a thread with the title Bloom filters and view using UNION ALL asking how to get a Bloom filter pushed inside a UNION ALL view. The original requirement wasn’t a trivial one so I’ll demonstrate the problem with a very simple example – first the data set:

rem
rem     Script:         bloom_pushdown.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2020
rem
rem     Last tested:
rem             19.3.0.0
rem

create table t1 as select * from all_objects where rownum <= 50000;
create table t2 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4);
create table t3 as select t1.* from t1, (select rownum n1 from dual connect by level <= 4); -- > comment to avoid wordpress format issue

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1 for columns object_type size 254'
        );
end;
/

I’ve been a bit lazy here, copying data from view all_objects. I’ve gathered stats on t1 so that I can generate a histogram on the object_type column because I’m going to query for a rare object_type and I want the optimizer to get a reasonable estimate of rows. I’m going to hint a parallel query to join t1 to t2 (aliased, trivially, as v1 for reasons that will become apparent soon):

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        t2 v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

In my case the optimizer chooses to do a hash join between these two tables, and creates a Bloom filter to try and minimise the data passing through the data flow operation. The result set in my 12.2.0.1 database is only 16 rows, so it would be nice if the parallel scan could eliminate most of the 200,000 rows in t2 early – here’s the execution plan pulled from memory after running the query with rowsource execution stats enabled:


----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |      1 |        |   371 (100)|     16 |00:00:00.06 |      20 |      0 |       |       |          |
|   1 |  PX COORDINATOR        |          |      1 |        |            |     16 |00:00:00.06 |      20 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)  | :TQ10000 |      0 |     16 |   371   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN           |          |      2 |     16 |   371   (5)|     16 |00:00:00.05 |    6278 |   3988 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   6 |     JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|   7 |      PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.04 |    4244 |   3988 |       |       |          |
|*  8 |       TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.03 |    4244 |   3988 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"V1"."OBJECT_ID"))

We see that Oracle has generated a Bloom filter at operation 4 from the data returned from t1 at operation 5, and then used that Bloom filter at operation 6 to eliminate most of the data from t2 before passing the remaining few rows up to the hash join.
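A minimal sketch of one way to pull a plan like this from memory with rowsource execution statistics, assuming SQL*Plus and a session allowed to query the relevant v$ views. For a parallel query the 'last' format option would typically report only the query coordinator's activity, so this sketch uses 'allstats' without it. That choice is my assumption, not a statement of how the plan above was produced.

```sql
set serveroutput off

alter session set statistics_level = all;

-- ... run the query under test here ...

-- With no sql_id supplied, display_cursor reports the last statement
-- executed by this session.
select  *
from    table(dbms_xplan.display_cursor(format => 'cost allstats'))
/
```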

Let’s make the query more interesting – what if you want to use a UNION ALL of t2 and t3 in the query (for example one might be “current data” while the other is “historic data”)? Here’s the query and plan from 12.2.0.1:

select
        /*+ 
                parallel(2) 
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |      1 |        |   667 (100)|     32 |00:00:00.37 |      40 |      0 |       |       |          |
|   1 |  PX COORDINATOR         |          |      1 |        |            |     32 |00:00:00.37 |      40 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN            |          |      1 |     32 |   667   (5)|     32 |00:00:00.34 |    5125 |   3860 |  1250K|  1250K|     2/0/0|
|*  4 |     TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    2034 |      0 |       |       |          |
|   5 |     VIEW                |          |      2 |    400K|   584   (4)|    400K|00:00:00.52 |    8488 |   7976 |       |       |          |
|   6 |      UNION-ALL          |          |      2 |        |            |    400K|00:00:00.24 |    8488 |   7976 |       |       |          |
|   7 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|*  8 |        TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|    200K|00:00:00.07 |    4244 |   3988 |       |       |          |
|   9 |       PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|    200K|00:00:00.11 |    4244 |   3988 |       |       |          |
|* 10 |        TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|    200K|00:00:00.03 |    4244 |   3988 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   4 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
   8 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue

No Bloom filter – so all 400,000 rows feed up the plan and through the hash join. This won’t matter too much for my sub-second tiny data set but on a pair of 50GB tables, with the potential to offload the Bloom filter to storage in Exadata and, perhaps, eliminate 99% of the data at the cell servers, this could make a huge difference to performance.

Since Bloom filters are all about hashing data (in Oracle the standard Bloom filter is the bitmap summarising the build table in a hash join) let’s try pushing the optimizer into a hash distribution for the parallel join to see if that has any effect:


select
        /*+ 
                parallel(2) 
                gather_plan_statistics
                leading(@sel$1 t1@sel$1 v1@sel$1)
                use_hash(@sel$1 v1@sel$1)
                pq_distribute(@sel$1 v1@sel$1 hash hash)
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/

---------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
---------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |      1 |        |   667 (100)|     32 |00:00:00.43 |      60 |      0 |       |       |          |
|   1 |  PX COORDINATOR             |          |      1 |        |            |     32 |00:00:00.43 |      60 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |      0 |     32 |   667   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN BUFFERED       |          |      1 |     32 |   667   (5)|     32 |00:00:00.38 |    4000 |   3752 |  2290K|  2082K|     2/0/0|
|   4 |     JOIN FILTER CREATE      | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   5 |      PX RECEIVE             |          |      2 |      4 |    75   (4)|      8 |00:00:00.01 |       0 |      0 |       |       |          |
|   6 |       PX SEND HYBRID HASH   | :TQ10000 |      0 |      4 |    75   (4)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|   7 |        STATISTICS COLLECTOR |          |      2 |        |            |      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|   8 |         PX BLOCK ITERATOR   |          |      2 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|*  9 |          TABLE ACCESS FULL  | T1       |     26 |      4 |    75   (4)|      4 |00:00:00.01 |    1517 |      0 |       |       |          |
|  10 |     PX RECEIVE              |          |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  11 |      PX SEND HYBRID HASH    | :TQ10001 |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  12 |       JOIN FILTER USE       | :BF0000  |      2 |    400K|   584   (4)|     66 |00:00:00.77 |    8488 |   7976 |       |       |          |
|  13 |        VIEW                 |          |      2 |    400K|   584   (4)|    400K|00:00:00.68 |    8488 |   7976 |       |       |          |
|  14 |         UNION-ALL           |          |      2 |        |            |    400K|00:00:00.59 |    8488 |   7976 |       |       |          |
|  15 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.18 |    4244 |   3988 |       |       |          |
|* 16 |           TABLE ACCESS FULL | T2       |     32 |    200K|   292   (4)|    200K|00:00:00.06 |    4244 |   3988 |       |       |          |
|  17 |          PX BLOCK ITERATOR  |          |      2 |    200K|   292   (4)|    200K|00:00:00.12 |    4244 |   3988 |       |       |          |
|* 18 |           TABLE ACCESS FULL | T3       |     32 |    200K|   292   (4)|    200K|00:00:00.08 |    4244 |   3988 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   9 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue 
       filter("T1"."OBJECT_TYPE"='SCHEDULE') 
  16 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue
  18 - access(:Z>=:Z AND :Z<=:Z)   -- > edit to avoid wordpress format issue

We’ve managed to introduce a Bloom filter (which is visible as :BF0000 in the plan, even though there’s no reference to sys_op_bloom_filter() in the predicate information) but there’s a problem: we’re still passing 400,000 rows up the plan and the Bloom filter is only being applied at (or just after) the VIEW operator, discarding all but 66 rows before doing the hash join. It’s an improvement but not ideal; we’d like to see the Bloom filter applied to each of the two tables separately to eliminate rows as early as possible.

This can’t be done in 12.2, and you’d have to rewrite the query, changing a “join with union” into a “union of joins”, and that’s not really a desirable strategy.
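For reference, the “union of joins” rewrite might look something like the following. This is only a sketch, assuming the same t1, t2 and t3 tables as the test case; it comes with no plan or timings, and whether each branch gets its own Bloom filter would have to be checked (e.g. with gather_plan_statistics) on your own version.

```sql
-- Sketch only (not run as part of the original tests):
-- "join with union" rewritten as a "union of joins".
-- Each branch is a simple two-table hash join candidate,
-- at the price of scanning and filtering t1 twice.
select
        /*+ parallel(2) gather_plan_statistics */
        t1.object_name, t2.object_name
from
        t1, t2
where
        t1.object_type = 'SCHEDULE'
and     t2.object_id = t1.object_id
union all
select
        t1.object_name, t3.object_name
from
        t1, t3
where
        t1.object_type = 'SCHEDULE'
and     t3.object_id = t1.object_id
/
```

The obvious drawbacks are the second visit to t1 and the fact that every predicate has to be repeated in each branch, which is why it isn’t a desirable strategy for generated code.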

Next Steps

Searching MOS, though, you will be able to find the following note:

Doc ID 18849313.8 – ENH : bloom filters/pruning are pushed through union-all view

There’s an enhancement request to do what we want in 18.1, and the enhancement has got into the software. Here’s the (unhinted) plan from 19.3 (the plan stays the same when optimizer_features_enable is set back to 18.1.0, but drops back to the 12.2 plan when OFE is set to 12.2.0.1):

------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem |  O/1/M   |
------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |      1 |        |   666 (100)|     32 |00:00:00.11 |      10 |      0 |       |       |          |
|   1 |  PX COORDINATOR          |          |      1 |        |            |     32 |00:00:00.11 |      10 |      0 | 73728 | 73728 |          |
|   2 |   PX SEND QC (RANDOM)    | :TQ10000 |      0 |     32 |   666   (5)|      0 |00:00:00.01 |       0 |      0 |       |       |          |
|*  3 |    HASH JOIN             |          |      2 |     32 |   666   (5)|     32 |00:00:00.05 |   10020 |   7958 |  1250K|  1250K|     2/0/0|
|   4 |     JOIN FILTER CREATE   | :BF0000  |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|*  5 |      TABLE ACCESS FULL   | T1       |      2 |      4 |    75   (4)|      8 |00:00:00.01 |    1998 |      0 |       |       |          |
|   6 |     VIEW                 |          |      2 |    400K|   583   (4)|     32 |00:00:00.04 |    8022 |   7958 |       |       |          |
|   7 |      UNION-ALL           |          |      1 |        |            |     12 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   8 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|   9 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 10 |         TABLE ACCESS FULL| T2       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  11 |       JOIN FILTER USE    | :BF0000  |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|  12 |        PX BLOCK ITERATOR |          |      2 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
|* 13 |         TABLE ACCESS FULL| T3       |     32 |    200K|   292   (4)|     16 |00:00:00.02 |    4011 |   3979 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - access("V1"."OBJECT_ID"="T1"."OBJECT_ID")
   5 - filter("T1"."OBJECT_TYPE"='SCHEDULE')
  10 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T2"."OBJECT_ID")) 
  13 - access(:Z>=:Z AND :Z<=:Z)      -- > edit to avoid wordpress format issue
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"T3"."OBJECT_ID"))

As you can see, we create a Bloom filter at operation 4, and use it twice at operations 8 and 11 – with the sys_op_bloom_filter() functions clearly visible in the predicate information showing us that the Bloom filter is applied to the object_id column in both cases.

If you want to disable this enhancement for some reason there are two hidden parameters available, shown here with the value that leaves the enhancement enabled (you might set them to false for a single query using the opt_param() hint):

  • _bloom_filter_setops_enabled = true
  • _bloom_pruning_setops_enabled = true

The first is for Bloom filters in the situation shown; I assume the second deals with Bloom filters for partition pruning.
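As an illustration of the opt_param() approach, here is a minimal sketch of the test query with the first parameter switched off for that statement only. It assumes the base query carries nothing more than the parallel(2) and gather_plan_statistics hints, and it has not been run as part of the tests above; as with any hidden parameter, check with Oracle Support before using it on a production system.

```sql
-- Sketch only: disable the "Bloom filter through set operation"
-- enhancement for this one statement.
-- Assumption: the value is passed as the quoted string 'false'.
select
        /*+
                parallel(2)
                gather_plan_statistics
                opt_param('_bloom_filter_setops_enabled','false')
        */
        t1.object_name, v1.object_name
from
        t1,
        (select * from t2 union all select * from t3) v1
where
        t1.object_type = 'SCHEDULE'
and     v1.object_id = t1.object_id
/
```

If the hint takes effect you would expect the JOIN FILTER USE operations to disappear from below the UNION-ALL, but that is something to confirm from the execution plan rather than take on trust.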

Summary

In versions prior to 18.1 the optimizer is unable to push Bloom filters down to the individual tables in a UNION ALL view, but this limitation was removed in the 18.1 code set.

 

July 15, 2020

Fetch First vs. Rownum

Filed under: Execution plans,Oracle,Partitioning,Performance,Problem Solving,Tuning — Jonathan Lewis @ 10:11 am BST Jul 15,2020

I’ve pointed out fairly frequently that if you’re running Standard Edition but would like to take advantage of a few features of the Partitioning option then you might be able to do something appropriate with Partition Views (but I’ve just discovered while searching my blog for a suitable item to link to that I haven’t published any of my PV notes on the blog).

I’ve also pointed out that while 12c allows you to use “fetch first N rows” instead of “where rownum <= N” there’s a hidden threat to using the feature because “fetch first N” turns into a hidden row_number() over() analytic function.

Today’s note is a combination of these two topics, prompted by a request to solve a particular coding issue that has appeared a few times on the Oracle Developer Forum and is likely to be recognisable to a number of people.

I’ll start with a very simple model demonstrating the simplest use of “fetch first N”:


rem
rem     Script:         fetch_first_union.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t_odd
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'O'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum - 1                  id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

alter table t_odd modify(flag not null, class not null, id not null);
alter table t_odd add constraint to_chk_odd check (flag = 'O');

create index to_i1 on t_odd(class, id);

With this data set I want to write a query that selects rows for class A where id > 9500, ordered by id – but I only want the first two rows. Here’s a very simple query that gets the result I want, followed by the execution plan from 12.2.0.1 (the A-Rows and E-Rows from 19.3 are slightly different):

set serveroutput off
set linesize 180

alter session set statistics_level = all;

select  /*+ index(t_odd (class, id)) */
        *
from
        t_odd
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    95 (100)|      2 |00:00:00.01 |       6 |
|*  1 |  VIEW                         |       |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       6 |
|*  2 |   WINDOW NOSORT STOPKEY       |       |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       6 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T_ODD |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       6 |
|*  4 |     INDEX RANGE SCAN          | TO_I1 |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       3 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "T_ODD"."CLASS","T_ODD"."ID")<=2)
   4 - access("CLASS"='A' AND "ID">9500)

The plan – with its stats – shows us that we’ve done an index range scan of an index which will return the data in exactly the order we want, and the “fetch first 2 rows” has been translated into the row_number() over() that we expected; but to our great joy the “window nosort stopkey” makes the processing stop very early because Oracle recognises that the base data is arriving in the right order so it isn’t necessary to fetch all of it and sort it. The A-Rows column confirms this interpretation of what has happened.

You might notice, by the way, that the optimizer has costed the query as if it were fetching all the rows even though it “knows” that it’s going to fetch only the first two rows. That’s why I had to include the index hint to make the optimizer use the obvious index – a popular alternative is to use the /*+ first_rows(N) */ hint where N matches, or is similar to, the number of rows you want to fetch. If I had omitted the hint the optimizer would have done a full tablescan and then applied a “window sort pushed rank” operation to sort and limit the result to 2 rows.
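For comparison, the first_rows(N) alternative mentioned above would look something like this. It is a sketch against the same t_odd table, with N set to 2 to match the fetch first clause; no plan is shown for it here, and whether the optimizer then picks the to_i1 index on its own will depend on your statistics and version.

```sql
-- Sketch only: replace the index() hint with first_rows(2) so that
-- the optimizer costs the query for fetching the first 2 rows.
select  /*+ first_rows(2) */
        *
from
        t_odd
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch
        first 2 rows only
;

-- Check the plan that was actually used for the query above.
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));
```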

So now we come to the real problem: the user has a “current” table and an identical “history” table, and would like to replace the table reference with a reference to a union all view for their clients on Standard Edition, or to a partitioned table for clients running Enterprise Edition – and they don’t really want to do any other code changes. So let’s see what happens when we model the union all. I started with a table called t_odd that held only odd values for id, so I’m going to add a table called t_even that holds only even values for id.

create table t_even
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'E'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum                      id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

alter table t_even modify(flag not null, class not null, id not null);
alter table t_even add constraint te_chk_even check (flag = 'E');

create index te_i1 on t_even(class, id);

create or replace view v_bare 
as
select * from t_odd
union all
select * from t_even
/

select
        /*+ 
                index(vw.t_odd  (class, id)) 
                index(vw.t_even (class, id)) 
        */
        *
from
        v_bare vw
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;

As you can see t_even is an identically structured table with similar data, and I’ve created a union all view on top of the two tables, changing the query to reference the view rather than referencing a table. Thanks to the costing anomaly (combined with the small size of the tables) I’ve had to supply a couple of “global” hints to tell the optimizer to use the indexes to access the two tables. So how well does the optimizer do its job when we have a union all view?


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |        |      1 |        |   192 (100)|      2 |00:00:00.01 |     190 |       |       |          |
|*  1 |  VIEW                                   |        |      1 |      2 |   192   (2)|      2 |00:00:00.01 |     190 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK               |        |      1 |    404 |   192   (2)|      2 |00:00:00.01 |     190 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                                 | V_BARE |      1 |    404 |   191   (1)|    404 |00:00:00.01 |     190 |       |       |          |
|   4 |     UNION-ALL                           |        |      1 |        |            |    404 |00:00:00.01 |     190 |       |       |          |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_ODD  |      1 |    202 |    95   (0)|    202 |00:00:00.01 |      95 |       |       |          |
|*  6 |       INDEX RANGE SCAN                  | TO_I1  |      1 |    202 |     2   (0)|    202 |00:00:00.01 |       2 |       |       |          |
|   7 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_EVEN |      1 |    202 |    95   (0)|    202 |00:00:00.01 |      95 |       |       |          |
|*  8 |       INDEX RANGE SCAN                  | TE_I1  |      1 |    202 |     2   (0)|    202 |00:00:00.01 |       2 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "VW"."CLASS","VW"."ID")<=2)
   6 - access("CLASS"='A' AND "ID">9500)
   8 - access("CLASS"='A' AND "ID">9500)

Answer: Bad luck, the optimizer isn’t smart enough to find a cheap way through this query. It’s fetched all the relevant data from the two tables before applying the window sort (which it does with some efficiency – the pushed rank) to produce the right answer. As you can see from the A-Rows column, though, it’s had to acquire a couple of hundred rows from each table before getting down to the 2 rows we wanted.

Partitioned Tables

So let’s try to solve the problem by buying into the partitioning option and creating a list-partitioned table with two partitions, one flagged for current data and one flagged for history data – or ‘O’dd and ‘E’ven data as I’ve created them in my model.



create table t_pt (
        flag,
        class,
        id,
        v1,
        padding
)
partition by list (flag) (
        partition pO values('O'),
        partition pE values('E')
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'O'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum - 1                  id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

insert into t_pt
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'E'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum                      id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;


create index tp_i1 on t_pt(class, id) local;
alter table t_pt modify (flag not null, class not null, id not null);

execute dbms_stats.gather_table_stats(user,'t_pt',method_opt=>'for all columns size 1', cascade=>true, granularity=>'ALL')

Note particularly that I have created a local index on this partitioned table – so there’s a very close correspondence between the two tables in the previous example and the two partitions in this example. Here’s the plan when I query the partitioned table for the first two rows:


select  /*+ index(t_pt (class, id)) */
        *
from
        t_pt
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;


--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   190 (100)|      2 |00:00:00.01 |     189 |       |       |          |
|*  1 |  VIEW                                        |       |      1 |      2 |   190   (2)|      2 |00:00:00.01 |     189 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK                    |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |     189 |  2048 |  2048 | 2048  (0)|
|   3 |    PARTITION LIST ALL                        |       |      1 |    404 |   189   (1)|    404 |00:00:00.01 |     189 |       |       |          |
|   4 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T_PT  |      2 |    404 |   189   (1)|    404 |00:00:00.01 |     189 |       |       |          |
|*  5 |      INDEX RANGE SCAN                        | TP_I1 |      2 |    404 |     4   (0)|    404 |00:00:00.01 |       4 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "T_PT"."CLASS","T_PT"."ID")<=2)
   5 - access("CLASS"='A' AND "ID">9500)

The optimizer has let us down again. The plan shows us that we have to acquire all the relevant data from the two partitions before applying the row_number() analytic function and discarding all but the two rows we want. (Again we can check the A-Rows column to see that we have started by fetching a total of 404 rows from the table.)

But what happens if we fall back to the good old-fashioned (non-standard) rownum method:


select
        *
from    (
        select  /*+ index(t_pt (class, id)) */
                *
        from
                t_pt
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        )
where
        rownum <= 2
;

----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |       |      1 |        |   190 (100)|      2 |00:00:00.01 |       6 |       |       |          |
|*  1 |  COUNT STOPKEY                                 |       |      1 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|   2 |   VIEW                                         |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |       6 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY                       |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |       6 |  2048 |  2048 | 2048  (0)|
|   4 |     PARTITION LIST ALL                         |       |      1 |    404 |   189   (1)|      4 |00:00:00.01 |       6 |       |       |          |
|*  5 |      COUNT STOPKEY                             |       |      2 |        |            |      4 |00:00:00.01 |       6 |       |       |          |
|   6 |       TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T_PT  |      2 |    404 |   189   (1)|      4 |00:00:00.01 |       6 |       |       |          |
|*  7 |        INDEX RANGE SCAN                        | TP_I1 |      2 |    404 |     4   (0)|      4 |00:00:00.01 |       4 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=2)
   3 - filter(ROWNUM<=2)
   5 - filter(ROWNUM<=2)
   7 - access("CLASS"='A' AND "ID">9500)

Oracle really knows how to use rownum well – notice how there is a count stopkey operation as a child to the partition list all operation, and that’s where our rownum <= 2 predicate is first applied. For each partition Oracle finds “the first two rows” and after it has collected two rows from every partition it sorts them (again with a stopkey) to find the top two in that subset. Check the A-Rows column – we selected a total of 4 rows from the table (2 per partition) and reduced that to 2 rows at operation 3.

Conclusion

There are some patterns of processing where partitioned tables can be a lot friendlier to coders than partition views; if you do have to stick with Standard Edition you can usually get what you want but the coding investment may be significantly higher. Even with partitioned tables, though, there are some “old-fashioned” Oracle methods that do a much nicer job than some of the new-fangled “ANSI” mechanisms.

Footnote

Part of the problem presented here revolves around the desire to keep a pattern of SQL generation that already exists, doing nothing more than replacing a table name with a view (or partitioned table) name.

As we’ve seen, if you start with a simple heap table and try to replace it with a partitioned table you have to use the rownum mechanism rather than the fetch first N rows mechanism.

If you’re running with Standard Edition you can’t do anything simple to replace a table name with the name of a union all view; you’d have to change your code generator to apply all the predicates twice (once for each table) and then apply the rownum predicate or fetch first directive again outside the union all. In other words you have to emulate exactly what Oracle EE manages to do with partitioned tables and rownum.

select
        flag, class, id, v1
from
        (
        select
                /*+ index(t_odd (class, id)) */
                flag, class, id, v1
        from
                t_odd
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        fetch
                first 2 rows only
        )
        union all
        (
        select
                /*+ index(t_even (class, id)) */
                flag, class, id, v1
        from
                t_even
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        fetch
                first 2 rows only
        )
order by
        class, id
fetch 
        first 2 rows only
;

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |        |      1 |        |   192 (100)|      2 |00:00:00.01 |       8 |       |       |          |
|*  1 |  VIEW                              |        |      1 |      2 |   192   (2)|      2 |00:00:00.01 |       8 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK          |        |      1 |      4 |   192   (2)|      2 |00:00:00.01 |       8 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                            |        |      1 |      4 |   191   (1)|      4 |00:00:00.01 |       8 |       |       |          |
|   4 |     UNION-ALL                      |        |      1 |        |            |      4 |00:00:00.01 |       8 |       |       |          |
|   5 |      VIEW                          |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|*  6 |       VIEW                         |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY       |        |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       4 | 73728 | 73728 |          |
|   8 |         TABLE ACCESS BY INDEX ROWID| T_ODD  |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       4 |       |       |          |
|*  9 |          INDEX RANGE SCAN          | TO_I1  |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       2 |       |       |          |
|* 10 |      VIEW                          |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|* 11 |       WINDOW NOSORT STOPKEY        |        |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       4 | 73728 | 73728 |          |
|  12 |        TABLE ACCESS BY INDEX ROWID | T_EVEN |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       4 |       |       |          |
|* 13 |         INDEX RANGE SCAN           | TE_I1  |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       2 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_007"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "from$_subquery$_006"."CLASS","from$_subquery$_006"."ID")<=2)
   6 - filter("from$_subquery$_003"."rowlimit_$$_rownumber"<=2)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "CLASS","ID")<=2)
   9 - access("CLASS"='A' AND "ID">9500)
  10 - filter("from$_subquery$_005"."rowlimit_$$_rownumber"<=2)
  11 - filter(ROW_NUMBER() OVER ( ORDER BY "CLASS","ID")<=2)
  13 - access("CLASS"='A' AND "ID">9500)


As you can see, the E-Rows still predicts a lot of work, but the A-Rows tells us the work was kept to the minimum we want.
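The same emulation can be written with the rownum predicate instead of the fetch first directive. The following is a sketch only, using the same t_odd and t_even tables and indexes; it has not been run as part of the tests above, so no plan or A-Rows figures are claimed for it.

```sql
-- Sketch only: "top 2 per table, then top 2 overall" using rownum.
-- Each inner inline view orders one table's candidate rows, the
-- rownum predicate wrapped around it keeps the first two, and the
-- outer layers repeat the pattern across the union all.
select
        flag, class, id, v1
from    (
        select  flag, class, id, v1
        from    (
                select  *
                from    (
                        select  /*+ index(t_odd (class, id)) */
                                flag, class, id, v1
                        from    t_odd
                        where   class = 'A'
                        and     id > 9500
                        order by
                                class, id
                        )
                where   rownum <= 2
                union all
                select  *
                from    (
                        select  /*+ index(t_even (class, id)) */
                                flag, class, id, v1
                        from    t_even
                        where   class = 'A'
                        and     id > 9500
                        order by
                                class, id
                        )
                where   rownum <= 2
                )
        order by
                class, id
        )
where
        rownum <= 2
;
```

The extra layer of inline views is needed because a rownum predicate is applied before the order by of its own query block, so each “order, then limit” step has to be split across two levels.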

 

July 13, 2020

Min/Max costing

Filed under: Oracle,Performance,subqueries,Tuning,Upgrades — Jonathan Lewis @ 1:07 pm BST Jul 13,2020

A question about the min/max index scan appeared on the Oracle Developer Community forum recently. The query supplied in the thread was a little odd – you might ask why anyone would run it as it stands – and I’ve modified it to make it even stranger to demonstrate a range of details.

I’ll start with a simple data set, not bothering to collect stats because that will be done automatically on create for my versions:

rem
rem     Script:         min_max_cost_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 
as 
select  * 
from    all_objects 
where   rownum <= 50000 -- > comment to avoid wordpress format issue
;

create index t1_i1 on t1(object_name);

Now a few simple queries – for which I’ll capture and display the in-memory execution plans a little further on:


set linesize 156
set pagesize 60
set trimspool on
set serveroutput off
alter session set statistics_level = all;

prompt  =====================
prompt  Baseline select max()
prompt  =====================

select max(object_name) from t1;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ============================
prompt  select max() with dummy join
prompt  ============================

select max(object_name) from t1, dual where dummy is not null;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  =============================================
prompt  select max() with dummy join and index() hint
prompt  =============================================

select /*+ index(t1) */  max(object_name) from t1, dual where dummy is not null;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ============================================
prompt  select max() with dummy join and inline view
prompt  ============================================

select  obj
from    (
        select  max(object_name)  obj
        from    t1
        ),
        dual 
where   dummy is not null
/

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

prompt  ====================================
prompt  select max() with existence subquery
prompt  ====================================

select max(object_name) from t1 where exists (select null from dual where dummy is not null);
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

prompt  ============================================
prompt  select max() with failing existence subquery
prompt  ============================================

select max(object_name) from t1 where exists (select null from dual where dummy is null);
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

With 50,000 rows and the appropriate index to allow Oracle to find the maximum value very quickly we expect the optimizer to invoke the “index full scan (min/max)” operation, visiting only the extreme leaf block of the index – and, indeed, we are not disappointed: that’s exactly what the baseline query shows us.

=====================
Baseline select max()
=====================
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |       |      1 |        |     3 (100)|      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|   2 |   INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
-----------------------------------------------------------------------------------------------------------

However, when we introduce the (as yet unjustified) join to dual something very different happens – the optimizer forgets all about the min/max optimisation and does an index fast full scan of the t1_i1 index, passing all 50,000 rows up to the parent operation.


============================
select max() with dummy join
============================
-------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |    50 (100)|      1 |00:00:00.02 |     360 |
|   1 |  SORT AGGREGATE        |       |      1 |      1 |            |      1 |00:00:00.02 |     360 |
|   2 |   NESTED LOOPS         |       |      1 |  50000 |    50   (6)|  50000 |00:00:00.01 |     360 |
|*  3 |    TABLE ACCESS FULL   | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   4 |    INDEX FAST FULL SCAN| T1_I1 |      1 |  50000 |    48   (7)|  50000 |00:00:00.01 |     357 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DUMMY" IS NOT NULL)

We could, of course, try hinting an index range (full) scan to see what happens – and the result is even more surprising: Oracle takes the hint, uses the min/max optimisation, and shows us that it didn’t take that path by default because it had “forgotten” how to cost it correctly.

Note the cost of 354 at operation 5 when the original min/max cost was 3. Note also that the optimizer thinks we have to visit all 50,000 index entries (E-Rows) even though, at run-time, Oracle correctly uses a path that visits only one index entry (A-Rows):


=============================================
select max() with dummy join and index() hint
=============================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |   356 (100)|      1 |00:00:00.01 |       6 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       6 |
|   2 |   NESTED LOOPS               |       |      1 |  50000 |   356   (2)|      1 |00:00:00.01 |       6 |
|*  3 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   4 |    FIRST ROW                 |       |      1 |  50000 |   354   (2)|      1 |00:00:00.01 |       3 |
|   5 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |  50000 |   354   (2)|      1 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DUMMY" IS NOT NULL)

Of course we could recognise that the t1 access and the access to dual could be de-coupled by moving the max() query into an inline view – and hope that the optimizer doesn’t use complex view merging to turn it back into a simple join (maybe we should have included a /*+ no_merge */ hint). Fortunately the optimizer doesn’t try merging the two query blocks, so it optimises the max(object_name) query block correctly, giving us the benefit of the min/max optimisation. I’ve included the ‘alias’ format option in this call to dbms_xplan.display_cursor() so that we can see the two query blocks that are optimised separately.


============================================
select max() with dummy join and inline view
============================================

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       6 |
|   1 |  NESTED LOOPS                |       |      1 |      1 |     5   (0)|      1 |00:00:00.01 |       6 |
|*  2 |   TABLE ACCESS FULL          | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   3 |   VIEW                       |       |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|   4 |    SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|   5 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SEL$1 / DUAL@SEL$1
   3 - SEL$2 / from$_subquery$_001@SEL$1
   4 - SEL$2
   5 - SEL$2 / T1@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DUMMY" IS NOT NULL)
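
The no_merge hint mentioned above would sit inside the inline view. This is a minimal sketch, assuming the same t1 and t1_i1 as the script; no plan is shown for this variant, so check on your own version that the view is still optimised as a separate query block with the INDEX FULL SCAN (MIN/MAX):

```sql
select  obj
from    (
        select  /*+ no_merge */         -- ask the optimizer not to merge this view
                max(object_name)  obj
        from    t1
        ),
        dual
where   dummy is not null
/

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));
```

The hint is only a safeguard here: the unhinted test above already kept the two query blocks apart.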

There is a maxim (or guideline, or rule of thumb) that if the from clause of a query includes tables that don’t get referenced in the select list then those tables should (probably) appear in subqueries. Of course this guideline sometimes turns out to be a very bad idea, and sometimes it just means the optimizer unnests the subqueries and recreates the joins we started with, but let’s try the approach with this query. I’ve included the ‘alias’ option again so that you can see that this plan is optimised as two query blocks, allowing the max(object_name) query block to find the min/max strategy.


====================================
select max() with existence subquery
====================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       6 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       6 |
|*  2 |   FILTER                     |       |      1 |        |            |      1 |00:00:00.01 |       6 |
|   3 |    FIRST ROW                 |       |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|   4 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 | 
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   4 - SEL$1 / T1@SEL$1
   5 - SEL$2 / DUAL@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NOT NULL)
   5 - filter("DUMMY" IS NOT NULL)

There’s a very important detail in the execution plan above. At first sight it looks like the optimizer has a plan using a simple filter subquery operation – which means you might be fooled into reading it as “for each row returned by operation 3 call operation 5”. This is not the case.

Because the subquery is not a correlated subquery – it’s an example that I sometimes call a “fixed” or (slightly ambiguously) “constant” subquery – Oracle can execute it once and use the resulting rowsource to decide whether or not to call the main query. It’s a case where (if you didn’t realise the plan consisted of two separate query blocks) you would say that Oracle was calling the second child first.

To prove this point I’ve set up one last variation of the query – the “failed subquery” version – where my select from dual returns no rows. Check the numbers of Starts shown for each line of the plan:


============================================
select max() with failing existence subquery
============================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|*  2 |   FILTER                     |       |      1 |        |            |      0 |00:00:00.01 |       3 |
|   3 |    FIRST ROW                 |       |      0 |      1 |     3   (0)|      0 |00:00:00.01 |       0 |
|   4 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      0 |      1 |     3   (0)|      0 |00:00:00.01 |       0 |
|*  5 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      0 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   4 - SEL$1 / T1@SEL$1
   5 - SEL$2 / DUAL@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NOT NULL)
   5 - filter("DUMMY" IS NULL)

The filter at operation 2 calls operation 5 – the query against dual – which runs once returning no rows. Since the subquery fails, the min/max scan of t1_i1 at operation 4 (and its parent at operation 3) doesn’t run at all: Starts = 0. Operation 5 was called before operation 4 was considered.
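
If you would rather pick out the Starts figures than read them from the formatted plan, the same rowsource statistics are exposed in v$sql_plan_statistics_all. The following is a sketch only, assuming the session has select privileges on the v$ views, that statistics_level is still set to all, and that it is run immediately after the test statement (in place of the dbms_xplan call, with serveroutput off) so that prev_sql_id still points at the test query:

```sql
select
        p.id,
        p.operation,
        p.options,
        p.object_name,
        p.last_starts,                  -- "Starts" for the most recent execution
        p.last_output_rows,             -- "A-Rows"
        p.last_cr_buffer_gets           -- "Buffers" (consistent gets only)
from
        v$sql_plan_statistics_all p
where
        (p.sql_id, p.child_number) = (
                select  s.prev_sql_id, s.prev_child_number
                from    v$session s
                where   s.sid = sys_context('userenv','sid')
        )
order by
        p.id
;
```

For the failing subquery you would expect last_starts to be zero on the two lines for the t1_i1 access and one on the line for dual, matching the plan above.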

Finally

This brings us back to the question – why would anyone run a strange query like this?

Perhaps the answer is that it’s just a demonstration of one part of a more complex query and what we’re trying to do is say: “if a certain record exists in a control table then include some information from table X”.

This note tells us that if there’s a possibility of a min/max optimisation to find the data then we should avoid using a join. Instead we should use a “fixed subquery” to check the control table, and maybe we’ll also have to write the part of our query that collects (or isn’t required to collect) the interesting bit of data as an inline view.
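
As an illustration of the “control table” pattern, here is a minimal sketch. The table control_table and its columns control_name and enabled are hypothetical names invented for the example, not anything from the original thread, and the plan should be checked on your own version before relying on it:

```sql
select  max(t1.object_name)
from    t1
where   exists (
                select  null
                from    control_table ct                        -- hypothetical control table
                where   ct.control_name = 'INCLUDE_T1'          -- hypothetical column and value
                and     ct.enabled      = 'Y'                   -- hypothetical column
        )
;
```

Because the subquery is not correlated to t1 it has the same shape as the tests against dual, so the hope is that it runs once and the min/max scan starts only if the control row exists. As in the failing test above, the statement still returns one row (holding a null) when the subquery finds nothing.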

 
