Oracle Scratchpad

July 13, 2020

Min/Max costing

Filed under: Oracle,Performance,subqueries,Tuning — Jonathan Lewis @ 1:07 pm BST Jul 13,2020

A question about the min/max index scan appeared on the Oracle Developer Community forum recently. The query supplied in the thread was a little odd – you might ask why anyone would run it as it stands – and I’ve modified it to make it even stranger to demonstrate a range of details.

I’ll start with a simple data set, not bothering to collect stats because that will be done automatically on create for my versions:

rem
rem     Script:         min_max_cost_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t1 
as 
select  * 
from    all_objects 
where   rownum <= 50000 -- > comment to avoid wordpress format issue
;

create index t1_i1 on t1(object_name);
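
As a quick sanity check on the claim that stats are in place without an explicit gather, the following sketch queries the data dictionary. It assumes the table and index were created in the current schema; whether the values are populated depends on your version and configuration, so treat the expectation as an assumption about the test system rather than a guarantee.

```sql
-- Sketch: confirm that optimizer stats exist for t1 and t1_i1
-- without an explicit call to dbms_stats.
-- Assumes both objects live in the current schema.

select  table_name, num_rows, blocks, last_analyzed
from    user_tables
where   table_name = 'T1'
;

select  index_name, blevel, leaf_blocks, num_rows, last_analyzed
from    user_indexes
where   index_name = 'T1_I1'
;
```

If last_analyzed is null for either object, a call to dbms_stats would be needed before comparing costs with the plans shown in this note.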

Now a few simple queries – for which I’ll capture and display the in-memory execution plans a little further on:


set linesize 156
set pagesize 60
set trimspool on
set serveroutput off
alter session set statistics_level = all;

prompt  =====================
prompt  Baseline select max()
prompt  =====================

select max(object_name) from t1;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ============================
prompt  select max() with dummy join
prompt  ============================

select max(object_name) from t1, dual where dummy is not null;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  =============================================
prompt  select max() with dummy join and index() hint
prompt  =============================================

select /*+ index(t1) */  max(object_name) from t1, dual where dummy is not null;
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last'));

prompt  ============================================
prompt  select max() with dummy join and inline view
prompt  ============================================

select  obj
from    (
        select  max(object_name)  obj
        from    t1
        ),
        dual 
where   dummy is not null
/

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

prompt  ====================================
prompt  select max() with existence subquery
prompt  ====================================

select max(object_name) from t1 where exists (select null from dual where dummy is not null);
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

prompt  ============================================
prompt  select max() with failing existence subquery
prompt  ============================================

select max(object_name) from t1 where exists (select null from dual where dummy is null);
select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));

With 50,000 rows and the appropriate index to allow Oracle to find the maximum value very quickly we expect the optimizer to invoke the “index full scan (min/max)” operation, visiting only the extreme leaf block of the index – and, indeed, we are not disappointed: that’s exactly what the baseline query shows us.

=====================
Baseline select max()
=====================
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |       |      1 |        |     3 (100)|      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|   2 |   INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
-----------------------------------------------------------------------------------------------------------

However, when we introduce the (as yet unjustified) join to dual something very different happens – the optimizer forgets all about the min/max optimisation and does an index fast full scan of the t1_i1 index, passing all 50,000 rows up to the parent operation.


============================
select max() with dummy join
============================
-------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |      1 |        |    50 (100)|      1 |00:00:00.02 |     360 |
|   1 |  SORT AGGREGATE        |       |      1 |      1 |            |      1 |00:00:00.02 |     360 |
|   2 |   NESTED LOOPS         |       |      1 |  50000 |    50   (6)|  50000 |00:00:00.01 |     360 |
|*  3 |    TABLE ACCESS FULL   | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   4 |    INDEX FAST FULL SCAN| T1_I1 |      1 |  50000 |    48   (7)|  50000 |00:00:00.01 |     357 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DUMMY" IS NOT NULL)

We could, of course, try hinting an index range (full) scan to see what happens – and the result is even more surprising: Oracle takes the hint, uses the min/max optimisation, and shows us that it didn’t take that path by default because it had “forgotten” how to cost it correctly.

Note the cost of 354 at operation 5 when the original min/max cost was 3. Note also that the optimizer thinks we have to visit all 50,000 index entries (the E-Rows figure) even though, at run-time, Oracle correctly uses a path that visits only one index entry (the A-Rows figure):


=============================================
select max() with dummy join and index() hint
=============================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |   356 (100)|      1 |00:00:00.01 |       6 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       6 |
|   2 |   NESTED LOOPS               |       |      1 |  50000 |   356   (2)|      1 |00:00:00.01 |       6 |
|*  3 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   4 |    FIRST ROW                 |       |      1 |  50000 |   354   (2)|      1 |00:00:00.01 |       3 |
|   5 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |  50000 |   354   (2)|      1 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("DUMMY" IS NOT NULL)

Of course we could recognise that the t1 access and the access to dual could be de-coupled by moving the max() into an inline view, and hope that the optimizer doesn’t use complex view merging to fall back to a simple join (maybe we should have included a /*+ no_merge */ hint). Fortunately the optimizer doesn’t try merging the two query blocks, so it optimises the max(object_name) query block correctly, giving us the benefit of the min/max optimisation. I’ve included the ‘alias’ format option in this call to dbms_xplan.display_cursor() so that we can see the two query blocks that are optimised separately.


============================================
select max() with dummy join and inline view
============================================

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       6 |
|   1 |  NESTED LOOPS                |       |      1 |      1 |     5   (0)|      1 |00:00:00.01 |       6 |
|*  2 |   TABLE ACCESS FULL          | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|   3 |   VIEW                       |       |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|   4 |    SORT AGGREGATE            |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|   5 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   2 - SEL$1 / DUAL@SEL$1
   3 - SEL$2 / from$_subquery$_001@SEL$1
   4 - SEL$2
   5 - SEL$2 / T1@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("DUMMY" IS NOT NULL)
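
The text before this plan suggests that a no_merge hint might have been a sensible precaution. A minimal sketch of that variant follows. It was not part of the original test script, so its plan is not shown; the expectation is only that the hint keeps the aggregate view as a separate query block, as happened here without it.

```sql
-- Sketch: same inline-view query, with an explicit request
-- that the optimizer should not merge the aggregate view.
-- Untested variant; not in the original script.

select  obj
from    (
        select  /*+ no_merge */
                max(object_name)  obj
        from    t1
        ),
        dual
where   dummy is not null
/

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last alias'));
```

Placing the hint inside the view's own query block avoids having to give the inline view an alias just so that it can be named in a hint from the outer query.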

There is a maxim (or guideline, or rule of thumb) that if the from clause of a query includes tables that don’t get referenced in the select list then those tables should (probably) appear in subqueries. Of course this guideline sometimes turns out to be a very bad idea, and sometimes it just means the optimizer unnests the subqueries and recreates the joins we started with, but let’s try the approach with this query. I’ve included the ‘alias’ option again so that you can see that this plan is optimised as two query blocks, allowing the max(object_name) query block to find the min/max strategy.


====================================
select max() with existence subquery
====================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       6 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       6 |
|*  2 |   FILTER                     |       |      1 |        |            |      1 |00:00:00.01 |       6 |
|   3 |    FIRST ROW                 |       |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|   4 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      1 |      1 |     3   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 | 
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   4 - SEL$1 / T1@SEL$1
   5 - SEL$2 / DUAL@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NOT NULL)
   5 - filter("DUMMY" IS NOT NULL)

There’s a very important detail in the execution plan above. At first sight it looks like the optimizer has a plan using a simple filter subquery operation – which means you might be fooled into reading it as “for each row returned by operation 3 call operation 5”. This is not the case.

Because the subquery is not a correlated subquery – it’s an example that I sometimes call a “fixed” or (slightly ambiguously) “constant” subquery – Oracle can execute it once and use the resulting rowsource to decide whether or not to call the main query. It’s a case where (if you didn’t realise the plan consisted of two separate query blocks) you would say that Oracle was calling the second child first.

To prove this point I’ve set up one last variation of the query – the “failed subquery” version – where my select from dual returns no rows. Check the numbers of Starts shown for each line of the plan:


============================================
select max() with failing existence subquery
============================================
-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |      1 |        |     5 (100)|      1 |00:00:00.01 |       3 |
|   1 |  SORT AGGREGATE              |       |      1 |      1 |            |      1 |00:00:00.01 |       3 |
|*  2 |   FILTER                     |       |      1 |        |            |      0 |00:00:00.01 |       3 |
|   3 |    FIRST ROW                 |       |      0 |      1 |     3   (0)|      0 |00:00:00.01 |       0 |
|   4 |     INDEX FULL SCAN (MIN/MAX)| T1_I1 |      0 |      1 |     3   (0)|      0 |00:00:00.01 |       0 |
|*  5 |    TABLE ACCESS FULL         | DUAL  |      1 |      1 |     2   (0)|      0 |00:00:00.01 |       3 |
-------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------
   1 - SEL$1
   4 - SEL$1 / T1@SEL$1
   5 - SEL$2 / DUAL@SEL$2

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter( IS NOT NULL)
   5 - filter("DUMMY" IS NULL)

The filter at operation 2 calls operation 5 – the query against dual – which runs once returning no rows. The min/max scan of t1_i1 at operation 4 doesn’t run (its Starts is zero, as is the Starts for its parent at operation 3). Operation 5 was called before operation 4 was considered.
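
The Starts figures can also be read directly from the dynamic performance view that dbms_xplan.display_cursor() reports from. The following is a sketch only: it assumes you have select privilege on the v$ views, that the child cursor of interest is child number 0, and that you supply the sql_id yourself through the SQL*Plus substitution variable m_sql_id (a name chosen here for illustration).

```sql
-- Sketch: Starts, rows and buffer gets per plan line for the
-- most recent execution of one child cursor.
-- &m_sql_id is a hypothetical substitution variable; the
-- child_number of 0 is an assumption.

select  id, operation, options, object_name,
        last_starts, last_output_rows, last_cr_buffer_gets
from    v$sql_plan_statistics_all
where   sql_id       = '&m_sql_id'
and     child_number = 0
order by
        id
;
```

For the failing subquery a last_starts of zero on the lines for the min/max scan, against a last_starts of one on the line for dual, would tell the same story as the formatted plan.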

Finally

This brings us back to the question – why would anyone run a strange query like this?

Perhaps the answer is that it’s just a demonstration of one part of a more complex query and what we’re trying to do is say: “if a certain record exists in a control table then include some information from table X”.

This note tells us that if there’s a possibility of a min/max optimisation to find the data then we should avoid using a join. Instead we should use a “fixed subquery” to check the control table, and we may also have to write the part of our query that collects (or isn’t required to collect) the interesting bit of data as an inline view.
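
A minimal sketch of how the two suggestions might be combined is shown below. The control table and its column (control_table, control_key) and the key value are hypothetical names invented for the illustration, the no_merge hint is a precaution rather than a proven requirement, and the query has not been tested against the versions listed at the top of this note.

```sql
-- Sketch: "if a certain record exists in a control table then
-- include some information from table X", where the information
-- can be found by a min/max scan.
-- control_table and control_key are hypothetical.

select  v.obj
from    (
        select  /*+ no_merge */
                max(object_name)  obj
        from    t1
        ) v
where   exists (
                select  null
                from    control_table ct
                where   ct.control_key = 'INCLUDE_MAX'
        )
/
```

One difference in behaviour to keep in mind: with the existence test outside the aggregate view, a failing subquery returns no rows at all, whereas the single-block version earlier in this note returns one row holding a null.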

 

2 Comments »

  1. Hi Mr. Lewis! Question:

    “There is a maxim (or guideline, or rule of thumb) that if the from clause of a query includes tables that don’t get referenced in the select list then those tables should (probably) appear in subqueries.”

    Can you explain the reasoning behind this? I’ve always thought the opposite…”if given the choice between an inner join and a subquery, you should use an inner join” to allow things like possibly more flexibility with the join order, and to not have the overhead associated with the unnested “semi-” joins.

    Comment by Kaley Crum — July 13, 2020 @ 1:20 pm BST Jul 13,2020 | Reply

    • Kaley,

      Thanks for the question.

      I don’t know where this guideline came from and I’ve never heard a formal justification; but I could imagine there are various points that can be combined to favour the approach. In no particular order:

      • Correctness: joins have the capacity for multiplying up the number of rows in the output; if you have a choice between a subquery and a join then you’re introducing the risk of getting the wrong answer by picking the join.
      • Comprehension: a good starting point for writing (and later debugging) the SQL is to translate from natural language to SQL, so when the requirement includes phrases like “only if there’s a matching product” or “and the xyz is one of the abc” the translation invites a subquery.
      • Optimizer function: we hope that the optimizer is smart enough to work out when it’s a good idea to turn a subquery that you’ve supplied into a join of some sort, and when it should keep it as a subquery, and where in the path it should execute the subquery for best effect.

      I have to say that many years ago I used to see cases fairly frequently where people had rewritten a subquery as a join, then got duplicated results and added in a “distinct” to eliminate the duplicates.

      Regards
      Jonathan Lewis

      Comment by Jonathan Lewis — July 13, 2020 @ 1:56 pm BST Jul 13,2020 | Reply
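
To illustrate the correctness point in the reply above, here is a small sketch of the join, the subquery, and the “distinct” patch. The tables customers(customer_id, name) and orders(order_id, customer_id) are hypothetical and exist only for this example.

```sql
-- Requirement: list the customers who have placed an order.
-- customers and orders are hypothetical tables.

-- Join: a customer with three orders appears three times.
select  c.customer_id, c.name
from    customers c, orders o
where   o.customer_id = c.customer_id
;

-- Subquery: each qualifying customer appears once.
select  c.customer_id, c.name
from    customers c
where   exists (
                select  null
                from    orders o
                where   o.customer_id = c.customer_id
        )
;

-- The patch often applied to the join: remove the duplicates
-- after they have been generated.
select  distinct c.customer_id, c.name
from    customers c, orders o
where   o.customer_id = c.customer_id
;
```

The third query gives the same result as the second only because customer_id is selected and is assumed to be unique in customers; without that, distinct could also collapse rows that ought to stay separate.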

