Oracle Scratchpad

October 6, 2020

Index FFS Cost 2

Filed under: Bugs,CBO,Indexing,Oracle,Statistics — Jonathan Lewis @ 12:58 pm BST Oct 6,2020

Here’s a little puzzle, highlighting a “bug that’s not a bug” that was “fixed but not fixed” some time in the 10.2 timeline. (If you want specifics about exactly when the fix became available and what patches might be available they’re in MOS – Bug 5099019 : DBMS_STATS DOESN’T COUNT LEAF_BLOCKS CORRECTLY.

Running 19.3.0.0, with the system statistics as shown:


begin
        dbms_stats.set_system_stats('MBRC',16);
        dbms_stats.set_system_stats('MREADTIM',10);
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',500);
end;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly 

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off

For my test run there are no uncommitted transactions, the gathering of table stats with cascade to indexes means all the blocks are clean so there’s no undo activity needed to produce an answer for the final query, and all that the query is going to do is run an index fast full scan to count the number of rows in the table because there’s an index on just (small_vc).

The system stats tell us that Oracle is going to cost the index fast full scan by taking the leaf block count, dividing by 16 (the MBRC) and multiplying by 2 ( mreadtim / sreadtim) and adding a bit for CPU usage. So here are the results from running the query and generating the plan with autotrace:


PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                        153

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |    22   (5)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|    22   (5)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        425  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Points to notice:

  • The cost is (as predicted):  ceiling(153 leaf_blocks / 16 MBRC ) * 2 + a bit:  = 20 + bit.
  • The number of physical blocks read is 1522, not the 153 leaf blocks reported in index stats
  • No recursive SQL, and I did say that there were no undo / read consistency issues to worry about

There is clearly an inconsistency between the size of the index and the (100% gathered) leaf_block count that the optimizer is using, and this is a point I made many years ago in “Cost Based Oracle – Fundamentals”.  To gather the leaf block count Oracle has looked at the number of leaf blocks that, after cleanout – and here’s how I prepared this demo:


rem
rem     Script:         index_ffs_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2012
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        rownum                  id,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid wordpress format issue
;

create index t1_v1 on t1(small_vc) pctfree 80;

delete
        from t1
where
        id between 5000 and 95000
;

commit;

When I gathered stats on the index I had just deleted 90% of the data from the table – so 90% of the leaf blocks in the index were empty but still in the index structure, and only 10% of the leaf blocks held any current index entries. This is why Oracle reported leaf_blocks = 153 while the index fast full scan had to read through nearly 1530 blocks.

This is one of those contradictory problems – for an index fast full scan you need to know the size of the segment that will be scanned because that’s the number of blocks you will have to examine; but in most cases the number of populated index leaf blocks is the number you need to know about when you’re trying to cost an index range scan. Of course in most cases the nature of Oracle’s implementation of B-tree indexes will mean that the two counts will be within one or two percent of each other. But there are a few extreme cases where you could get an index into a state where the segment size is large and the data set is small and you don’t want the optimizer to think that an index fast full scan will be low-cost.

Oracle produced a mechanism for getting the segment block count captured as the leaf_blocks statistic late in 10.2, but it’s not implemented by default, and it’s not something you can tweak into a query with the opt_param() hint. Fix control 509019 has the description: “set leaf blocks to the number of blocks in the index extent map”, but it’s not a run-time fix, it’s a fix that has to be in place when you gather the  index stats – thus:

alter session set "_fix_control"='5099019:1';

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly 

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off

===================================================

Session altered.

PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                       1555

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   201   (3)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|   201   (3)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        439  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Session altered.

As you can see Oracle has now set leaf_blocks = 1555. This is actually the number of blocks below the highwater mark in the segment. A further check with the dbms_space and dbms_space_usage packages showed that the number of blocks in the index structure was actually 1,523 with a further 32 blocks that we could infer were space management blocks. Of the 1,523 blocks in the index structure 157 were reported as “FULL” while 1364 were reported as FS2 which possibly ought to mean  “available for re-use” (though still in the structure), although this didn’t quite seem to be the case a few years ago.

Although Oracle has supplied a fix to a problem I highlighted in CBO-F, I can understand why it’s not enabled by default, and I don’t think I’d want to take advantage of it in a production system given the way it’s a session setting at stats gathering time. The number of times it likely to matter I’d probably add hints to the SQL to stop the optimizer from using the index incorrectly, or do something at a stats-gathering moment to call dbms_stats.set_index_stats().

10 Comments »

  1. The _table_scan_cost_plus_one parameter also hardcodes additional cost to FFS, that’s why “a bit” is 2.
    I feel there would have been scope for an ER for index stats here. The total number of blocks (leaf and branch) for the segment could be stored and used for FFS. It’s probably one of those cases where it has been too baked in now and adjusting the calculation would lead to a lot of tricky plan flips.

    Comment by Andrew Sayer — October 6, 2020 @ 1:28 pm BST Oct 6,2020 | Reply

    • Jonathan,
      Andy,

      I’d like to offer an alternative explanation for the small deviation (“a bit”). mreadtim / sreadtim / mbrc is not multipled exactly with the number of blocks to get the IO cost. I found a more accurate formula to be:
      io_cost = (blocks + 1) * mbrc * mreadtim / sreadtim + 1

      Where do both + 1s come from?

      Internally, the optimizer calculates parts of the equation in different functions (kkesScaleIO and kkesIOScaleFactor). Parts of the formula are encapsulated within those functions to reuse them as building blocks for different calculations. +1s are most probably safeguards against division by zero.

      Best regards,

      Nenad

      Comment by Nenad Noveljic — October 6, 2020 @ 3:17 pm BST Oct 6,2020 | Reply

  2. Andy, Nenad,

    I wasn’t planning to go into too much detail about the exact cost calculation, which is why I didn’t mention the “_table_scan_cost_plus_one” and its application to index fast full scan. (I think it’s in “The Book” somewhere.)

    The kkeScaleIO and kksIOScaleFactor functions are an interesting one, I wonder if they’re the implementation of optimizer_index_caching and optimizer_index_cost_adj.

    One of the details I bypassed inline is that there’s presumably a rounding effect going on – 153 is not an exact multiple of 16 (MBRC), so Oracle rounds up to 160 before considering the timing. If you ran the test through the 10053 trace you’d find:

      Cost: 21.995776  Degree: 1  Card: 9999.000000  Bytes: 109989.000000
      Resc: 21.995776  Resc_io: 21.000000  Resc_cpu: 2489440
    

    My interpretation of the Resc_io is that it’s: . (160 /16) * 10 / 5 + 1 = 21

    Then the “bit” is (2,489,440 / 500,000,000 seconds) * 200 = 0.995776
    CPU Speed = 500 (M ops per second) with 2,489,449 required and 200 single block reads = 1 second.

    Regards
    Jonathan Lewis

    Comment by Jonathan Lewis — October 6, 2020 @ 4:02 pm BST Oct 6,2020 | Reply

    • Jonathan, Andy,

      kkesIOScaleFactor returns mbrc * sreadtim / mreadtim

      kkesScaleIO returns ( blocks + 1 ) / kkesIOScaleFactor + 1

      Best regards,
      Nenad

      Comment by Nenad Noveljic — October 6, 2020 @ 4:43 pm BST Oct 6,2020 | Reply

      • Nenad,

        Interesting. The problem with those calls is that there then has to be something else to account for the difference between the result it gives and the actual results. Are there any indications of integer rounding in that code path. The trace shows 21.000000 + 0.995776, which means the algorithm you’ve observed has to match to 21.000000. But actually it gives: 154/8 + 1 = 19.25 + 1 = 20.25 so we need some rounding.

        Regards
        Jonathan Lewis

        Comment by Jonathan Lewis — October 6, 2020 @ 5:05 pm BST Oct 6,2020 | Reply

        • Jonathan,

          Internally, there are further adjustments to the kkesScaleIO result. For example, a part of the IO sort calculation in 18c looks like this:

          io_sort_cost_per_pass = floor (2 * kkesScaleIO )

          The rounding was done later on in kkeSortCosts.

          I’ll check the workflow for your example, i.e. FFS in 19c.

          Best regards,
          Nenad

          Comment by Nenad Noveljic — October 6, 2020 @ 5:27 pm BST Oct 6,2020

      • Nenad,

        I’ve just run a quick loop test (on 12.2.0.1), hacking the leaf_block count from 144 to 160.

        At 144 the Resc_io is 19.000000
        At 145 the Resc_io changes to 20.000000
        At 153 the Resc_io changes to 21.000000

        This seems to be more consistent with ceiling(leaf_blocks / kkesIOScaleFactor ) + 1

        Update: the same pattern appears with the tablescan costing and the “blocks” statistic:

        At 1728 the Resc_io is 217.000000
        At 1729 the Resc_io changes to 218.000000
        At 1736 the Resc_io changes to 219.000000

        Regards
        Jonathan Lewis

        Comment by Jonathan Lewis — October 6, 2020 @ 5:24 pm BST Oct 6,2020 | Reply

        • Jonathan,

          Yes. kkesScaleIO seems to return an integer value (though, stored as double).

          Best regards,
          Nenad

          Comment by Nenad Noveljic — October 6, 2020 @ 5:50 pm BST Oct 6,2020

        • Jonathan,

          It’s worth noting that kkesScaleIO != Resc_io.

          The kkesScaleIO is adjusted by the caller (table scan, index scan, sort, etc.) to obtain the final IO cost (Resc_io).

          The exact formula for kkesScaleIO is:

          kkesScaleIO = floor ( ( blocks + 1 ) / kkesIOScaleFactor ) + 1 
          

          Floor was missing in my initial equation. Adding 1 to blocks is important for avoiding kkesScaleIO=0 when blocks=0.

          The final IO cost in your example is calculated as Resc_io = kkesScaleIO + 1.

          This calculation fits well in both yours and mine case. I verified everything by tracing the outputs of the functions kkeTbScanIOCost and kkesScaleIO.

          Best regards,

          Nenad

          Comment by Nenad Noveljic — October 8, 2020 @ 1:55 pm BST Oct 8,2020

        • Nenad,

          Thanks for the revelation.

          It’s nice to know exactly what’s actually happening. My description (round up to next multiple of MBRC and divide by MBRC) might have been the concept behind the calculation but the mechanism you’ve described is obviously (with hindsight) a better way to implement it.

          Regards
          Jonathan Lewis

          Comment by Jonathan Lewis — October 9, 2020 @ 12:11 pm BST Oct 9,2020


RSS feed for comments on this post. TrackBack URI

Comments and related questions are welcome.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Powered by WordPress.com.