Here’s a little puzzle, highlighting a “bug that’s not a bug” that was “fixed but not fixed” some time in the 10.2 timeline. (If you want specifics about exactly when the fix became available and what patches might exist, they’re in MOS: Bug 5099019 – DBMS_STATS DOESN’T COUNT LEAF_BLOCKS CORRECTLY.)
Running 19.3.0.0, with the system statistics as shown:
begin
        dbms_stats.set_system_stats('MBRC',16);
        dbms_stats.set_system_stats('MREADTIM',10);
        dbms_stats.set_system_stats('SREADTIM',5);
        dbms_stats.set_system_stats('CPUSPEED',500);
end;
/

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off
For my test run there are no uncommitted transactions, the gathering of table stats with cascade to indexes means all the blocks are clean so there’s no undo activity needed to produce an answer for the final query, and all that the query is going to do is run an index fast full scan to count the number of rows in the table because there’s an index on just (small_vc).
The system stats tell us that Oracle is going to cost the index fast full scan by taking the leaf block count, dividing by 16 (the MBRC) and multiplying by 2 ( mreadtim / sreadtim) and adding a bit for CPU usage. So here are the results from running the query and generating the plan with autotrace:
PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                        153

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |    22   (5)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|    22   (5)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        425  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
Points to notice:
- The cost is (as predicted) ceiling(153 leaf_blocks / 16 MBRC) * 2 + a bit: i.e. 20 + a bit.
- The number of physical blocks read is 1522, not the 153 leaf blocks reported in the index stats.
- There is no recursive SQL, and (as I said) there were no undo / read-consistency issues to worry about.
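As a back-of-envelope check, here's a minimal Python sketch of the cost arithmetic described above (the variable names are mine, not Oracle's):

```python
import math

# System statistics from the demo setup
MBRC, MREADTIM, SREADTIM = 16, 10, 5
leaf_blocks = 153          # leaf_blocks reported in user_indexes

# Optimizer model described above: number of multiblock reads needed to
# cover the leaf blocks, scaled by the multiblock/single-block time ratio.
io_cost = math.ceil(leaf_blocks / MBRC) * (MREADTIM / SREADTIM)
print(io_cost)             # 20.0 -- i.e. "20 + a bit", matching the cost of 22
```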
There is clearly an inconsistency between the size of the index and the (100% gathered) leaf_blocks count that the optimizer is using – a point I made many years ago in “Cost Based Oracle – Fundamentals”. To gather the leaf block count Oracle has looked at the number of leaf blocks that, after cleanout, hold any index entries – and here’s how I prepared this demo to confuse the issue:
rem
rem     Script:         index_ffs_cost.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2012
rem

create table t1
as
with generator as (
        select  --+ materialize
                rownum id
        from    dual
        connect by
                level <= 1e4 -- > comment to avoid wordpress format issue
)
select
        rownum                  id,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid wordpress format issue
;

create index t1_v1 on t1(small_vc) pctfree 80;

delete from t1 where id between 5000 and 95000;

commit;
When I gathered stats on the index I had just deleted 90% of the data from the table – so 90% of the leaf blocks in the index were empty but still in the index structure, and only 10% of the leaf blocks held any current index entries. This is why Oracle reported leaf_blocks = 153 while the index fast full scan had to read through nearly 1530 blocks.
This is one of those contradictory problems – for an index fast full scan you need to know the size of the segment that will be scanned because that’s the number of blocks you will have to examine; but in most cases the number of populated index leaf blocks is the number you need to know about when you’re trying to cost an index range scan. Of course in most cases the nature of Oracle’s implementation of B-tree indexes will mean that the two counts will be within one or two percent of each other. But there are a few extreme cases where you could get an index into a state where the segment size is large and the data set is small and you don’t want the optimizer to think that an index fast full scan will be low-cost.
Oracle produced a mechanism for getting the segment block count captured as the leaf_blocks statistic late in 10.2, but it’s not enabled by default, and it’s not something you can tweak into a query with the opt_param() hint. Fix control 5099019 has the description: “set leaf blocks to the number of blocks in the index extent map”, but it’s not a run-time fix; it’s a fix that has to be in place when you gather the index stats – thus:
alter session set "_fix_control"='5099019:1';

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                cascade          => true,
                estimate_percent => 100,
                method_opt       => 'for all columns size 1'
        );
end;
/

select
        index_name, leaf_blocks
from
        user_indexes
where
        table_name = 'T1'
;

alter system flush buffer_cache;

set autotrace traceonly

select
        count(small_vc)
from    t1
where   small_vc is not null
;

set autotrace off

===================================================

Session altered.

PL/SQL procedure successfully completed.

INDEX_NAME           LEAF_BLOCKS
-------------------- -----------
T1_V1                       1555

1 row selected.

System altered.

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 274579903

-------------------------------------------------------------------------------
| Id  | Operation             | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |       |     1 |    11 |   201   (3)| 00:00:01 |
|   1 |  SORT AGGREGATE       |       |     1 |    11 |            |          |
|*  2 |   INDEX FAST FULL SCAN| T1_V1 |  9999 |   107K|   201   (3)| 00:00:01 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SMALL_VC" IS NOT NULL)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1530  consistent gets
       1522  physical reads
          0  redo size
        558  bytes sent via SQL*Net to client
        439  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Session altered.
As you can see, Oracle has now set leaf_blocks = 1555 – which is actually the number of blocks below the highwater mark in the segment. A further check with the dbms_space and dbms_space_usage packages showed that the number of blocks in the index structure was actually 1,523, with a further 32 blocks that we could infer were space management blocks. Of the 1,523 blocks in the index structure 157 were reported as “FULL” while 1,364 were reported as “FS2”, which possibly ought to mean “available for re-use” (though still in the structure), although this didn’t quite seem to be the case a few years ago.
Although Oracle has supplied a fix to a problem I highlighted in CBO-F, I can understand why it’s not enabled by default, and I don’t think I’d want to take advantage of it in a production system given that it’s a session setting at stats-gathering time. On the few occasions when it’s likely to matter I’d probably add hints to the SQL to stop the optimizer from using the index incorrectly, or do something at a stats-gathering moment with a call to dbms_stats.set_index_stats().
The _table_scan_cost_plus_one parameter also hardcodes additional cost into the FFS, which is why “a bit” is 2.
I feel there would have been scope for an ER for index stats here: the total number of blocks (leaf and branch) for the segment could be stored and used for the FFS. It’s probably one of those cases where the current behaviour is too baked in now, and adjusting the calculation would lead to a lot of tricky plan flips.
Comment by Andrew Sayer — October 6, 2020 @ 1:28 pm BST Oct 6,2020 |
Jonathan,
Andy,
I’d like to offer an alternative explanation for the small deviation (“a bit”). The number of blocks is not simply multiplied by mreadtim / sreadtim / mbrc to get the IO cost. I found a more accurate formula to be:
io_cost = (blocks + 1) / mbrc * mreadtim / sreadtim + 1
Where do both + 1s come from?
Internally, the optimizer calculates parts of the equation in different functions (kkesScaleIO and kkesIOScaleFactor). Parts of the formula are encapsulated within those functions so they can be reused as building blocks for different calculations. The +1s are most probably safeguards against division by zero.
Best regards,
Nenad
Comment by Nenad Noveljic — October 6, 2020 @ 3:17 pm BST Oct 6,2020 |
Andy, Nenad,
I wasn’t planning to go into too much detail about the exact cost calculation, which is why I didn’t mention the “_table_scan_cost_plus_one” and its application to index fast full scan. (I think it’s in “The Book” somewhere.)
The kkesScaleIO and kkesIOScaleFactor functions are interesting ones; I wonder if they’re also the implementation of optimizer_index_caching and optimizer_index_cost_adj.
One of the details I bypassed inline is that there’s presumably a rounding effect going on – 153 is not an exact multiple of 16 (MBRC), so Oracle rounds up to 160 before considering the timing. If you ran the test through the 10053 trace you’d find a Resc_io of 21, and my interpretation of that figure is:

(160 / 16) * 10 / 5 + 1 = 21
Then the “bit” is (2,489,440 / 500,000,000 seconds) * 200 = 0.995776. That is: CPU speed = 500 (millions of operations per second), with 2,489,440 operations required, and 200 single block reads (at sreadtim = 5ms) making up 1 second.
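To make the units explicit, here's a small Python restatement of that CPU arithmetic (a sketch only; the figure 2,489,440 is the trace value quoted above):

```python
CPUSPEED   = 500           # system stat: millions of operations per second
cpu_cycles = 2_489_440     # CPU operations reported in the 10053 trace
SREADTIM   = 5             # ms per single block read, so 200 reads per second

# Convert the CPU time into "equivalent single block reads" (cost units)
cpu_seconds = cpu_cycles / (CPUSPEED * 1_000_000)
cpu_cost    = cpu_seconds * (1000 / SREADTIM)
print(round(cpu_cost, 6))  # 0.995776 -- the "bit" added to the IO cost
```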
Regards
Jonathan Lewis
Comment by Jonathan Lewis — October 6, 2020 @ 4:02 pm BST Oct 6,2020 |
Jonathan, Andy,
kkesIOScaleFactor returns mbrc * sreadtim / mreadtim
kkesScaleIO returns ( blocks + 1 ) / kkesIOScaleFactor + 1
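Taking those two descriptions at face value, a quick Python paraphrase (these are not the real function bodies, just the reported formulas, with my own lower-case names) gives a figure close to, but not exactly, the traced value:

```python
def kkes_io_scale_factor(mbrc, sreadtim, mreadtim):
    # Reported behaviour of kkesIOScaleFactor: mbrc * sreadtim / mreadtim
    return mbrc * sreadtim / mreadtim

def kkes_scale_io(blocks, factor):
    # Reported behaviour of kkesScaleIO: (blocks + 1) / factor + 1
    return (blocks + 1) / factor + 1

factor = kkes_io_scale_factor(16, 5, 10)   # 8.0 for this demo's system stats
print(kkes_scale_io(153, factor))          # 20.25 -- vs the traced 21.000000
```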
Best regards,
Nenad
Comment by Nenad Noveljic — October 6, 2020 @ 4:43 pm BST Oct 6,2020 |
Nenad,
Interesting. The problem with those calls is that there then has to be something else to account for the difference between the result they give and the actual figures. Are there any indications of integer rounding in that code path? The trace shows 21.000000 + 0.995776, which means the algorithm you’ve observed has to produce exactly 21.000000; but as it stands it gives 154/8 + 1 = 19.25 + 1 = 20.25, so we need some rounding.
Regards
Jonathan Lewis
Comment by Jonathan Lewis — October 6, 2020 @ 5:05 pm BST Oct 6,2020 |
Jonathan,
Internally, there are further adjustments to the kkesScaleIO result. For example, a part of the IO sort calculation in 18c looks like this:
io_sort_cost_per_pass = floor(2 * kkesScaleIO)
The rounding was done later on in kkeSortCosts.
I’ll check the workflow for your example, i.e. FFS in 19c.
Best regards,
Nenad
Comment by Nenad Noveljic — October 6, 2020 @ 5:27 pm BST Oct 6,2020
Nenad,
I’ve just run a quick loop test (on 12.2.0.1), hacking the leaf_block count from 144 to 160.
At 144 the Resc_io is 19.000000
At 145 the Resc_io changes to 20.000000
At 153 the Resc_io changes to 21.000000
This seems to be more consistent with ceiling(leaf_blocks / kkesIOScaleFactor) + 1
Update: the same pattern appears with the tablescan costing and the “blocks” statistic:
At 1728 the Resc_io is 217.000000
At 1729 the Resc_io changes to 218.000000
At 1736 the Resc_io changes to 219.000000
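Expressed as Python (my own paraphrase of the model inferred from the loop test, not Oracle's code), the observed step points for the index costing reproduce exactly:

```python
import math

def resc_io(blocks, factor=8):
    # factor = MBRC * sreadtim / mreadtim = 16 * 5 / 10 for this test
    return math.ceil(blocks / factor) + 1

for b in (144, 145, 153):
    print(b, resc_io(b))   # 144 -> 19, 145 -> 20, 153 -> 21
```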
Regards
Jonathan Lewis
Comment by Jonathan Lewis — October 6, 2020 @ 5:24 pm BST Oct 6,2020 |
Jonathan,
Yes. kkesScaleIO seems to return an integer value (though stored as a double).
Best regards,
Nenad
Comment by Nenad Noveljic — October 6, 2020 @ 5:50 pm BST Oct 6,2020
Jonathan,
It’s worth noting that kkesScaleIO != Resc_io.
The kkesScaleIO is adjusted by the caller (table scan, index scan, sort, etc.) to obtain the final IO cost (Resc_io).
The exact formula for kkesScaleIO includes a floor() that was missing in my initial equation. Adding 1 to blocks is important for avoiding kkesScaleIO = 0 when blocks = 0.
The final IO cost in your example is calculated as Resc_io = kkesScaleIO + 1.
This calculation fits well in both your case and mine. I verified everything by tracing the outputs of the functions kkeTbScanIOCost and kkesScaleIO.
Best regards,
Nenad
Comment by Nenad Noveljic — October 8, 2020 @ 1:55 pm BST Oct 8,2020
Nenad,
Thanks for the revelation.
It’s nice to know exactly what’s actually happening. My description (round up to next multiple of MBRC and divide by MBRC) might have been the concept behind the calculation but the mechanism you’ve described is obviously (with hindsight) a better way to implement it.
Regards
Jonathan Lewis
Comment by Jonathan Lewis — October 9, 2020 @ 12:11 pm BST Oct 9,2020