Oracle Scratchpad

March 3, 2023

Empty indexes

Filed under: Indexing,Oracle — Jonathan Lewis @ 3:01 pm GMT Mar 3,2023

Here’s a little quiz that came to light on a database running 11g. It’s one of those anomalies that is likely to get noticed only in fairly extreme cases and it’s about DDL rather than DML so it shouldn’t be happening very often on anyone’s system.

Read through the following script, which simply creates a table, populates it with 4 million rows and creates an index; then answer the question that follows:

rem
rem     Script:         pt_iot_oddity.sql
rem     Author:         Jonathan Lewis
rem     Dated:          March 2023
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             11.2.0.4
rem

create table t1 (
        key1            varchar2(20),
        secondary       varchar2(10),
        other           varchar2(10),
        constraint t1_pk primary key (key1)
)
/

insert into t1( key1, secondary, other)
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        lpad(rownum,20),
        null,
        lpad(rownum,10)
from
        generator v1,
        generator v2
where
        rownum <= 4e6
/

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

create index t1_i1 on t1(secondary) parallel 4;

Take note that the index is a single column B-tree index on a column which is NULL for every single row; and remind yourselves that completely null entries are not recorded in B-tree indexes – if you run the script you will find that the index has no entries: user_indexes.num_rows = 0. So the question is: how many rows does Oracle sort while creating the index?

To help you out, perhaps, here’s the execution plan:

------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------
|   0 | CREATE INDEX STATEMENT   |          |  4000K|    26M|  1183   (7)| 00:00:06 |        |      |            |
|   1 |  PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (ORDER)     | :TQ10001 |  4000K|    26M|            |          |  Q1,01 | P->S | QC (ORDER) |
|   3 |    INDEX BUILD NON UNIQUE| T1_I1    |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     SORT CREATE INDEX    |          |  4000K|    26M|            |          |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE          |          |  4000K|    26M|   816  (10)| 00:00:05 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANGE      | :TQ10000 |  4000K|    26M|   816  (10)| 00:00:05 |  Q1,00 | P->P | RANGE      |
|   7 |        PX BLOCK ITERATOR |          |  4000K|    26M|   816  (10)| 00:00:05 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |  4000K|    26M|   816  (10)| 00:00:05 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------

Note
-----
   - estimated index size: 100M bytes


And here’s another little bit of information (cut-n-paste after creating the index) just in case you didn’t believe me when I said completely null keys don’t appear in Oracle’s B-tree indexes:

SQL> select index_name, num_rows from user_indexes where table_name = 'T1';

INDEX_NAME             NUM_ROWS
-------------------- ----------
T1_PK                   4059833
T1_I1                         0


Special note for Richard Foote: you’re allowed to think about the question, but you’re not allowed to answer it in the comments.

Answer

The starting point for this one is that we know completely null entries don’t get into B-tree indexes, so when, in the plan above, does Oracle discard the nulls? There are three possible places where it might do so:

  • At the table access full
  • At the sort
  • At the index build

I’m not counting the “PX send” as that’s more or less tied into the “table access full”; similarly the “PX receive” is tied into the sort operation.

Ideally, of course, we’d hope that Oracle would discard the nulls as soon as possible – i.e. during the table access full. There are potentially several ways (at least five) that we can check to see what actually happens – see the sketch after the list for an idea of the commands involved:

  • Looking at the SQL Monitor output
  • Looking at the dbms_xplan.display_cursor() output with rowsource execution stats enabled
  • Looking at the v$pq_tqstat information immediately after creating the index.
  • Looking at the output from the 10046 trace
  • Looking at the results of a 10032 (sort) trace
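
For anyone who wants to repeat the checks, here’s a sketch of the sort of commands involved for the last two options. The exact commands I used aren’t shown in this note, and I’ve set the 10032 event at the system level so that it applies to the PX server processes as well as to my own session:

alter system set events '10032 trace name context forever, level 10';

create index t1_i1 on t1(secondary) parallel 4;

alter system set events '10032 trace name context off';

rem     v$pq_tqstat is populated only for the session that ran the
rem     parallel statement, so query it from that session immediately

select
        dfo_number, tq_id, server_type, instance, process, num_rows, bytes
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process
/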

Here’s the execution plan body from the SQL Monitor report in 11.2.0.4:

SQL Plan Monitoring Details (Plan Hash Value=2666861883)
===============================================================================================================================================================================================
| Id |         Operation          |   Name   |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  | Write | Write |  Mem  | Temp  | Activity |       Activity Detail       |
|    |                            |          | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |         (# samples)         |
===============================================================================================================================================================================================
|  0 | CREATE INDEX STATEMENT     |          |         |      |         1 |     +4 |     9 |        4 |      |       |       |       |       |       |          |                             |
|  1 |   PX COORDINATOR           |          |         |      |         4 |     +1 |     9 |        4 |      |       |       |       |       |       |    18.18 | Cpu (2)                     |
|  2 |    PX SEND QC (ORDER)      | :TQ10001 |      4M |      |         1 |     +4 |     4 |        4 |      |       |       |       |       |       |          |                             |
|  3 |     INDEX BUILD NON UNIQUE | T1_I1    |         |      |         1 |     +4 |     4 |        4 |      |       |       |       |       |       |          |                             |
|  4 |      SORT CREATE INDEX     |          |      4M |      |         2 |     +3 |     4 |       4M |  922 |  48MB |   243 |  48MB |   95M |   52M |    27.27 | Cpu (3)                     |
|  5 |       PX RECEIVE           |          |      4M |  816 |         3 |     +2 |     4 |       4M |      |       |       |       |       |       |     9.09 | Cpu (1)                     |
|  6 |        PX SEND RANGE       | :TQ10000 |      4M |  816 |         2 |     +2 |     4 |       4M |      |       |       |       |       |       |    27.27 | Cpu (3)                     |
|  7 |         PX BLOCK ITERATOR  |          |      4M |  816 |         1 |     +3 |     8 |       4M |      |       |       |       |       |       |          |                             |
|  8 |          TABLE ACCESS FULL | T1       |      4M |  816 |         3 |     +1 |    56 |       4M |  247 | 165MB |       |       |       |       |    18.18 | Cpu (1)                     |
|    |                            |          |         |      |           |        |       |          |      |       |       |       |       |       |          | db file sequential read (1) |
===============================================================================================================================================================================================


Under “Rows (actual)” we can see operation 4 (Sort create index) supplying 4M rows to operation 3 (Index Build). If that’s not entirely convincing, here’s the output from v$pq_tqstat:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12        528    100.00     100.00          5          1           0
                      Producer               1 P004               1007528   14126554     25.19      25.19         10          1           0
                                             1 P005                930555   13047324     23.26      23.26          8          1           0
                                             1 P006               1008101   14134600     25.20      25.20          7          0           0
                                             1 P007               1053828   14775738     26.35      26.35          5          0           0
                      Consumer               1 P000               1362213   19099470     34.06      34.06          3          0           0
                                             1 P001               1155741   16204542     28.89      28.89          3          0           0
                                             1 P002                695813    9755950     17.40      17.40          3          0           0
                                             1 P003                786233   11023726     19.66      19.66          3          1           0

                    1 Producer               1 P000                     1        298     25.00      25.00          0          0           0
                                             1 P001                     1        298     25.00      25.00          0          0           0
                                             1 P002                     1        298     25.00      25.00          0          0           0
                                             1 P003                     1        298     25.00      25.00          0          0           0
                      Consumer               1 QC                       4       1192    100.00     100.00          2          0           0


Here we see the first set of producer (tablescan) PX processes passing 4 million rows to the first set of consumer processes, which then turn around to become producers and pass a single row each to the query co-ordinator. That doesn’t really tell us anything about when the consumers discarded the 4M rows, of course, but it does confirm that 4M rows passed at least some way up the tree; so here’s an extract from each of the trace files for the consumers-turned-producers showing how many rows each sorted:

 grep -A+4 "\- Sort Statistics " test_p00[0-9]*165*.trc

test_p000_16583.trc:---- Sort Statistics ------------------------------
test_p000_16583.trc-Initial runs                              2
test_p000_16583.trc-Number of merges                          1
test_p000_16583.trc-Input records                             1362213
test_p000_16583.trc-Output records                            1362213
--
test_p001_16587.trc:---- Sort Statistics ------------------------------
test_p001_16587.trc-Initial runs                              2
test_p001_16587.trc-Number of merges                          1
test_p001_16587.trc-Input records                             1155741
test_p001_16587.trc-Output records                            1155741
--
test_p002_16591.trc:---- Sort Statistics ------------------------------
test_p002_16591.trc-Input records                             695813
test_p002_16591.trc-Output records                            695813
test_p002_16591.trc-Total number of comparisons performed     4230527
test_p002_16591.trc-  Comparisons performed by in-memory sort 4230527
--
test_p003_16595.trc:---- Sort Statistics ------------------------------
test_p003_16595.trc-Input records                             786233
test_p003_16595.trc-Output records                            786233
test_p003_16595.trc-Total number of comparisons performed     4778408
test_p003_16595.trc-  Comparisons performed by in-memory sort 4778408

Compare the Pnnn identifiers with the trace file names, and compare the count of consumed rows with the sort input/output record counts: the PX processes did sort the rows they received before going into the “index build” routine and discarding the null entries.

This is just one scenario of several – it’s a simple heap table, with a parallel create index of a non-unique index in an old version of Oracle. Would it be the same for a unique index, or for an index on a partitioned heap table (especially if the number of partitions matched the degree of parallelism)? Would the type of partitioning matter, and how would things change (if at all) for an index-organized table – and what about doing all the combinations of tests on a version of Oracle that is still in support?

You may have wondered, in passing, why the script for this example was called pt_iot_oddity.sql. It’s because someone asked me if I could think of a reason why it took 15 minutes to create an empty index on a partitioned index-organized table when running parallel 8. Admittedly it was a table with 650 million rows, but the time was almost all CPU, not I/O.

Here’s one variation on the test: upgrading to 19c (19.11.0.0) and creating a unique index – I’m just going to show the v$pq_tqstat information and one of the 10032 trace file extracts:

DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Ranger                 1 QC                      12       9820    100.00     100.00          2          0           0
                      Producer               1 P004               1027485   14399730     25.67      25.67         19          4           0
                                             1 P005                954287   13374652     23.84      23.84         19          4           0
                                             1 P006                993201   13919984     24.82      24.82         19          4           0
                                             1 P007               1027428   14399502     25.67      25.67         15          0           0
                      Consumer               1 P000                     0         96      0.00       0.00        807        804           0
                                             1 P001               4000000   56083760    100.00     100.00        808        803           0
                                             1 P002                     0         96      0.00       0.00      19740      19737           0
                                             1 P003                     0         96      0.00       0.00      19740      19737           0

                    1 Producer               1 P000                     1        370     25.00      25.00         26         21           0
                                             1 P001                     1        370     25.00      25.00        210         94           0
                                             1 P002                     1        370     25.00      25.00         26         21           0
                                             1 P003                     1        370     25.00      25.00         28         23           0
                      Consumer               1 QC                       4       1480    100.00     100.00          5        842           0


---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             4000000
Output records                            4000000

One thing that didn’t change – the consumer PX processes sorted the data they received; one thing that did change – only one of the consumers received data. The difference is in the change from non-unique to unique, not from 11g to 19c.
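
For completeness, the unique index variant of the test simply replaces the original index creation – presumably (after dropping the original index, or re-running the whole script) something like:

drop index t1_i1;

create unique index t1_i1 on t1(secondary) parallel 4;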

A non-unique index is “made unique” by the addition of the table’s rowid to the key value so that – amongst other things – if you have multiple rows with the same key value in a table block they are adjacent in the index. The rowid for a row pointed to by a unique index is “carried” as extra data and is not treated as if it were part of the key.

For parallel index creation this means that, for a unique index where every key is actually NULL, Oracle has to “distribute” all the key values to the same PX process. For a non-unique index Oracle is trying to distribute the combination (key, rowid) evenly between the PX processes – so each PX process should get a different set of null keys corresponding to non-overlapping rowid ranges from the table.

February 10, 2023

Result Cache

Filed under: Infrastructure,latches,Oracle — Jonathan Lewis @ 12:02 pm GMT Feb 10,2023

A historic complaint about the result cache was that it did not scale. Although this complaint was often the consequence of the mechanism being used inappropriately, there was an underlying issue that imposed a limit on how scalable (in terms of concurrency) the cache could be: it was covered by a single child latch.

A question came up recently about whether anything had changed in recent versions of Oracle, so I dusted down a script I had written in the days of 11gR1 and brought it up to date for container databases. All it does is report (from the CDB Root) all the latches, latch parents and latch children for any latch with a name like ‘Result Cache%’:

rem     Script:         result_cache_latches.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2023 (Feb 2008)
rem
rem     Last tested 
rem             21.3.0.0
rem             19.11.0.0
rem
rem     Notes:
rem     Has to be run on the CDB root, by user with select
rem     privileges on the v$ latch views.
rem

set linesize 144
set pagesize 60
set trimspool on
set tab off

spool result_cache_latches.lst

column name format a32

clear breaks
break on name skip 1 on con_id

prompt  ================
prompt  v$latch_children
prompt  ================

select  name, con_id, count(*) 
from    v$latch_children 
where   name like 'Result Cache%' 
group by name, con_id 
order by 1,2;

break on name skip 1

prompt  ==============
prompt  v$latch_parent
prompt  ==============

select  name, con_id, count(*) 
from    v$latch_parent 
where   name like 'Result Cache%' 
group by name,con_id 
order by 1,2;

prompt  =======
prompt  v$latch
prompt  =======

select  name, con_id, count(*) 
from    v$latch 
where   name like 'Result Cache%' 
group by name, con_id 
order by 1,2;

spool off

My results for 19.11.0.0 seem to suggest that there has been no change (as far as latch children go):

================
v$latch_children
================

no rows selected

==============
v$latch_parent
==============

NAME                                 CON_ID   COUNT(*)
-------------------------------- ---------- ----------
Result Cache: MB Latch                    0          1

Result Cache: RC Latch                    0          1
                                          2          1
                                          3          1

Result Cache: SO Latch                    0          1


=======
v$latch
=======

NAME                                 CON_ID   COUNT(*)
-------------------------------- ---------- ----------
Result Cache: MB Latch                    0          1

Result Cache: RC Latch                    0          1
                                          2          1
                                          3          1

Result Cache: SO Latch                    0          1

You will note, however, that I do have three latches called “Result Cache: RC Latch” – so the wrong query for latch information might fool you into thinking that there were multiple latch children unless you’ve included the con_id in your query. There are three latches, but each one is dedicated to a different PDB.
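
For example, a quick aggregate by name alone – a hypothetical query, just to illustrate the point – would report a count of 3 against the RC latch, which might easily be misread as three child latches:

select  name, count(*)
from    v$latch
where   name like 'Result Cache%'
group by
        name
order by
        name
;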

Moving on to 21.3.0.0, though, things change:

================
v$latch_children
================

NAME                                 CON_ID   COUNT(*)
-------------------------------- ---------- ----------
Result Cache: RC Latch                    0          4
Result Cache: RC Latch                    2          4
Result Cache: RC Latch                    3          4

==============
v$latch_parent
==============

NAME                                 CON_ID   COUNT(*)
-------------------------------- ---------- ----------
Result Cache: Heap Queue Latch            0          1

Result Cache: MB Latch                    0          1

Result Cache: RC Latch                    0          1
                                          2          1
                                          3          1

Result Cache: SO Latch                    0          1

Result Cache: Set                         1          1
                                          2          1
                                          3          1

Result Cache; Flush View Latch            1          1
                                          2          1
                                          3          1


12 rows selected.

=======
v$latch
=======

NAME                                 CON_ID   COUNT(*)
-------------------------------- ---------- ----------
Result Cache: Heap Queue Latch            0          1

Result Cache: MB Latch                    0          1

Result Cache: RC Latch                    0          1
                                          2          1
                                          3          1

Result Cache: SO Latch                    0          1

Result Cache: Set                         1          1
                                          2          1
                                          3          1

Result Cache; Flush View Latch            1          1
                                          2          1
                                          3          1


12 rows selected.

I have 12 latch children for the “Result Cache: RC latch” – four for each of the PDBs.

The obvious guess about the number of child latches was that it was dictated by the instance cpu_count, and this seemed to be confirmed when I did a little hacking of the cpu_count (and _disable_cpu_check). It seems a little surprising that it’s taken Oracle so long to make this change (note particularly the original date on my test script) – but splitting the cache into pieces that are each dedicated to a single CPU seems like a good idea.
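
The hacking was nothing sophisticated – something along the following lines (run as SYSDBA on a sandbox instance; the value shown here is made up), followed by an instance restart and a re-run of the latch script:

alter system set "_disable_cpu_check" = true scope = spfile;
alter system set cpu_count = 8 scope = spfile;

shutdown immediate
startup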

Footnote

You may have noticed the cautious way in which I described my results for 19c: “seems to suggest”. There are some latches in Oracle where the number of child latches is dictated by values like cpu_count / 16, and I didn’t feel like hacking the number of CPUs on my laptop up to 17 (or more) to see if that gave me child latches on the RC latch. There are also several patches higher than 19.11 available for 19c, so maybe the 21.3 change has already appeared in later 19c releases.

So this is left as an exercise for anyone with a very new version of 19c, or with a large number of CPUs, or both: feel free to run the script and see if you get any different results; and if you do please report back in the comments.

February 9, 2023

Global Rowids

Filed under: Indexing,Infrastructure,Oracle — Jonathan Lewis @ 4:39 pm GMT Feb 9,2023

This note is just a little background for an impending article on the costs and effects of exchanging partitions, splitting partitions, merging partitions – or any other DDL on partitions except dropping them – in the light of “deferred global index maintenance”.

Summary Information: while a “typical” rowid is stored in a B-tree index using 6 bytes (relative file number, block within file, row within block), a rowid for a partitioned table is stored in a global, or globally partitioned, index as 10 bytes (table partition data object id, relative file number, block within file, row within block).

You may choose to skip the rest of the article which simply demonstrates this fact, but if you do read it then it may give you a better intuitive understanding of the consequences of partition maintenance commands.

I’m going to start with a simple script to create a range partitioned table with a local and a global index.

rem
rem     Script:         global_indexes.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2023
rem
rem     Last tested 
rem             19.11.0.0
rem 

create table pt_range (
        id      not null,
        grp,
        small_vc,
        padding
)
nologging
partition by range(id) (
        partition p0200 values less than (200),
        partition p0400 values less than (400),
        partition p0600 values less than (600),
        partition p0800 values less than (800),
        partition p0100 values less than (1000)
)
as
select
        rownum - 1                      id,
        mod(rownum,50)                  grp,
        to_char(trunc(rownum/20))       small_vc,
        rpad('x',100)                   padding
from
        dual
connect by
        level <= 1e3
;

alter table pt_range 
        add constraint pt_pk primary key(id)
        using index (create unique index pt_pk on pt_range(id) local)
/
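
The script as printed doesn’t show the creation of the global index that the rest of this note examines but, given the object listing and queries that follow, it’s a (non-unique) index called pt_i1 on the grp column – presumably something like:

create index pt_i1 on pt_range(grp);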


Then I’m going to run a few queries against the data dictionary, and I’ll show you the results after each query. First a list of all the objects I’ve created in a previously empty schema. (I should point out that I’m running from SQL*Plus and I haven’t shown you all the column declarations in my login.sql.)

break on object_name skip 1

select
        object_name, object_type, 
        object_id, to_char(object_id,'XXXXXXXX') hex_id, 
        subobject_name
from
        user_objects 
order by
         1,2,3
;

OBJECT_NAME          OBJECT_TYPE              OBJECT_ID HEX_ID    SUBOBJECT_NAME
-------------------- ----------------------- ---------- --------- ----------------------
PT_I1                INDEX                       169810     29752

PT_PK                INDEX                       169804     2974C
                     INDEX PARTITION             169805     2974D P0200
                     INDEX PARTITION             169806     2974E P0400
                     INDEX PARTITION             169807     2974F P0600
                     INDEX PARTITION             169808     29750 P0800
                     INDEX PARTITION             169809     29751 P0100

PT_RANGE             TABLE                       169798     29746
                     TABLE PARTITION             169799     29747 P0200
                     TABLE PARTITION             169800     29748 P0400
                     TABLE PARTITION             169801     29749 P0600
                     TABLE PARTITION             169802     2974A P0800
                     TABLE PARTITION             169803     2974B P0100

I really ought to be reporting the data_object_id since that’s the column that Oracle generally uses for “physical” navigation: when you move an index, for example, its object_id stays the same but it gets a new data_object_id. Since the objects in my case are all brand new the data_object_id will match the object_id (though the INDEX and TABLE won’t have physical segments so won’t have a data_object_id at all). I’ve shown you the hexadecimal format of the (data_)object_id because that’s how the values will appear in the dump files that I’ll be producing later.
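
The query that follows uses a substitution variable &m_global_id which isn’t set anywhere in the text; presumably it was captured from the data dictionary along the following lines (for a freshly created index the object_id and data_object_id will be the same):

column object_id new_value m_global_id

select  object_id
from    user_objects
where   object_name = 'PT_I1'
and     object_type = 'INDEX'
;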

column file_no new_value m_file_no
column block_no new_value m_block_no

select
        dbms_rowid.rowid_object(lbid)           object_no,
        dbms_rowid.rowid_relative_fno(lbid)     file_no,
        dbms_rowid.rowid_block_number(lbid)     block_no,
        ct
from    (
        select
                lbid, count(*) ct
        from
                (
                select
                        /*+ index_ffs(pt_range(grp)) */
                        sys_op_lbid(&m_global_id, 'L', pt_range.rowid) lbid
                from
                        pt_range
                where
                        grp >= 0
                )
        group by
                lbid
        )
/


 OBJECT_NO    FILE_NO   BLOCK_NO         CT
---------- ---------- ---------- ----------
    169810         36       2310        398
    169810         36       2311        202
    169810         36       2309        400


I’ve run a messy little query using the sys_op_lbid() function to list all the index leaf blocks in the global index with their (tablespace-relative) file and block number (and number of index entries in the leaf block). I’m just going to pick the last reported leaf block and dump it:

alter system flush buffer_cache;
alter system dump datafile &m_file_no block &m_block_no;
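
If you’re not sure which trace file the dump will end up in, one option – not part of the original note – is to ask the session before doing the dump:

select  value
from    v$diag_info
where   name = 'Default Trace File'
;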

If I examine the resulting trace file the bulk of the block will be made up of an ordered list of the index entries in the block.

row#0[8017] flag: -------, lock: 0, len=15
col 0; len 1; (1):  80
col 1; len 10; (10):  00 02 97 47 09 00 04 04 00 31

row#1[8002] flag: -------, lock: 0, len=15
col 0; len 1; (1):  80
col 1; len 10; (10):  00 02 97 47 09 00 04 05 00 24

row#2[7987] flag: -------, lock: 0, len=15
col 0; len 1; (1):  80
col 1; len 10; (10):  00 02 97 47 09 00 04 06 00 17

row#3[7972] flag: -------, lock: 0, len=15
col 0; len 1; (1):  80
col 1; len 10; (10):  00 02 97 47 09 00 04 07 00 0b


You’ll notice two things about this tiny extract:

  • First it appears to be a two-column index; that’s because Oracle handles a non-unique index by treating the table’s rowid as if it were an extra column in the index definition, so the “last column” of a non-unique index entry is the rowid of the row that that index value is pointing to.
  • Secondly that the col 1 values (the rowids) are 10 bytes rather than the “normal” 6.

If you review the code that generated the table and its global index you’ll see that the grp column should have 20 rows for each distinct value, so I’m going to pick out the 20 consecutive rows for one key value from this index leaf block, stripping out all but the rowid entries in the order they appear in the index:

col 0; len 2; (2):  c1 14

col 1; len 10; (10):   00 02 97 47    09 00 04 04 00 12
col 1; len 10; (10):   00 02 97 47    09 00 04 05 00 05
col 1; len 10; (10):   00 02 97 47    09 00 04 05 00 37
col 1; len 10; (10):   00 02 97 47    09 00 04 06 00 2a

col 1; len 10; (10):   00 02 97 48    09 00 04 84 00 12
col 1; len 10; (10):   00 02 97 48    09 00 04 85 00 06
col 1; len 10; (10):   00 02 97 48    09 00 04 85 00 38
col 1; len 10; (10):   00 02 97 48    09 00 04 86 00 2c

col 1; len 10; (10):   00 02 97 49    09 00 05 04 00 12
col 1; len 10; (10):   00 02 97 49    09 00 05 05 00 06
col 1; len 10; (10):   00 02 97 49    09 00 05 05 00 38
col 1; len 10; (10):   00 02 97 49    09 00 05 06 00 2c

col 1; len 10; (10):   00 02 97 4a    09 00 05 84 00 12
col 1; len 10; (10):   00 02 97 4a    09 00 05 85 00 06
col 1; len 10; (10):   00 02 97 4a    09 00 05 85 00 38
col 1; len 10; (10):   00 02 97 4a    09 00 05 86 00 2c

col 1; len 10; (10):   00 02 97 4b    09 00 06 04 00 12
col 1; len 10; (10):   00 02 97 4b    09 00 06 05 00 06
col 1; len 10; (10):   00 02 97 4b    09 00 06 05 00 38
col 1; len 10; (10):   00 02 97 4b    09 00 06 06 00 2c

End dump data blocks tsn: 6 file#: 36 minblk 2309 maxblk 2309

I’ve split the rowids into 5 ranges (I had 5 table partitions) of 4 rowids each. I’ve also split each rowid by putting a little extra space after the first 4 bytes. Take a look at those groups of 4 bytes and compare the values there with the hex_id values I’ve listed for the TABLE partitions above (the first one is highlighted to make it easier).

As I claimed at the start of this note, the first 4 bytes of the rowid are the (data_)object_id of the table partition.
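
As a quick sanity check you can convert the leading four bytes of any of those rowids back to decimal and compare the result with the object listing above – for example (not part of the original note):

select  to_number('00029747','XXXXXXXX') data_object_id
from    dual
;

The result is 169,799 – matching the hex_id 29747 reported for table partition P0200 in the object listing above.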

Consequences

For a global index the index entries for a given key value are likely to be grouped fairly well within each leaf block. The arrangement isn’t perfect, though; remember that index entries are piled on a “heap” as they arrive at a leaf block and they only appear to be in order in the dump because the leaf block “row directory” is maintained in real time to report them in order. Nevertheless, the rowids for a given key in a given table partition will all be in the same leaf block (or small set of adjacent blocks). It’s worth thinking about this when you start experimenting with the impact of partition maintenance.

For example – if you create a new table for exchange its data_object_id will be higher than any current data_object_id so when you exchange an existing partition with the new table you will (eventually) be deleting all the rowids at one part of each key’s range and inserting a batch of rowids at the “high end” of each key’s range. The overall workload due to that pattern of activity might vary dramatically with different distributions of data – and that might make you review the data you’re using in your tests.

February 2, 2023

Upgrade catalogue

Filed under: Oracle — Jonathan Lewis @ 12:32 pm GMT Feb 2,2023

This is a list of articles I’ve written that pick up some details (usually problems or bugs) when upgrading. Each entry has a date stamp and a short note of the contents. The articles are generally listed most-recent first, but the catalogue is in two sections: Benefits and Side-effects, and Bugs and Limitations.

Benefits and Side-effects

  • Window Sorts (Nov 2022): A possible performance bug that you might never notice in large window sorts disappears in 21c
  • Incomplete hinting (July 2022): Hints may “stop working” on an upgrade as a side effect of new transformations
  • Index statistics (April 2022): Upgrading to 19c, the code to gather index stats with auto_sample_size can use the “approximate NDV” mechanism on a 100% sample – better stats, but may take more time.
  • Optimizer tweak (July 2021): A little cunning appearing in 19c that eliminates redundant function calls, and gets better cardinality estimates: but that could make execution plans change.
  • Descending Index tweak (Dec 2020): From 18c to 19c – an enhancement/correction to the optimizer code allows it to use descending indexes unhinted in more cases where they are appropriate.
  • Name lengths (Dec 2020): 19c allows all sorts of object names to be 128 characters instead of 30. You may need to adjust some of your code (formatting, or log table definitions) to allow for this.
  • Direct I/O (Oct 2020): Upgrading from 11g to 12c and above – serial “insert as select” is enhanced to do direct path reads
  • Union All Bloom (Sept 2020): Upgrade from 12.2 to 18c: the optimizer can now push a Bloom filter inside a Union ALL view.
    • Serial Bloom Filter (Sep 2020): This may have appeared in 12c, but the note is just an explanation of why a Bloom filter can improve performance even within a single process.
  • Min/Max cost error (July 2020): A note describing a bug and workarounds in min/max index scans that means they won’t always be picked when they should be. Not fixed before 19.11
    • Min/Max index scan (May 2018): Enhanced in 12.2 to be usable in more cases than previously, may result in execution plans changing significantly.
  • Recursive With (July 2020): A significant change appearing in 12.2 that increased the cost and cardinality estimates for a recursive with query block – resulting in changes in execution plans.
  • _cursor_obsolete_threshold (Oct 2019): Upgrades to 12.2 and above increase this parameter enormously. It might cause excess CPU usage on badly coded applications.
  • no change update (Sept 2019): The upgrade to 12.2 attempted to reduce redundant redo generation, but could increase it in bad cases; a further enhancement in 19.11 addressed this.
  • New delete pattern (June 2014): Upgrading from 11g you will find that Oracle can delete through an index fast full scan. This is not always a good idea.

Bugs and Limitations

  • Upgrade Surprise (March 2022): Upgrading from 12c to 19c the optimizer will stop unnesting some subqueries based on expressions.
  • DCL (lateral decorrelation) (April 2021): You can use lateral views in 12c and above, but Oracle can “decorrelate” them and sometimes gets the wrong results. (On the plus side, subqueries can have correlation predicates more than one step down).
  • Pivot cosmetics (Feb 2021): Upgrading from 11g – some reports using pivot may need cosmetic adjustments as a side effect of the legality of longer column names.
  • Aggregate Case Bug (Aug 2020): A bug introduced in 12.2, fixed in 21c, but not reported fixed by 19.14, though backports do exist.
  • Join Elimination bug (Feb 2020): Upgrading to 12.2, a plan using join elimination could produce wrong results. Fixed in 19.11, but still present in some earlier 19c versions.
  • Char() problems (Feb 2020): Various wrong results or performance problems with char() – i.e. fixed width – types introduced in 12.2 with possible fixes in later versions.
  • Dictionary cache stress (April 2019): A bug introduced in the upgrade from 12.2 to 18c – fixed by 19.3 – does excess gets on the dictionary cache for “describe” and related API calls.
  • Importing Statistics (April 2018): If you’re still using exp/imp after upgrading to 12c you should make the change or you may lose hybrid and top-frequency histograms on the import.

February 1, 2023

Exchange Partition

Filed under: Indexing,Infrastructure,Oracle,Partitioning,unfinished — Jonathan Lewis @ 3:04 pm GMT Feb 1,2023

A question appeared recently in a comment on a blog note I wrote about the new (in 12c) deferred global index maintenance that allowed the execution of a drop partition to return very quickly. The question was in two parts:

  • Why is an exchange partition so slow by comparison?
  • Is there any way to make it quick?

It occurred to me that it might be possible to do something for some cases by taking advantage of the partial indexing feature that also appeared in 12c. It wouldn’t eliminate the need for index maintenance, of course, and you clearly couldn’t avoid the insert maintenance for the incoming partition completely … but you might be able to ensure that it happens only at a time you find convenient.
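
For what it’s worth, here’s a minimal sketch of the 12c partial indexing syntax I have in mind – the table and index names are made up and I haven’t yet tested the idea in the context of exchange partition:

create table pt_demo (
        id      number,
        grp     number,
        padding varchar2(100)
)
partition by range (id) (
        partition p1 values less than (100) indexing on,
        partition p2 values less than (200) indexing off
)
/

create index pt_demo_i1 on pt_demo(grp) indexing partial;

rem     when it's convenient, switch the partition to "indexing on" so that
rem     the index maintenance for its rows happens at a time of your choosing

alter table pt_demo modify partition p2 indexing on;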

There’s a little more detail in my reply to the comment, but this note is a place holder for the article I’ll write if I ever find time to test the idea.

Lob Space redux

Filed under: Infrastructure,LOBs,Oracle,unfinished — Jonathan Lewis @ 1:56 pm GMT Feb 1,2023

At present this is just a place holder to remind me to finish commenting on (and correcting) a mistake I made when I wrote a note about the number of bytes of data you could get into an “enable storage in row” LOB before it had to be stored out of row.

In a much earlier article discussing multi-byte, variable length character sets I showed that CLOBs stored in-row are likely to take far more space than the equivalent varchar2() columns, and that securefile CLOBs would use 6 bytes less than Basicfile CLOBs. These claims are (or were) true.

In another article I then discussed the number of bytes at which a LOB that was defined as “enable storage in row” would go “out of row” and showed some results that appeared to demonstrate that the length varied with the version of Oracle. This was wrong. In my testing I had not dumped any datablocks to check their actual content; instead I had relied on the (undocumented) sys_op_opnsize() function and been fooled by the results.

This note is intended to correct my error. I’ll start with a test on a BLOB, enabling storage in row, and create some data that will be close to the breakpoint for the value going out of row:

rem
rem     Script:         blob_in_row.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2023
rem
rem     Last tested 
rem             19.11.0.0
rem             11.2.0.4
rem 

create table t1b (
        id      number(3,0),
        pad     varchar2(2000),
        b1      blob
)
lob(b1) store as
        basicfile
        basblob (
                enable storage in row
        )
/

create table t1s (
        id      number(3,0),
        padding varchar2(2000),
        b1      blob
)
lob(b1) store as
        securefile
        secblob (
                enable storage in row
        )
/

alter table t1b add constraint t1b_pk primary key (id);
alter table t1s add constraint t1s_pk primary key (id);

declare
        raw_source      raw(4200);
begin
        for i in 1..15 loop
                raw_source := rpad('E',2 * 3960 + 2*i,'E');
                insert into t1b values(i,rpad(' ',1500),raw_source);
                insert into t1s values(i,rpad(' ',1500),raw_source);
                commit;
        end loop;
end;
/

select
        id, 
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        sys_op_opnsize(b1),
        dbms_lob.getlength(b1)
from
        t1b
order by
        id
/

select
        id, 
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
        dbms_rowid.rowid_block_number(rowid)    block_no,
        sys_op_opnsize(b1),
        dbms_lob.getlength(b1)
from
        t1s
order by
        id
/

I’ve created two tables, one to hold a basicfile blob and one to hold a securefile blob, then I’ve run a little PL/SQL loop to insert rows into each table, where the first row holds a blob of 3,961 bytes and each subsequent row increases the length of the blob by 1. I’ve started with a string of ‘E’s twice the required length and allowed Oracle to do an implicit conversion (hextoraw) to bytes.

To avoid anomalies and hassle from chained rows etc. I’ve included a column of varchar2(1500) so that there’s no question of Oracle being able to squeeze partial rows into blocks.

After I’ve done this I’ve executed a pair of queries that report, for each row, the block it starts in, its apparent length according to the sys_op_opnsize() function, and its length according to the dbms_lob.getlength() function.

Here are the results under 11.2.0.4, first the basicfile table, then the securefile table:

11.2.0.4

        ID REL_FILE_NO   BLOCK_NO SYS_OP_OPNSIZE(B1) DBMS_LOB.GETLENGTH(B1)
---------- ----------- ---------- ------------------ ----------------------
         1           5        199                 86                   3961
         2           5        195                 86                   3962
         3           5        196                 86                   3963
         4           5        197                 86                   3964
         5           5        197                106                   3965
         6           5        198                106                   3966
         7           5        198                106                   3967
         8           5        198                106                   3968
         9           5        198                106                   3969
        10           5      22560                106                   3970
        11           5      22560                106                   3971
        12           5      22560                106                   3972
        13           5      22560                106                   3973
        14           5      22564                106                   3974
        15           5      22564                106                   3975

        ID REL_FILE_NO   BLOCK_NO SYS_OP_OPNSIZE(B1) DBMS_LOB.GETLENGTH(B1)
---------- ----------- ---------- ------------------ ----------------------
         1           5        231                 86                   3961
         2           5        227                 86                   3962
         3           5        228                 86                   3963
         4           5        229                 86                   3964
         5           5        230                 86                   3965
         6           5      22536                 86                   3966
         7           5      22540                 86                   3967
         8           5      22541                 86                   3968
         9           5      22541                 86                   3969
        10           5      22542                 86                   3970
        11           5      22542                 86                   3971
        12           5      22542                 86                   3972
        13           5      22542                 86                   3973
        14           5      22543                 86                   3974
        15           5      22543                 86                   3975

You can see that in both cases the sys_op_opnsize() call reports only a few dozen bytes even though we know the length of the BLOB is always more than 3,960 bytes. My original mistake was to assume that this very small number appeared when the BLOB went out of row and was reporting the length of the LOB locator that had been left behind in the row.

This time I’m going to dump blocks; and a good target block to dump is the first block in each result set that holds two rows: block (5,197) for the basicfile BLOB, and block (5,22541) for the securefile BLOB. (Interestingly the basicfile BLOB shows the sys_op_opnsize() value change in the second of the two rows in (5,197) – is that a coincidence?)
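
The dump commands aren’t shown here but, as with the earlier rowid investigation, they would have been along the lines of the following, repeated for each file and block number of interest:

alter system flush buffer_cache;
alter system dump datafile 5 block 197;
alter system dump datafile 5 block 22541;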

Basicfile block dump (extract)

tab 0, row 0, @0xa10
tl: 5512 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 05
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [4000]
 00 54 00 01 01 0c 00 00 00 01 00 00 00 01 00 00 10 de 99 8d 0f 8c 09 00 00
 00 00 00 0f 7c 00 00 00 00 00 01 ee ee ee ee ee ee ee ee ee ee ee ee ee ee
...
LOB
Locator:
  Length:        84(4000)
  Version:        1
  Byte Length:    1
  LobID: 00.00.00.01.00.00.10.de.99.8d
  Flags[ 0x01 0x0c 0x00 0x00 ]:
    Type: BLOB
    Storage: BasicFile
    Enable Storage in Row
    Characterset Format: IMPLICIT
    Partitioned Table: No
    Options: ReadWrite
  Inode:
    Size:     3980
    Flag:     0x09 [ Valid DataInRow ]
    Future:   0x00 (should be '0x00')
    Blocks:   0
    Bytes:    3964
    Version:  00000.0000000001
    Inline data[3964]

...

tab 0, row 1, @0x402
tl: 1550 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 06
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 ...
col  2: [40]
 00 54 00 01 01 0c 00 00 00 01 00 00 00 01 00 00 10 de 99 8f 00 14 05 00 00
 00 00 00 0f 7d 00 00 00 00 00 02 01 40 00 d7
LOB
Locator:
  Length:        84(40)
  Version:        1
  Byte Length:    1
  LobID: 00.00.00.01.00.00.10.de.99.8f
  Flags[ 0x01 0x0c 0x00 0x00 ]:
    Type: BLOB
    Storage: BasicFile
    Enable Storage in Row
    Characterset Format: IMPLICIT
    Partitioned Table: No
    Options: ReadWrite
  Inode:
    Size:     20
    Flag:     0x05 [ Valid InodeInRow(ESIR) ]
    Future:   0x00 (should be '0x00')
    Blocks:   0
    Bytes:    3965
    Version:  00000.0000000002
    DBA Array[1]:
      0x014000d7
end_of_block_dump


In the first row of the dump you can see that the BLOB column (“col 2: [4000]”) is reporting 4,000 bytes, although the Inode section tells us that the data content is only 3,964 of those bytes (“Bytes: 3964”) and its flag tells us we have “Valid DataInRow”.

For the second row we can see that the BLOB column holds only 40 bytes (“col 2: [40]”), the flag tells us we have a “Valid InodeInRow(ESIR)”, and possibly the “Size: 20” reported for the Inode explains why the sys_op_opnsize() value changed from 86 to 106 at this point. You’ll notice from the “Bytes: 3965” that this is the BLOB where the actual data size is just one more byte than the previous row.

Securefile block dump (extract)

tab 0, row 0, @0xa11
tl: 5511 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 09
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 ...
col  2: [3999]
 00 54 00 01 01 0c 00 80 00 01 00 00 00 01 00 00 10 de 99 96 0f 8b 48 90 0f
 85 01 00 0f 80 01 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
 ...
LOB 
Locator:
  Length:        84(3999)
  Version:        1
  Byte Length:    1
  LobID: 00.00.00.01.00.00.10.de.99.96
  Flags[ 0x01 0x0c 0x00 0x80 ]:
    Type: BLOB 
    Storage: SecureFile
    Characterset Format: IMPLICIT
    Partitioned Table: No
    Options: ReadWrite
  SecureFile Header:
    Length:   3979
    Old Flag: 0x48 [ DataInRow SecureFile ]
    Flag 0:   0x90 [ INODE Valid ]
    Layers:
      Lengths Array: INODE:3973
        01 00 0f 80 01 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee  ...

tab 0, row 1, @0x405
tl: 1548 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 0a
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [38]
 00 54 00 01 01 0c 00 80 00 01 00 00 00 01 00 00 10 de 99 98 00 12 40 90 00
 0c 21 00 0f 81 01 00 01 01 40 58 11 01
LOB
Locator:
  Length:        84(38)
  Version:        1
  Byte Length:    1
  LobID: 00.00.00.01.00.00.10.de.99.98
  Flags[ 0x01 0x0c 0x00 0x80 ]:
    Type: BLOB
    Storage: SecureFile
    Characterset Format: IMPLICIT
    Partitioned Table: No
    Options: ReadWrite
  SecureFile Header:
    Length:   18
    Old Flag: 0x40 [ SecureFile ]
    Flag 0:   0x90 [ INODE Valid ]
    Layers:
      Lengths Array: INODE:12
      INODE:
        21 00 0f 81 01 00 01 01 40 58 11 01

In this dump we can see that the BLOB column for the first row in the block has a length of 3,999 bytes (“col 2: [3999]”) but the next row – with one extra byte of data – reports a BLOB column holding only 38 bytes (“col 2: [38]”). We know from our code that the 3,999 bytes was 3,968 bytes of data, with 31 bytes for various fragments of LOB overhead. So there is a small difference in size between basicfile and securefile BLOBs before BLOBs declared as “enable storage in row” are forced out of the row. That’s in 11g, of course, but is it the same in 19c?

19.11.0.0

        ID REL_FILE_NO   BLOCK_NO SYS_OP_OPNSIZE(B1) DBMS_LOB.GETLENGTH(B1)
---------- ----------- ---------- ------------------ ----------------------
         1          36       1672                114                   3961
         2          36       1676                114                   3962
         3          36       1680                114                   3963
         4          36       1684                114                   3964
         5          36       1684                134                   3965
         6          36       1688                134                   3966
         7          36       1688                134                   3967
         8          36       1688                134                   3968
 ...
       ID REL_FILE_NO   BLOCK_NO SYS_OP_OPNSIZE(B1) DBMS_LOB.GETLENGTH(B1)
---------- ----------- ---------- ------------------ ----------------------
         1          36       2184                114                   3961
         2          36       2188                114                   3962
         3          36       2192                114                   3963
         4          36       2196                114                   3964
         5          36       2200                114                   3965
         6          36       2204                114                   3966
         7          36       2208                114                   3967
         8          36       2212                114                   3968
         9          36       2212                132                   3969
        10          36       2216                132                   3970

Again I’ve reported the basicfile results before the securefile results. The numbers we get from sys_op_opnsize() don’t agree with the 11g results, but again we see from the dbms_lob.getlength() numbers that the first point at which we get two consecutive rows into a block is 3,964 bytes of data for the basicfile BLOB (36,1684) and 3,968 bytes for the securefile BLOB (36,2212). In this case the sys_op_opnsize() value changes on the second row of the block for both basicfile and securefile.

Of course I need to dump blocks again – just to be sure – but this time I’m going to keep the extract very short – just a few bytes of each column for the two rows in each block:

Basicfile blockdump (extract)

tab 0, row 0, @0xa10
tl: 5512 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 05
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [4000]
 00 70 00 01 01 0c 00 00 00 01 00 00 00 01 00 00 0c 55 e4 28 0f 8c 09 00 00
 00 00 00 0f 7c 00 00 00 00 00 01 ee ee ee ee ee ee ee ee ee ee ee ee ee ee

tab 0, row 1, @0x402
tl: 1550 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 06
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [40]
 00 70 00 01 01 0c 00 00 00 01 00 00 00 01 00 00 0c 55 e4 2a 00 14 05 00 00
 00 00 00 0f 7d 00 00 00 00 00 02 09 00 07 88

Securefile blockdump (extract)

tab 0, row 0, @0xa11
tl: 5511 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 09
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [3999]
 00 70 00 01 01 0c 00 80 00 01 00 00 00 01 00 00 0c 55 e4 31 0f 8b 48 90 0f
 85 01 00 0f 80 01 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee

tab 0, row 1, @0x405
tl: 1548 fb: --H-FL-- lb: 0x0  cc: 3
col  0: [ 2]  c1 0a
col  1: [1500]
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
...
col  2: [38]
 00 70 00 01 01 0c 00 80 00 01 00 00 00 01 00 00 0c 55 e4 33 00 12 40 90 00
 0c 21 00 0f 81 01 00 01 09 00 09 c1 01

The symbolic dump of LOBs has disappeared in the later versions of Oracle, but it’s quite easy to see from the raw stream of bytes that the raw dumps for 19c match the dumps for 11g, and the break points from in-row to out of row are the same in the two versions – it’s just the sys_op_opnsize() function calls that show inconsistent (and somewhat strange) behaviour.

Summary

This note walks through an examination of “in-row” BLOBs to find out how large the data can get before the BLOB is forced out of row.

We find that the breakpoint for securefile BLOBs is a few bytes larger than it is for basicfile BLOBs: 3,968 bytes compared to 3,964 bytes. These sizes, and this difference, are consistent between 11.2.0.4 and 19.11.0.0.

There does seem to be a tiny difference between securefiles and basicfiles: LOB overhead + raw data = 4,000 bytes for basicfiles while it is 3,999 bytes for securefiles.

Repeating the tests for different character sets (single byte, multibyte fixed, and multibyte varying) is left as an exercise for the reader.

January 30, 2023

Lost Or-Expand

Filed under: 12c,19c,CBO,Oracle,Transformations,Upgrades — Jonathan Lewis @ 1:39 pm GMT Jan 30,2023

I’ve commented previously on the “new” cost-based Or-Expansion introduced in 12c to replace the “legacy” Concatenation transformation, and I’ve been re-running some of my concatenation scripts to see whether the most recent versions of the optimizer will use Or-expansion unhinted in places where I’ve previously had to use hints to force concatenation to appear.

The latest test has produced a surprising result – I’ve got an example where 19c and 21c will use concatenation when hinted with use_concat(), but will not obey the or_expand() hint on the grounds that there’s “No valid predicate for OR expansion”.

It’s worth knowing this could happen if you’re upgrading from 11g to 19c (as many people seem to be doing at present): you may find that some statements that used to use concatenation unhinted can no longer do so automatically because they can’t switch to or-expansion, and they won’t use concatenation unless hinted.

tl;dr (the rest of the note is just a demonstration.) When you upgrade from 11g to 19c (or later) you may find that some queries perform badly because they stop using the legacy “concatenation” operator but can’t be transformed by the new “cost-based Or Expand” operator, and need to be hinted with a use_concat() hint.

Here’s a statement I can use to demonstrate the effect – I’ll post the code to create the tables at the end of the note:

select  /*+ gather_plan_statistics */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 and n2 = 10000)
or      (n1 = 10000 and n2 = 1)
;

I’ve rigged the data so that there are 9,999 distinct values of n1 each with one row, and 10,001 rows with the value 10,000; and I’ve done the same with n2 – 9,999 distinct values with one row each and 10,001 rows with the value 10,000.

I’ve gathered stats that include histograms on n1 and n2 (separately) and I’ve created indexes on n1 and n2 (separately). As a result the ideal path for this query is to use the index on n1 to find rows for the first of the two compound predicates and use the index on n2 to find rows for the second of the predicates, which should be possible if the optimizer first transforms the query using OR-expansion.

You’ll notice I’ve included the hint to capture rowsource execution statistics, so I’ll be executing this query with various hints and reporting the actual execution plans and workload. Using 19.11.0.0 and 21.3.0.0 with no special parameter settings the execution plan that appeared used B-tree/bitmap conversion:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |    45 (100)|      2 |00:00:00.01 |      50 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |    45   (3)|      2 |00:00:00.01 |      50 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |      1 |        |            |      2 |00:00:00.01 |      48 |
|   3 |    BITMAP OR                        |       |      1 |        |            |      1 |00:00:00.01 |      48 |
|   4 |     BITMAP AND                      |       |      1 |        |            |      1 |00:00:00.01 |      24 |
|   5 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  6 |       INDEX RANGE SCAN              | T1_N1 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|   7 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |      22 |
|*  8 |       INDEX RANGE SCAN              | T1_N2 |      1 |        |    21   (0)|  10001 |00:00:00.01 |      22 |
|   9 |     BITMAP AND                      |       |      1 |        |            |      1 |00:00:00.01 |      24 |
|  10 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|* 11 |       INDEX RANGE SCAN              | T1_N2 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|  12 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |      22 |
|* 13 |       INDEX RANGE SCAN              | T1_N1 |      1 |        |    21   (0)|  10001 |00:00:00.01 |      22 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   6 - access("N1"=1)
   8 - access("N2"=10000)
  11 - access("N2"=1)
  13 - access("N1"=10000)

This is a fairly clever plan but not what I wanted to test, so I set the hidden parameter ‘_b_tree_bitmap_plans’ to false for all subsequent tests. With bitmap conversion blocked the plan changed to a full tablescan:

-------------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |    99 (100)|      2 |00:00:00.01 |     349 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |      1 |    99   (2)|      2 |00:00:00.01 |     349 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((("N1"=1 AND "N2"=10000) OR ("N1"=10000 AND "N2"=1)))
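
If you want to reproduce the change, blocking the conversion needs nothing more than the following in the test session – a sketch only: note the double quotes around the hidden parameter name, and treat any fiddling with hidden parameters as something for test systems:

alter session set "_b_tree_bitmap_plans" = false;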


Definitely not what I wanted – so I added a hint telling the optimizer I wanted to see OR-expansion. The optimizer produced the same full tablescan! Since I had included the format option ‘hint_report’ in my call to dbms_xplan.display_cursor() I can show you the extra lines of output that explained why the optimizer “ignored” my hint:

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   1 -  SEL$1
         U -  or_expand(@sel$1 (1) (2)) / No valid predicate for OR expansion

As you can see the hint was not “N – unresolved” or “E – Syntax error”. It was recognised, syntactically correct, notionally applicable, but unused because the optimizer couldn’t see a way to use it (even though we can see an obvious way to use it).
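
For anyone repeating the test, the reporting call I use looks something like the following – a sketch rather than my exact script, but the hint text is the form shown in the hint report above and the format options are standard dbms_xplan options:

select  /*+ gather_plan_statistics or_expand(@sel$1 (1) (2)) */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 and n2 = 10000)
or      (n1 = 10000 and n2 = 1)
;

select * from table(dbms_xplan.display_cursor(format=>'allstats last cost hint_report'));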

Idle curiosity then prompted me to try the use_concat() hint, in the form: “use_concat(@sel$1 1)” – here’s the resulting execution plan:

---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |     4 (100)|      2 |00:00:00.01 |       7 |
|   1 |  CONCATENATION                       |       |      1 |        |            |      2 |00:00:00.01 |       7 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       4 |
|*  3 |    INDEX RANGE SCAN                  | T1_N2 |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       3 |
|*  4 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    INDEX RANGE SCAN                  | T1_N1 |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=10000)
   3 - access("N2"=1)
   4 - filter(("N2"=10000 AND (LNNVL("N2"=1) OR LNNVL("N1"=10000))))
   5 - access("N1"=1)

Exactly the plan I wanted to see from or_expand(), although the two subqueries are in the reverse order to the order I would expect from or_expand(). So the new cost-based or-expansion says there’s no valid predicate available for expansion, but the old, deprecated, heuristic, concatenation transformation manages to find a disjunct (OR) that can be expanded.

Of course the next thing to do is look at the predicted cost and actual work (mostly buffer gets) that Oracle reported for each plan:

  • bitmap conversion: (cost 45, buffers 50)
  • full tablescan: (cost 99, buffers 349)
  • concatenation: (cost 4, buffers 7)

The predicted costs are actually fairly consistent with buffer gets (which, if I flushed the cache, would also be mostly disk reads). I had been fairly impressed that the optimizer picked bitmap conversion, but it would have been so much better if the optimizer could see that this (slightly complex) set of predicates included an opportunity for or-expansion.

Footnote 1

This query shows an example of disjunctive normal form (DNF), i.e. the where clause is a disjunct (OR) of conjuncts (ANDs). I understand that optimizers (in general) quite like this form, but there is another “nice” form which is CNF (conjunctive normal form), i.e. where the where clause is a conjunct (AND) of disjuncts (ORs). So, for entertainment, I rewrote the where clause in conjunctive normal form. You have to be a little careful when you play the “normal form” game, it’s quite easy to get it wrong, so here are the steps I took (using A, B, C, D instead of my 4 atomic predicates):

(A and B) or (C and D) ==
        (A or (C and D)) and (B or (C and D)) ==               -- distributing the (A and B)
        (A or C) and (A or D) and (B or C) and (B or D)        -- distributing the two occurrences of (C and D)

Here’s the resulting query and unhinted execution plan after substituting “n1 = 1” etc. back into the symbolic presentation (and it probably gives you some idea why I played safe by starting with A, B, C, D):

select  /*+ gather_plan_statistics */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 or n2 = 1) 
and     (n1 = 1 or n1 = 10000) 
and     (n2 = 10000 or n2 = 1)
and     (n2 = 10000 or n1 = 10000)
;

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                 |      1 |        |     4 (100)|      2 |00:00:00.01 |       7 |
|   1 |  VIEW                                 | VW_ORE_BA8ECEFB |      1 |      2 |     4   (0)|      2 |00:00:00.01 |       7 |
|   2 |   UNION-ALL                           |                 |      1 |        |            |      2 |00:00:00.01 |       7 |
|*  3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1              |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       4 |
|*  4 |     INDEX RANGE SCAN                  | T1_N1           |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1              |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|*  6 |     INDEX RANGE SCAN                  | T1_N2           |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("N2"=10000)
   4 - access("N1"=1)
   5 - filter(("N1"=10000 AND LNNVL("N1"=1)))
   6 - access("N2"=1)

It’s the OR-expansion I wanted to see.

If I can do an algorithmic rewrite that produces the desired plan then the optimizer can be coded to do the rewrite – so I think you can expect to see this limitation removed at some future point. This plan, however, did still depend on my disabling B-tree/bitmap conversion; when I enabled B-tree/bitmap conversion the optimizer used it to produce the following plan:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |     2 (100)|      2 |00:00:00.01 |       6 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |      1 |        |            |      2 |00:00:00.01 |       4 |
|   3 |    BITMAP OR                        |       |      1 |        |            |      1 |00:00:00.01 |       4 |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  5 |      INDEX RANGE SCAN               | T1_N1 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|   6 |     BITMAP CONVERSION FROM ROWIDS   |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  7 |      INDEX RANGE SCAN               | T1_N2 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((INTERNAL_FUNCTION("N1") AND INTERNAL_FUNCTION("N2") AND ("N2"=10000 OR "N1"=10000)))
   5 - access("N1"=1)
   7 - access("N2"=1)

The thing to note in this case, though, is that the B-tree/bitmap conversion is logically the correct thing to choose when you compare the estimated cost and actual workload:

  • or-expansion: (cost 4, buffers 7)
  • bitmap conversion: (cost 2, buffers 6)

Footnote 2

Mohamed Houri wrote an article on Or-expansion a year ago explaining the possible settings for the hidden parameter “_optimizer_cbqt_or_expansion”, which can be set to off, on, linear, greedy or two_pass. I tried all the options to see if they would make any difference (apart from the obvious impact of “off”); they didn’t.
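
For reference, “trying the options” means something like the following sketch – either setting the parameter for the session or injecting it for a single statement with the opt_param() hint:

alter session set "_optimizer_cbqt_or_expansion" = linear;

select  /*+ gather_plan_statistics opt_param('_optimizer_cbqt_or_expansion' 'greedy') */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 and n2 = 10000)
or      (n1 = 10000 and n2 = 1)
;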

Source code

If you want to do further experiments, here’s the script I used to generate the data:

rem
rem     Script:         concat_3b.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2008 / Jan 2023
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             21.3.0.0
rem

create table t1
as
with generator as (
        select
                rownum  id
        from    dual
        connect by level <= 10000
)
select
        rownum                  n1,
        10000                   n2,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1
;

insert /*+ append */ into t1
select
        n2, n1, small_vc, padding
from
        t1
;

commit;

create index t1_n1 on t1(n1);
create index t1_n2 on t1(n2);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for columns size 100 n1, n2'
        );
end;
/


January 28, 2023

Case Study

Filed under: Oracle,Troubleshooting — Jonathan Lewis @ 12:46 pm GMT Jan 28,2023

This example appeared quite recently on one of the public Oracle Forums, and the person who posted it had already worked out (the important bit of) what was going on before anyone asked for more information or supplied any suggestions for action (although there was one response pointing to a MOS Note that might have had some relevance and did contain some interesting, though unrelated, background information).

The title of the question was: “KTSJ process running select dbms_rowid.rowid_type(rowid) causing huge load”, and the opening post contained a number of bits of information about what KTSJ is and how it relates to SMCO. The critical details, though, were the SQL, and what it was running on.

The system was a 4 node RAC, running 19.12. The problem query was running multiple times between 4:00 a.m. and 9:00 a.m. and the table it references is about 4 GB and “not fragmented”. The table is “basic compressed” and partitioned on a date column, there are no LOB columns, and there are about 200,000 updates on the table “each morning” (but no comment about whether this is before, during, or after the performance problem). The query was simple:

select dbms_rowid.rowid_type(rowid) from <owner>.<table_name> where rownum=1;

In a follow-up post we got the following information about a typical run of the query:

Global Stats
============================================================================
| Elapsed |    Cpu  |      IO  |  Cluster | Fetch | Buffer |  Read |  Read |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls |   Gets |  Reqs | Bytes |
============================================================================
|     123 |      16 |       94 |       13 |     1 |     1M | 81215 |  10GB |
============================================================================

If you’re familiar with SQL Monitor you’ll recognise this as the summary information from an SQL Monitor report – which means, of course, that the OP also has an execution plan for the query.

It’s easy of course to do the Bart Simpson “I knew that” claim after the OP has given the answer, but even with this tiny amount of information there are several clues to tell you what’s probably going on and why.

Inferences

In this case the query is so simple there are virtually no options for the optimizer. It’s a reference to a single table, it’s asking only for the rowid, and it’s going to stop after one row is returned – there are only three possible (unhinted) execution plans: a tablescan, an index fast full scan, or an index full scan – all with a stopkey. In fact, it’s only in earlier versions of Oracle that the optimizer would choose an index full scan over an index fast full scan.

The idea of a segment scan (whether tablescan or index fast full scan) may sound a bit extreme as a path to return just one row, but the optimizer knows that it’s going to stop after one row and adjusts the cost accordingly. Since indexes on a table are (almost always) smaller than the table itself we will probably see an index fast full scan on whichever index reports the smallest number of leaf blocks (and includes at least one column declared not null), and we should see the scan stopping after a very small number of block visits.

The Global Stats tell us a completely different story. We might be expecting one multiblock read of 128 blocks, and possibly a few more single block reads for read-consistency (undo records applied) reasons; we’ve actually done 81,215 Read Requests totalling about 10GB (which really means 9.5GB to 10.5GB), or roughly 1.3 million blocks. This average read size of fractionally over 16 blocks suggests a huge segment scan that didn’t stop early. As a quick sanity check you’ll note we’ve also reported 1M Buffer Gets (which means 0.5M to 1.5M), which is consistent with a segment scan into the buffer cache.
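
A quick back-of-envelope check on that claim (assuming the common 8KB block size):

        10 GB / 8KB per block                       ~= 1,310,000 blocks
        1,310,000 blocks / 81,215 read requests     ~= 16.1 blocks per read request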

  • Question: Why would Oracle do a huge segment scan when we know it should have stopped scanning almost immediately?
  • Answer: there was a huge amount of empty space at the “start” of the segment.

One detail the OP told us was that the table was only 4GB and “not fragmented” – I don’t think you’d say that if there were 10GB of empty blocks in the table before a scan got to the first row. On the other hand, the OP didn’t say anything about the size or state of any indexes and we’re expecting an index fast full scan of the “smallest” index on the table.

So how does the smallest index on a 4GB table turn into a 10GB fast full scan? The answer is in three parts:

  • First is that the optimizer is looking at user_indexes.leaf_blocks to calculate the cost of a fast full scan.
  • Second is that the code to generate the value of leaf_blocks doesn’t see empty leaf blocks.
  • Third is that there are patterns of data processing that can leave an index in a state where it holds a huge number of empty leaf blocks that haven’t been flagged for re-use.

The OP said there were about 200K updates on the table each morning – what if the activity was something like:

  • Insert a batch of rows with a status column holding a starting value
  • The column is indexed to allow rapid access to “new” (and other interesting, “rare”) status values
  • The column cycles through a few different status values before getting to a “final” status.

Depending on batch sizes, timing, concurrency effects, version of Oracle, and the possible values for status you could end up with an index that had a huge number of empty leaf blocks at the “low value” end and all the populated leaf blocks at the “high value” end. It’s not supposed to happen, but occasionally it does. If you gathered index stats Oracle would count only the blocks that held the few high values but when you did an index fast full scan you’d have to scan through a load of empty leaf blocks before you found one with an index entry.
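
If you suspect an index has got into this state a simple first check is to compare the optimizer’s view of the index size (leaf_blocks from the stats) with the space actually allocated to the segment. The following is only a sketch – the table name is invented, it assumes a simple non-partitioned index, and you may need the dba_ views rather than the user_ views (and the stats may, of course, be out of date):

select
        ind.index_name,
        ind.leaf_blocks,
        seg.blocks                      allocated_blocks,
        round(seg.bytes/1048576)        size_mb
from
        user_indexes    ind,
        user_segments   seg
where
        ind.table_name   = 'BIG_TABLE'
and     seg.segment_name = ind.index_name
;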

Summary

Realistically the correct diagnostic method for the problem would have been to check the plan (and activity) from the SQL Monitor report, notice the index fast full scan and (if they were visible) the multiblock read waits, then check the state of the index. This note was just a little lesson on the way in which you can make reasonable guesses (inferences / deductions) from insufficient data so that you have some idea of what might have gone wrong.

In this case the steps were:

  1. Know that there are (realistically) only two possible execution paths for the stated query
  2. Do some arithmetic to recognise the activity as a large segment scan
  3. Eliminate the tablescan because the description of its state doesn’t allow for the observed phenomenon
  4. Recognise the potential for an index to get into a state that matches the phenomenon

Lagniappe

The OP had already replied to his own question explaining that he had identified an index that had grown to 13GB (when the table was only 4GB), and that the index was only 0.03GB (30MB) after a rebuild, and the problem disappeared.

This suggests a need for two further investigations:

  • Why does the index grow so much, and does it need to be coalesced or shrunk on a very regular basis (e.g. daily, or once per week)? (See the sketch after this list.)
  • An average table row is 133 (= 4 / 0.03) times as long as its index entry. That sounds like the rows may be rather long (at least 1,470 bytes, since the minimum Oracle allows for an index entry is 11 bytes). Has the table’s pctfree been set for optimum storage, and might the table be suffering from lots of migrated rows?
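
For the first bullet, the sort of regular maintenance I have in mind looks something like this – a sketch with an invented index name; coalesce is usually the cheaper option, while shrink space has a few more restrictions (e.g. the segment needs to be in an ASSM tablespace) but can return space to the tablespace:

alter index t1_status_idx coalesce;

alter index t1_status_idx shrink space compact;         -- tidy the index without moving the high water mark
alter index t1_status_idx shrink space;                 -- also hand space back to the tablespace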

January 26, 2023

Indexing foundations

Filed under: Indexing,Oracle — Jonathan Lewis @ 6:08 pm GMT Jan 26,2023

Here are the introductory comments I made at a recent “Ask me Anything” session about indexing arranged by the All India Oracle User Group:

There are 4 fundamental thoughts that you need to bear in mind whenever you’re thinking about what indexes your application needs:

  1. The intent of indexing is to find the data you need with the minimum use of critical resources (which may be disk I/Os for relatively small systems, CPU for large systems).
  2. High precision is important – the ideal use of indexes is to avoid visiting table blocks that don’t hold useful data. With ideal indexing, query performance is dictated by the amount of data you want to manipulate not by the total amount of data in the database.
  3. Creating precise indexes for every query requirement leads to a lot of real-time maintenance when you modify data so you need to balance the resources needed for DML against the resources for queries.
  4. Oracle offers many ways to minimise the work you need to do to use and to maintain indexes. (So you need to think about other mechanisms every time you think about adding indexes)

If you can keep these points in mind then everything else you need to think about for your specific system follows as a logical consequence.

I did consider adding one more comment about the most significant difference between B-tree and Bitmap indexes – which is that individual B-tree indexes tend to be very precise while precision is only achieved with Bitmap indexes by combining them – but that’s just adding a specific example to a list of general principles and could even have the effect of deflecting people from the thought of combining (necessary but annoying) low-precision B-tree indexes.

January 15, 2023

Quiz Night

Filed under: Infrastructure,Oracle,Parallel Execution,Troubleshooting — Jonathan Lewis @ 6:25 pm GMT Jan 15,2023

Here’s a little extract from one of my “snap instance activity stats” packages picking out the figures where the upper case of the statistic name is like ‘%PARALLEL%’. I was very careful that nothing except one of my SQL*Plus sessions had done anything in the few seconds between the start and end snapshots so there’s no “(un)lucky timing” to explain the anomaly in these figures.

The question is: how can Oracle manage to claim hundreds of “checkpoint buffers written for parallel query” when there had been no parallel statements executing around the time the snapshots were taken?

Name                                                        Value
----                                                   ----------
DBWR parallel query checkpoint buffers written              1,430
queries parallelized                                            0
DML statements parallelized                                     0
DDL statements parallelized                                     0
DFO trees parallelized                                          0
Parallel operations not downgraded                              0
Parallel operations downgraded to serial                        0
Parallel operations downgraded 75 to 99 pct                     0
Parallel operations downgraded 50 to 75 pct                     0
Parallel operations downgraded 25 to 50 pct                     0
Parallel operations downgraded 1 to 25 pct                      0

Here’s a little background information if you don’t know why this is a puzzle.

When you start executing a parallel tablescan (or index fast full scan) the first step your session takes is to tell DBWR to write to disc any dirty blocks for the object you’re about to scan and wait for DBWR to confirm that the blocks have been written. The session does this because a parallel scan will bypass the buffer cache and read directly from disc to the PGA, so if there were any dirty blocks for the object in the buffer cache the query would not see them and would not be able to produce a correctly read-consistent result. The blocks written by DBWR in this case would (should) be reported under “DBWR parallel query checkpoint buffers written”.

Answer

It didn’t take long for the correct answer to appear in the comments. I updated a large number of rows in a table, then I set the parameter “_serial_direct_read” to “always” in my session and executed a query that did a tablescan of that table.
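
In outline the steps looked something like the following sketch – the table, column and update are invented, the two key details are leaving a lot of dirty buffers in the cache and then forcing serial direct path reads:

update t1 set padding = upper(padding) where rownum <= 50000;   -- leave a lot of dirty buffers in the cache
commit;

alter session set "_serial_direct_read" = always;

select count(*) from t1;        -- serial tablescan, now using direct path reads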

My session called the database writer to do an “object checkpoint”, then started waiting on event “enq: KO – fast object checkpoint” before reading the table using direct path reads of (just under) 1MB each.

I published this note because a question came up on the Oracle developer forum which (as a side note) had noted the apparent contradiction in a Statspack report, and I thought it was a useful lesson covering two important points.

First: the Oracle code keeps evolving, and the instrumentation and statistics don’t always keep up; in this case I think there’s a code path that says “fast object checkpoint”, but doesn’t have a flag that separates the newer serial scan option from the original parallel scan option – hence serial scans triggering “parallel query checkpoints”.

Secondly: when you see some performance figures that don’t seem to make sense or contradict your previous experience, it’s worth thinking around the oddity to see if you can come up with a feasible (and testable) explanation along the lines of “what if Oracle’s doing something new but calling some old instrumentation”.

Footnote

One of the comments includes some questions about how the whole parallel query / read-consistency / checkpointing works if more DML is happening as the query runs. I’ll come back to that in a couple of days.

January 11, 2023

Quiz Night

Filed under: Indexing,Oracle,Statistics — Jonathan Lewis @ 5:27 pm GMT Jan 11,2023

I may have given the answers to this little puzzle elsewhere on the blog already, but if so I can’t find where, and it’s worth a little note of its own.

I have a non-partitioned heap table that is a subset of all_objects, and I’m going to execute a series of calls to dbms_stats to see what object stats appear. When I first wrote the script Oracle didn’t gather stats automatically when you created a table, so I’ve had to add in a call to delete_table_stats immediately after creating the table.

I’ve run a little script to display various stats about the table at each stage; that part of the process is left to any readers who want to check my answers.


rem     Script:         gather_col_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2015
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             11.2.0.4
rem

drop table t1;

@@setup

create table t1 
as
select  * 
from    all_objects
where   rownum <= 50000
;


execute dbms_stats.delete_table_stats(user,'t1');
start show_simple_stats t1

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns created size 1')
start show_simple_stats t1

delete from t1 where rownum <= 2000;

create index t1_i1 on t1(object_id);
start show_simple_stats t1

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all indexed columns size 1')
start show_simple_stats t1

After creating the table I call delete_table_stats() – does this definitely mean there are no stats on the table?

I’ve then gathered stats on just one column of the table – what has Oracle really done?

I’ve then deleted a couple of thousand rows and created an index on the table. (Why did I do the delete?) Since this is 11g (at least) what extra stats have come into existence?

I’ve then gathered stats for just the indexed columns – what has Oracle really done?

Answers

The script I wrote to report statistics was a set of simple queries against user_tables, user_indexes, user_tab_cols, and user_tab_histograms in turn. After the call to delete_table_stats() the first report showed that there were no statistics about the table in any of these views.
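
I haven’t published the show_simple_stats script, but the queries are no more sophisticated than the following sketch (column lists trimmed, the table name passed in as the first substitution variable):

select  blocks, num_rows, avg_row_len, chain_cnt, sample_size
from    user_tables
where   table_name = upper('&1')
;

select  column_name, sample_size, num_distinct, num_nulls, density,
        histogram, num_buckets, avg_col_len, global_stats, low_value, high_value
from    user_tab_cols
where   table_name = upper('&1')
order by
        column_id
;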

After the gathering stats but specifying just one column the report showed the following

Table stats
===========
    BLOCKS   NUM_ROWS       RLEN  CHAIN_CNT       Sample
---------- ---------- ---------- ---------- ------------
      1009      50000        135          0       50,000

Table column stats
==================
COLUMN_NAME                            Sample     Distinct  NUM_NULLS    DENSITY HISTOGRAM          Buckets CLEN GLO LOW_VALUE                  HIGH_VALUE
-------------------------------- ------------ ------------ ---------- ---------- --------------- ---------- ---- --- -------------------------- -----------------
APPLICATION                                                                      NONE                            NO
CREATED                                50,000          532          0 .001879699 NONE                     1    8 YES 7877041101390F             78790A0E0A330F
CREATED_APPID                                                                    NONE                            NO
...

Table histogram stats
=====================
COLUMN_NAME                      ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
-------------------------------- --------------- -------------- ----------------------------------------
CREATED                                        0     2458591.04
                                               1     2459502.41

Obviously we expect to see the column stats for just the one column, and the histogram information is just reporting the low and high values since we have requested “no histogram”. The interesting point, though, is that the table stats show that Oracle has also derived an average row length (avg_row_len – heading rlen) – so where did that come from?

Index stats
===========

INDEX_NAME               BLEVEL LEAF_BLOCKS   NUM_ROWS DISTINCT_KEYS CLUSTERING_FACTOR       Sample    AVG_LBK    AVG_DBK
-------------------- ---------- ----------- ---------- ------------- ----------------- ------------ ---------- ----------
T1_I1                         1         107      48000         48000              1003       48,000          1          1


After deleting 2,000 rows and creating an ordinary B-tree index we get an extra entry in the stats report covering the index stats; all the other parts of the report remain unchanged. We don’t see any column stats for the column used to define the index, but one thing we do see is that user_indexes correctly reports 48,000 as its num_rows while the num_rows in user_tables has not been updated with this bit of information.

Little anecdote: about 20 years ago I was called in to find the root cause of a performance problem. After a 3 hour management meeting I was allowed to examine the 10053 trace the DBAs had already dumped in anticipation of my arrival – it took 10 minutes to find an Oracle bug that appeared when the num_rows for a primary key index was smaller than the num_rows for the table. (The bug was fixed in the next release). And, of course, you should remember that num_rows for an index could well be much smaller than num_rows in user_tables thanks to NULLs.

After the call to gather stats “for all indexed columns” the report produced the following results

Table stats (updated)
=====================

    BLOCKS   NUM_ROWS       RLEN  CHAIN_CNT       Sample
---------- ---------- ---------- ---------- ------------
      1009      48000        136          0       48,000


Index stats (no change)
=======================

INDEX_NAME               BLEVEL LEAF_BLOCKS   NUM_ROWS DISTINCT_KEYS CLUSTERING_FACTOR       Sample    AVG_LBK    AVG_DBK
-------------------- ---------- ----------- ---------- ------------- ----------------- ------------ ---------- ----------
T1_I1                         1         107      48000         48000              1003       48,000          1          1

Table column stats - indexed columns now set
============================================

COLUMN_NAME                            Sample     Distinct  NUM_NULLS    DENSITY HISTOGRAM          Buckets CLEN GLO LOW_VALUE                  HIGH_VALUE
-------------------------------- ------------ ------------ ---------- ---------- --------------- ---------- ---- --- -------------------------- --------------------------
APPLICATION                                                                      NONE                            NO
CREATED                                50,000          532          0 .001879699 NONE                     1    8 YES 7877041101390F             78790A0E0A330F
CREATED_APPID                                                                    NONE                            NO
..
NAMESPACE                                                                        NONE                            NO
OBJECT_ID                              48,000       48,000          0 .000020833 NONE                     1    5 YES C22547                     C3083902
OBJECT_NAME                                                                      NONE                            NO
...


Table histogram stats (indexed columns now added)
=================================================

COLUMN_NAME                      ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE
-------------------------------- --------------- -------------- ----------------------------------------
CREATED                                        0     2458591.04
                                               1     2459502.41
OBJECT_ID                                      0           3670
                                               1          75601

It’s not surprising to see that the index stats haven’t changed and the indexed column now has stats. A detail of the column stats (which is still not a surprise) is that the original 50,000 sample size is there for the created column; Oracle hasn’t been asked to do anything about it, so it hasn’t done anything about it.

One thing that has changed, though, is that the table stats now show a corrected num_rows, and a change to the average row length (rlen). It’s the row length that is the critical feature I want to highlight in this little quiz. How does Oracle get this estimate? The answer is one you don’t really want to hear: it does a lot of work to calculate it – and for some systems the work done will be huge.

Here’s the SQL (pulled from the tkprof summary of a trace file, with a little cosmetic editing) that Oracle ran in 19.11 to gather “just” the stats on the created column.

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring
  xmlindex_sel_idx_tbl opt_param('optimizer_inmemory_aware' 'false')
  no_substrb_pad  */
  to_char(count("OWNER")),to_char(count("OBJECT_NAME")),
  to_char(count("SUBOBJECT_NAME")),to_char(count("OBJECT_ID")),
  to_char(count("DATA_OBJECT_ID")),to_char(count("OBJECT_TYPE")),
  to_char(count("CREATED")),
  substrb(dump(min("CREATED"),16,0,64),1,240),
  substrb(dump(max("CREATED"),16,0,64),1,240),
  to_char(count("LAST_DDL_TIME")),
  to_char(count("TIMESTAMP")),to_char(count("STATUS")),
  to_char(count("TEMPORARY")),to_char(count("GENERATED")),
  to_char(count("SECONDARY")),to_char(count("NAMESPACE")),
  to_char(count("EDITION_NAME")),to_char(count("SHARING")),
  to_char(count("EDITIONABLE")),to_char(count("ORACLE_MAINTAINED")),
  to_char(count("APPLICATION")),to_char(count("DEFAULT_COLLATION")),
  to_char(count("DUPLICATED")),to_char(count("SHARDED")),
  to_char(count("CREATED_APPID")),to_char(count("CREATED_VSNID")),
  to_char(count("MODIFIED_APPID")),to_char(count("MODIFIED_VSNID"))
from
 "TEST_USER"."T1" t  /* ACL,ACL,ACL,ACL,ACL,ACL,NDV,NIL,NIL,NNV,ACL,ACL,ACL,
  ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL,ACL*/

I’ve highlighted the three lines in the statement where the simple stats about the created column are derived in pure SQL, but you can also see that every single column in the table also has its number of non-null entries counted, and at the end of the query you can see the “comment” code that triggers the routine to calculate the average column length (ACL) for almost all the columns.

So gathering stats for “just one” column actually does a lot of work getting lengths and counts of all the other columns in the table – and then discards them rather than storing partial stats for the user_tab_cols view.

Summary

When you try to gather stats for a single column of a table – after, for example, you’ve just added a function-based index – Oracle will have to do a lot more work than you might have realised because it wants to update the average row length (avg_row_len) for the table, and to do that it calculates the average column length (avg_col_len) for every single column.
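
If you want to see the relationship for yourself, the derived avg_row_len is (approximately) just the sum of the individual avg_col_len values, so a quick sanity check along the following lines should come close – a sketch only, and bear in mind that any hidden or virtual columns in user_tab_cols can add a little to the sum:

select
        tab.num_rows,
        tab.avg_row_len,
        sum(col.avg_col_len)    sum_avg_col_len
from
        user_tables     tab,
        user_tab_cols   col
where
        tab.table_name = 'T1'
and     col.table_name = tab.table_name
group by
        tab.num_rows, tab.avg_row_len
;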

January 6, 2023

Case Study

Filed under: Execution plans,Oracle — Jonathan Lewis @ 4:34 pm GMT Jan 6,2023

Here’s a query that appeared on the MOS “SQL Performance” forum (needs a support account) with the request: “Can someone please assist me to optimize this query?”

It looks like a nice simple query (though looks can be deceiving) so I thought I’d use it as another example on how to think about tuning SQL.

SELECT 
        MSI.SEGMENT1, OL.SHIP_FROM_ORG_ID, 
        OL.ORDERED_QUANTITY SOLD_QTY, 
        OL.UNIT_SELLING_PRICE SELLING_PRICE
FROM 
        OE_ORDER_HEADERS_ALL   OH, 
        OE_ORDER_LINES_ALL     OL, 
        MTL_SYSTEM_ITEMS_B     MSI
WHERE 
        OH.HEADER_ID         = OL.HEADER_ID
AND     OL.SHIP_FROM_ORG_ID  = MSI.ORGANIZATION_ID
AND     OL.INVENTORY_ITEM_ID = MSI.INVENTORY_ITEM_ID
AND     NVL(oh.source_document_type_id,0) <> 10
AND     SUBSTR(SEGMENT1,4,3) = 'FIF'
AND     UPPER(OL.FLOW_STATUS_CODE) NOT IN ('ENTERED','CANCELLED')

I really dislike (and actually have trouble) reading blocks of text in upper case, so before I do anything else here’s the same code converted to (mostly) lower case, with one minor adjustment:

select
        msi.segment1, 
        ol.ship_from_org_id, 
        ol.ordered_quantity     sold_qty,
        ol.unit_selling_price   selling_price
from 
        oe_order_headers_all   oh, 
        oe_order_lines_all     ol, 
        mtl_system_items_b     msi
where 
        oh.header_id         = ol.header_id
and     ol.ship_from_org_id  = msi.organization_id
and     ol.inventory_item_id = msi.inventory_item_id
and     nvl(oh.source_document_type_id,0) <> 10
and     substr(msi.segment1,4,3) = 'FIF'
and     upper(ol.flow_status_code) not in ('ENTERED','CANCELLED')

I’ve highlighted line 15 to pick out the minor adjustment: I’ve added a table alias to the expression substr(segment1,4,3). Every column reference should include its table alias. The query was pretty good in following this guideline and it was fairly easy to work out the appropriate alias here because we can also see msi.segment1 in the select list and if there had been a column called segment1 in one of the other tables Oracle would have reported error “ORA-00918: column ambiguously defined“.

We’ve been told that the query is a “sales query” against Oracle EBS R12, and we’ve been given an execution plan – though the plan, unfortunately, is an image from one of the non-Oracle GUIs:

Conveniently the plan tells us that the three tables in the query really are tables and not views that hide complex subqueries; it also gives us some idea of the size of two of the tables (very big). What it doesn’t give us is any information about how and where the original predicates have been used.

Things we don’t know

  • What the query means (in business terms)
  • How long it takes the query to run
  • How long it probably should take the query to run
  • How fast the OP would like the query to run
  • How often the query will be run
  • How many rows from each table have to participate in deriving the result
  • How many rows and blocks there are in each table
  • How many distinct values there are for each of the columns in the where clause.
  • Whether there are any virtual columns (or extended stats) on the table
  • Whether there are any (numerically) skewed data distribution patterns
  • What physical patterns there might be in the rows identified by the predicates.
  • What the definitions of any available indexes are (including “function-based”)
  • Whether or not we’re allowed to change the code, or whether we have to “hint” it somehow.
  • Whether this is the only set of literal values that would appear or whether it’s generated code that allows many variations to appear.
  • Whether this is a case where a production query using bind variables is being tested with one known set of values.
  • Which version of Oracle and optimizer_features_enable / hacked optimizer parameters

Basic threats

Every single predicate that compares a column with a literal hides the column inside a function call – which means the optimizer may be “losing” important statistical information.

Two of the three literal-based predicates are “negative”, i.e. “not equal” and “not in”, which is another detail that can mess up the optimizer’s arithmetic (though the specific effects may vary with version and the column statistics).

Observations and Guesswork

The optimizer cardinality estimate for upper(ol.flow_status_code) not in (‘ENTERED’,’CANCELLED’) is 76,240 with a tablescan cost of 989,658. That looks like a very small fraction of a very large table. But it’s possible that this is the standard estimate for “unknown value not in (list of 2 items)” which would be 5% of 5% in 19c. Certainly 76,240 * 20 * 20 = 30M sounds like a suitable number of rows for a table with a tablescan cost close to a million. Possibly a better cardinality estimate would change the plan.

The name flow_status_code suggests a column that would have a small number of distinct values with an enormous data skew. If we had extended stats, or a virtual column, on upper(flow_status_code) with a histogram in place the optimizer might pick a completely different path. If the estimate became very small it might be able to choose nested loops and an indexed access path all the way through the query; if the estimate became very large it might decide to use a different join order.
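
Just to show the sort of thing I mean, extended stats on the expression could look something like the following – a sketch only: the schema is assumed to be the current user, the 254-bucket histogram is just a convenient default, and you would want to think carefully before adding statistics definitions to a standard EBS table:

select
        dbms_stats.create_extended_stats(
                user, 'OE_ORDER_LINES_ALL', '(upper(flow_status_code))'
        )
from    dual
;

begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'OE_ORDER_LINES_ALL',
                method_opt  => 'for columns (upper(flow_status_code)) size 254'
        );
end;
/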

The optimizer cardinality estimate for nvl(oh.source_document_type_id,0) <> 10 is 6.65M with a tablescan cost of 95,823. The selectivity for this predicate is derived as 1 – selectivity(nvl(oh.source_document_type_id,0)= 10), which is derived as 1 – selectivity( oh.source_document_type_id = 10).

At this point I could launch into all sorts of speculation about the column: the number of distinct values, the number of nulls, the existence (or not) of a histogram, a comparison between the blocks and cardinality of this “order headers” table and the estimates made above for the “order lines” table – but trying to cover all the options would make for a long and tangled document, so I’ll just show you one example that might be a valid model for this predicate (a possible build script is sketched after the list):

  • t1 is a table with 100,000 rows
  • data_type_id is a numeric column with 80 distinct values
  • there are 1,000 rows in t1 where data_type_id is null
  • I’ve gathered stats just before executing a couple of queries
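
The build script wasn’t part of the original test, but something like the following produces data of roughly the right shape – the exact distribution (and therefore the exact row counts) is an assumption, but the stats it produces match the estimates shown below:

create table t1
as
with generator as (
        select  rownum id
        from    dual
        connect by level <= 1e5
)
select
        rownum                  id,
        case
                when rownum <= 1000 then to_number(null)        -- 1,000 nulls
                when rownum <= 3000 then mod(rownum,79) + 51    -- 79 "rare" values
                else 50                                         -- the extremely common value
        end                     data_type_id,
        rpad('x',100)           padding
from
        generator
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')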

Here are the two queries, each followed by the critical line from its execution plan:

SQL> select count(*) from t1 where nvl(data_type_id,0) <> 50;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|*  2 |   TABLE ACCESS FULL| T1   | 98763 |   289K|   553   (2)| 00:00:01 |
---------------------------------------------------------------------------


SQL> select count(*) from t1 where data_type_id = 50;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|*  2 |   TABLE ACCESS FULL| T1   |  1238 |  3714 |   551   (1)| 00:00:01 |
---------------------------------------------------------------------------

A quick arithmetic check: add the two row estimates: 98,763 + 1,238 = 100,001. Allowing for rounding errors in the arithmetic that’s the full 100,000 rows in the table.

Unfortunately the actual count from the first query is 2,793 – the optimizer’s estimate is out by a factor of more than 35; and the actual count from the second query is 97,207 – the optimizer’s estimate is out (in the opposite direction) by a factor of more than 78.

The statistics are “up to date” – except 50 is an extremely common value for data_type_id and I haven’t created a histogram on the column.

So perhaps the original query is asking for the “rare” order headers but there’s no histogram to give the optimizer any clue that 10 (the unwanted value) is an extremely common value for the document_type_id. Would a suitable histogram allow the optimizer to produce a much lower cardinality estimate and a significant change in the execution plan?

It may be significant that the table order in the from clause, and the predicate order in the where clause suggests that the original programmer may have been aiming for an execution plan that started at the oe_order_headers_all table.

Finally we come to the predicate substr(msi.segment1,4,3) = ‘FIF’. I don’t know much about EBS but I can guess what the table names oe_order_headers_all and oe_order_lines_all represent; on the other hand I can’t think what is really behind the name mtl_system_items_b, especially when the column names that join it to the order lines table make it look as if it should be called something more like oe_products_all; and there’s no way to guess what the significance of a little chunk of a column called segment1 might be.

The only thing we can infer from the execution plan is that it must be a fairly huge table since the optimizer has chosen to probe it through a nested loop 76,240 times, with a total incremental cost of 152,551 (= 1,244,546 – 1,091,995), rather than doing a tablescan and hash join.

The cardinality and cost estimates show a couple of interesting details. First, there’s a clear error in the cardinality estimate of the nested loop join since the number of rows produced by the join is (very slightly) larger than the number of rows supplied to it by the hash join, even though it’s accessing the table by a unique scan of a unique index.

Secondly it’s worth commenting on the cost of the indexed access, which might seem low at only 2 when I’ve claimed it’s a “fairly huge” table which would presumably require traversing an index with a blevel of 2 (root, branch, leaf) before visiting the required table block – so a cost of 4 for the table visit would seem reasonable. In fact 4 is what we’d see for a non-unique index (on this data) and a query for “id = {constant}”; the optimizer has subtracted one for the uniqueness, and one because it’s a nested loop join.

If we know that there are only a very few rows where substr(segment1,4,3) = ‘FIF’, and if we think that this is a good starting table for the optimizer, then we need to create a virtual column (or extended stats) and might need to support that with a histogram. Even then, of course, it might not be a good table to have first in the join order.

Joins

So far we’ve only been considering ways to improve the optimizer’s cardinality estimates in the hope that better information would give it a better execution plan. We believe that all three tables are very large and suspect that if the actual volume of relevant data is small we can get a better path that uses nested loops from beginning to end. But if we want that path to be efficient we’re going to need suitable indexes, including a precision index into the first table in the join.

What we need to do now is consider the resources that might be needed to ensure we have the “perfect” indexes for the optimum path, compared with the excess resources that would be used if we could find a “good enough” path.

Looking at the query, my feeling is that there are two possible paths that might work reasonably well (assuming the required volume of data is sufficiently small):

  • oe_order_headers_all -> oe_order_lines_all -> mtl_system_items_b
  • oe_order_lines_all -> oe_order_headers_all -> mtl_system_items_b

I suspect (in the absence of any supplied information) that a path starting with mtl_system_items_b will do too much random I/O into the order lines table. (It’s a standard example of the problem displayed by pairings like products -> order_lines and customers -> orders: the order lines for any given product are likely to be scattered widely across the table, as are the orders for any particular customer.) Starting with mtl_system_items_b might do a lot of random I/O before discarding a lot of unwanted order lines.

Looking at the oe_order_lines_all table it strikes me that only a small number of rows will be newly entered or cancelled, and most of them will be in states like “completed”, “picked”, “invoiced”, etc., so the predicate on flow_status_code is probably not one that will eliminate a lot of data; so I’m just going to talk about the options for getting into the oe_order_headers_all table. Assuming the predicate “nvl(oh.source_document_type_id,0) <> 10” does identify a “small enough” number of rows then we probably have a good enough index (the foreign key index – which probably should be created with a degree of compression and will probably also be the primary key) into oe_order_lines_all, and we know we have a unique index from there into mtl_system_items_b.

So how do we access that small number of rows as efficiently as possible, with a minimum of overhead? We’ve set up a histogram on source_document_type_id so that the optimizer gets a better idea of the number of rows – but that still leaves us with a tablescan of a huge table unless we create a tailored index. Here’s an example of the type of approach we can take – based on a table t2 that has 50,000 rows, of which almost all have a document_type of ‘XX’ (which we want to ignore) and a few rows where the document_type is null (which have to be included in the ones we keep):

SQL> execute dbms_stats.gather_table_stats(user,'t2',method_opt =>'for all columns size 1 for columns size 254 document_type')

SQL> select count(*) from t2 where nvl(document_type,'YY') <> 'XX';

  COUNT(*)
----------
       300

Execution Plan
----------------------------------------------------------
Plan hash value: 3321871023

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   278   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T2   |   300 |  1200 |   278   (2)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NVL("DOCUMENT_TYPE",'YY')<>'XX')

In keeping with the original supplied code I have a predicate which translates NULLs into a value that isn’t the ‘XX’ that we want to exclude. With the histogram in place the optimizer has actually got exactly the right estimate.

So here’s a possible index definition that will allow us to create a very small index that identifies exactly those rows as efficiently as possible:

SQL> create index t2_i1 on t2 (case when nvl(document_type,'YY') <> 'XX' then 1 end);

SQL> execute dbms_stats.gather_table_stats(user,'t2',method_opt =>'for all hidden columns')

SQL> select count(*) from t2 where case when nvl(document_type,'YY') <> 'XX' then 1 end = 1;

  COUNT(*)
----------
       300

Execution Plan
----------------------------------------------------------
Plan hash value: 3503408237

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     2 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     2 |            |          |
|*  2 |   INDEX RANGE SCAN| T2_I1 |   300 |   600 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(CASE  WHEN NVL("DOCUMENT_TYPE",'YY')<>'XX' THEN 1 END =1)

Two things to note: first, after I’ve created the “function-based” index I’ve gathered stats on “all hidden columns”. This is one way of gathering stats on the system-generated, invisible, virtual column that supports the index, and is a step that is often missed until the stats are gathered overnight by the automatic stats collection job at some random point in the future. Secondly, I’ve had to change the query so that the predicate I use is an exact match for the index definition; this is why it’s often nicer to create a virtual column for the expression and index the virtual column – and to keep 3rd party or legacy code safe it’s often a good idea to declare such columns invisible.
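
For completeness, here’s what the virtual column variant might look like for the t2 example – a sketch only: the column name is invented, the invisible keyword needs 12c or later, and I’ve assumed we use the virtual column instead of (not as well as) the function-based index:

drop index t2_i1;

alter table t2 add (
        wanted_row      invisible
        generated always as (case when nvl(document_type,'YY') <> 'XX' then 1 end)
        virtual
);

create index t2_i2 on t2(wanted_row);

execute dbms_stats.gather_table_stats(user,'t2',method_opt=>'for columns wanted_row size 1')

select count(*) from t2 where wanted_row = 1;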

Finally, just to show the efficiency of this strategy, here are a couple of the interesting stats about the index:

SQL> select table_name, index_name, num_rows , leaf_blocks from user_indexes where table_name = 'T2';

TABLE_NAME                INDEX_NAME             NUM_ROWS LEAF_BLOCKS
------------------------- -------------------- ---------- -----------
T2                        T2_I1                       300           1


The table has 50,000 rows, but the index has only 300 entries (that fit in a single leaf block) as the expression is NULL for the other 49,700. As an important bonus it’s fairly safe to assume that there won’t be any code in the system that will decide to use this strangely defined index when it should be using some other index.

Conclusion

We started with a query that was “too slow”. It contained some predicates that would hide any useful statistical information from the optimizer. In two of the three cases we could give the optimizer some useful statistics by creating virtual columns or extended stats on the expressions; and in the third case the structure of the expression was a special “nvl()” case that could simply have been missing a histogram on the underlying column.

Some of the information in the execution plan gave us clues about the scale of the tables – including the table which was accessed by a unique index – but on-site DBAs wouldn’t need to make guesses about some of the numbers I came up with, they could simply query the data dictionary or, for some details, query the data directly.

Reviewing table and column names, and making some assumptions about some of the data distributions (again details that could be extracted from the actual data), I picked a path that would probably be suitable if the required volume of data was relatively small, then demonstrated how we could add in an efficient index that would make this query as efficient as needed without requiring much overhead in index maintenance and without introducing the risk of other queries changing execution plans to use this new index.

December 5, 2022

Row Migration

Filed under: 19c,fragmentation,Infrastructure,Oracle — Jonathan Lewis @ 12:11 pm GMT Dec 5,2022

This is nothing more than a basic update of a note that I wrote 8 years ago. The update was triggered by a brief comment made by Martin Widlake at the recent UKOUG annual conference “Breakthrough 2022” in Birmingham. In his presentation on “wide tables”, he mentioned row migration and the possible effects of a row having to migrate many times as it grew and the possibility (of which he was not certain as he had only a vague memory of hearing the claim at some unspecified time in the past) that it might leave a “long chain of pointers” from the header to the final location of the migrated row.

It occurred to me that the vague memory might have been due to my blog note from 2014 explaining that this doesn’t happen. If a row migrates (i.e. the whole row gets moved to a new location leaving only a header behind pointing to the new location) then at a future point in time it might migrate to a 3rd (or 4th or more) location and update the header pointer, or it might actually migrate back to the original location if space has since become available.

The following script (originally created on 10gR2, but updated for 19.11 and modified for ease of retesting) builds a table, performs a series of updates on a row, and dumps the header block after each update.

rem
rem     Script:         row_migration.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2014 / Dec 2022
rem
rem     Last tested 
rem             10.2.0.5
rem             19.11.0.0
rem

create table t1 (
        id      number(6,0),
        v1      varchar2(1200)
)
pctfree 0
;


insert  into t1 
select  rownum - 1, rpad('x',100) i
from    all_objects i
where   rownum <= 75
;

commit;

prompt  =========================================
prompt  Get the relative file and block number of 
prompt  the block that holds 74 of the 75 rows,
prompt  then dump the block to the trace file.
prompt  =========================================

column rel_file_no new_value m_file_no
column block_no    new_value m_block_no

select 
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no, 
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_starting_in_block
from 
        t1
group by 
        dbms_rowid.rowid_relative_fno(rowid), 
        dbms_rowid.rowid_block_number(rowid) 
having
        count(*) = 74
;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D1_start';
alter system dump datafile &m_file_no block &m_block_no;

prompt  ======================================
prompt  Make the first row migrate and show it
prompt  ======================================

update t1 set v1 = rpad('x',400) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D2_migrate';
alter system dump datafile &m_file_no block &m_block_no;


prompt  ======================================
prompt  Fill the block the long row is now in,
prompt  then make it migrate again
prompt  ======================================

insert  into t1 
select  rownum + 75, rpad('x',100) 
from    all_objects 
where   rownum <= 75
;
commit;

update t1 set v1 = rpad('x',800) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D3_migrate_more';
alter system dump datafile &m_file_no block &m_block_no;

prompt  =======================================================
prompt  Fill the block the long row is in and shrink the row 
prompt  to see if it returns to its original block. (No).
prompt  =======================================================

insert  into t1 
select  rownum + 150, rpad('x',100) 
from    all_objects 
where   rownum <= 75
;
commit;

-- delete from t1 where id between 1 and 20;
-- commit;

update t1 set v1 = rpad('x',50) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D4_shrink_row';
alter system dump datafile &m_file_no block &m_block_no;


prompt  ==============================================
prompt  Make a lot of space in the original block then
prompt  GROW the row again to see if it migrates back.
prompt  ==============================================

delete from t1 where id between 1 and 20;
commit;

update t1 set v1 = rpad('x',1200) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D5_forcemigrate';
alter system dump datafile &m_file_no block &m_block_no;

When the script has run there will be 5 trace files, and a single “grep” command to find the row entry in the dump for the first row of the block (row 0) will give you results like the following:

[oracle@linux19c trace]$ grep -A+3 "row 0" *19012*.trc

or19_ora_19012_D1_start.trc:tab 0, row 0, @0xab
or19_ora_19012_D1_start.trc-tl: 106 fb: --H-FL-- lb: 0x1  cc: 2
or19_ora_19012_D1_start.trc-col  0: [ 1]  80
or19_ora_19012_D1_start.trc-col  1: [100]
--
or19_ora_19012_D2_migrate.trc:tab 0, row 0, @0xab
or19_ora_19012_D2_migrate.trc-tl: 9 fb: --H----- lb: 0x2  cc: 0
or19_ora_19012_D2_migrate.trc-nrid:  0x090000ac.1
or19_ora_19012_D2_migrate.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D3_migrate_more.trc:tab 0, row 0, @0xab
or19_ora_19012_D3_migrate_more.trc-tl: 9 fb: --H----- lb: 0x1  cc: 0
or19_ora_19012_D3_migrate_more.trc-nrid:  0x090000b0.7
or19_ora_19012_D3_migrate_more.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D4_shrink_row.trc:tab 0, row 0, @0xab
or19_ora_19012_D4_shrink_row.trc-tl: 9 fb: --H----- lb: 0x2  cc: 0
or19_ora_19012_D4_shrink_row.trc-nrid:  0x090000b0.7
or19_ora_19012_D4_shrink_row.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D5_forcemigrate.trc:tab 0, row 0, @0x4b9
or19_ora_19012_D5_forcemigrate.trc-tl: 1208 fb: --H-FL-- lb: 0x2  cc: 2
or19_ora_19012_D5_forcemigrate.trc-col  0: [ 1]  80
or19_ora_19012_D5_forcemigrate.trc-col  1: [1200]

  • The D1 trace shows you the row with a column count (cc) of 2, and the two column lengths.
  • The D2 trace shows you a column count of zero, and a nrid (next rowid) pointing to row 1 (2nd row) of block 0x090000ac.
  • The D3 trace shows you a column count of zero, and a nrid pointing to row 7 (eighth row) of block 0x090000b0: the row has moved to a new location and the header now points directly to the new location.
  • The D4 trace shows exactly the same output – after shrinking (even to a length that is less than it started at) the row has not moved back to the original location.
  • The D5 trace shows that the row has now moved back to its original location, even though it is now several hundred bytes longer than it used to be.

If you’re wondering why the row didn’t move back after shrinking at D4 (and even when I made a lot of space available in the original block the shrink didn’t cause a move), remember that Oracle tries to be “lazy” – the update can take place in situ, so Oracle doesn’t waste time and effort checking the original block.

Footnote

This note makes no claims about what might happen in a more complex case where a row is so long that it splits into multiple row pieces and the pieces end up scattered across multiple blocks. There are a couple of variations on that problem that might be worth investigating if you suspect that there is some behaviour of very wide tables or very long rows that is the source of a performance problem relating to excessive buffer gets or db file sequential reads.
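
If you do suspect that continued-row fetches (whether from chained or migrated rows) are part of such a problem, the session statistic to keep an eye on is “table fetch continued row”. A minimal sketch of the check, run from the session doing the work, might be the following – take a snapshot of the value before and after the suspect statement and look at the difference:

select
        sn.name, ms.value
from
        v$mystat        ms,
        v$statname      sn
where
        sn.statistic# = ms.statistic#
and     sn.name       = 'table fetch continued row'
/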

November 28, 2022

Hakan Factor

Filed under: Infrastructure,Oracle,Performance — Jonathan Lewis @ 3:14 pm GMT Nov 28,2022

There’s a question on the MOSC forum (needs an account) at present that started with the performance of the datapump API over a database link but moved on to the topic of how to handle a scenario that I’ve described in the past: a table where rows are initially short and eventually become much longer, and a requirement comes up to rebuild the table.

In this case the OP has to use datapump (selecting truncate as the “action on existence”) to copy the table data from one place to another rather than doing the more common ‘alter table move’ variant of rebuilding the table.

The underlying problem in this case is that:

  • the table has 84 columns made up of (pk_col1, pk_col2, flag, change_date) plus 20 groups of 4 “value” columns.
  • rows are inserted with just the four main columns and the first group of four values.
  • over time each subsequent group of 4 values in a row is updated in a separate statement

We haven’t been given numbers but a row probably ends up taking about 10 times the space it started with – and if that’s the case you would normally need to set the table’s pctfree to something like 90 to avoid getting a lot of migrated rows in the table. But that’s not the whole of the story.
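
For reference, that setting is just a physical attribute of the table; the generic approach (which isn’t open to the OP, who has to go through datapump) would be a sketch like the following, using the OP’s table name:

alter table test_tab pctfree 90;
alter table test_tab move;      -- rebuild so existing rows are repacked with 90% free space per block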

Things to go wrong

If you don’t set pctfree to 90 you get lots of migrated rows. If you then do an export (expdp) in direct_path mode expdp will do a large number of single block reads following the migrated rows, and Oracle won’t cache the follow-on blocks, so you may re-read them several times in the course of reading one block in the direct path tablescan. (For cached reads the codepath for a tablescan will simply ignore the “head pointer” to a migrated row because it “knows” that it will find the whole row in some other block eventually.)

Update: the OP was using the PL/SQL API to control the import, and there didn’t seem to be an option for avoiding the direct path select at the opposite end of the link. In fact this was an omission in the PL/SQL Packages reference but there is a command line option access_method=insert_as_select and this is available in the API through a call to:

BEGIN
        SYS.DBMS_DATAPUMP.SET_PARAMETER(
                handle => :JOBHNDL,
                name   => 'data_access_method',
                value  => 'INSERT_AS_SELECT'
        );
END;

My thanks to Daniel Overby Hansen for pointing this out and arranging for an update to the documentation for 23c.
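
For context, a sketch of how that call might fit into a minimal API sequence is shown below – the database link name and job name are invented, error handling is omitted, and the filter/remap calls a real job would need are only hinted at in a comment:

declare
        jobhndl         number;
        job_state       varchar2(30);
begin
        jobhndl := dbms_datapump.open(
                operation   => 'IMPORT',
                job_mode    => 'TABLE',
                remote_link => 'SOURCE_DB_LINK',        -- hypothetical db link name
                job_name    => 'IMP_TEST_TAB'           -- hypothetical job name
        );

        -- ask for a conventional insert as select rather than direct path
        dbms_datapump.set_parameter(
                handle => jobhndl,
                name   => 'data_access_method',
                value  => 'INSERT_AS_SELECT'
        );

        -- (metadata filters, table_exists_action etc. would be set here)

        dbms_datapump.start_job(jobhndl);
        dbms_datapump.wait_for_job(jobhndl, job_state);
end;
/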

If you do set pctfree to 90 then when you rebuild the table (or recreate it with pctfree set to 90) you end up with a much larger table with lots of blocks that are only 10% used, because most of the rows are now big and aren’t going to grow any more.

Best strategy – the Hakan factor.

Work out how many rows in their final state will fit into a block and recreate the table telling Oracle that that’s the maximum number of rows it’s allowed to put in a block. (You could also set pctfree to zero at the same time to minimise the chance of Oracle inserting fewer rows than your target.)

The devil, of course, is in the detail. Part of the devilry comes from a bug that was fixed as long ago as 10.2.0.1. Part comes from the fact that Oracle doesn’t give us a documented API to set the magic number – we have to find a way to teach Oracle about the number or hack the data dictionary. Part, inevitably, comes from the fact that when dealing with undocumented (or barely documented) mechanisms you ought to set up some test cases to check that the latest version of Oracle behaves the same way as your previous versions did when you’re playing dirty tricks.

Part 1 – Teaching Oracle.

You may know your data so well that you can immediately say how many “full-length” rows should fit in a block. If you can’t do this you could simply create a copy of the original table structure with a pctfree of zero, then copy into it a few hundred rows from the original table using a predicate to limit the selected rows to ones that would not be updated any further. For example (using the table definition supplied by the OP) you might say:

create table test_tab_clone 
pctfree 0 
as 
select  * 
from    test_tab 
where   rownum = 0
/

insert into test_tab_clone 
select  * 
from    test_tab
where   rownum <= 400 
and     fourthvalue19 is not null
/

commit
/

I’m assuming in this case that column “fourthvalue19” will be non-null only if the whole of the 19th set of values is populated and all the other sets of values are populated. From the OP’s perspective there may be a more sensible way of identifying fully populated rows. You do need to ensure that the table has at least one full block, otherwise some odd things can happen when you try to set the Hakan factor.

Once you’ve got a small table of full size rows a simple analysis of rows per block is the next step:

select
        rows_starting_in_block,
        count(*)        blocks
from
        (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_starting_in_block
        from
                test_tab_clone
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_starting_in_block
order by
        rows_starting_in_block
/

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                     3          1
                    18         22
                    19          1
                       ----------
sum                            24

Looking at these results I can see that there’s a slight variation in the number of rows that could be crammed into a block – and one block which holds the last few rows of my insert statement which I can ignore. In a more realistic case you might need to tweak the selection predicate to make sure that you’ve picked only full-size rows; or you might simply need to decide that you’ve got a choice of two or three possible values for the Hakan factor and see what the results are from using them.

With the same figures above I’d be strongly inclined to set a Hakan factor of 18. That does mean I might be “wasting” roughly 1/19th of every block (for the relatively rare cases where a 19th row would have fitted) but it’s better than setting the Hakan factor to 19 and finding that roughly 1 row in every 19 migrates in the 22 blocks out of 23 where I should have restricted the number of rows per block to 18; the choice is not always that straightforward.

So here’s how we now “train” Oracle, then test that it learned the lesson:

truncate table test_tab_clone;
insert into test_tab_clone select * from test_tab where rownum <= 18;
alter table test_tab_clone minimize records_per_block;

truncate table test_tab_clone;
insert into test_tab_clone select * from test_tab where rownum <= 10000;

start rowid_count test_tab_clone

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                    10          1
                    18        555
                       ----------
sum                           556

In the first three statements I’ve emptied the table, inserted 18 rows (I ought to check they all went into the same block, really) and set the Hakan factor.

Once the Hakan factor is set I’ve emptied the table again then populated it with the “full” data set. In fact for demo purposes I’ve copied exactly 10,000 rows so that we can see that every block (except, we can safely assume, the last one written to) has acquired exactly 18 rows.
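
The rowid_count script invoked above isn’t reproduced in this note – it’s just the earlier rows-per-block query wrapped up to take the table name as a parameter, along the lines of the following sketch (not the original script):

rem     rowid_count.sql  (sketch, not the original)
rem     Usage:  start rowid_count {table_name}

define m_table = &1

break on report
compute sum of blocks on report

select
        rows_starting_in_block,
        count(*)        blocks
from    (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_starting_in_block
        from
                &m_table
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_starting_in_block
order by
        rows_starting_in_block
/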

Part 2 – applying the strategy

It’s often easy to sketch out something that looks as if it’s exactly what you need, but there are always peripheral considerations that might cause problems, and an important part of examining a problem is to consider the overheads and penalties. How, for example, is our OP going to apply the method in production?

There are two problems:

  • It’s a large table, and we’re cloning it because we can’t hack directly into the data dictionary to modify the table directly. What are the side effects?
  • We want the imported export to acquire the same Hakan factor. Do we have to take any special action?

The import is the simpler problem to consider since it’s not open-ended. As far as impdp is concerned we could import “data_only” or “include_metadata”, and the “table_exists_action” could be either replace or truncate, so there are only 4 combinations to investigate.

The bad news is that none of the options behaves nicely – impdp (tested on 19.11.0.0) seems to import the data then execute the “minimize records_per_block” command when really it should transfer the Hakan factor before importing the data. So it seems to be necessary to go through the same convoluted steps at least once to precreate a target table with the desired Hakan factor and thereafter use only the truncate option for the import if you want to make the target behave in every way like the source. (Even then you will need to watch out for extreme cases if the export holds fewer rows than the value you’ve set for the Hakan factor – with the special case that if the exported table is empty the attempt by the import to set the Hakan factor raises error “ORA-28603: statement not permitted on empty tables”.)
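
(If you’re driving the import through the PL/SQL API, as the OP is, the truncate option corresponds to a set_parameter() call of the following form – a sketch that assumes the job handle has already been opened:)

begin
        dbms_datapump.set_parameter(
                handle => :jobhndl,
                name   => 'TABLE_EXISTS_ACTION',
                value  => 'TRUNCATE'
        );
end;
/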

Let’s get back to the side effects of our cloning exercise on the source table. We’ve created a copy of the original data with a suitable Hakan factor so that blocks holding “completed” rows are full, blocks holding “in-transit” rows have enough free space for their rows to grow to the “completed” size, and there are no migrated rows – and we don’t expect to see migrated rows in the future. But it’s not the right table, and to ensure we had a complete copy we would have stopped all processing of the source table.

Could we have avoided the stoppage? Maybe we could use the dbms_redefinition package – the OP is running Standard Edition so can’t do online redefinition any other way – and use the Hakan hack mechanism on the “interim” table immediately after creating it.
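
A sketch of that idea – with invented names, no error handling, and glossing over the copying of indexes, constraints, triggers etc. (normally handled by copy_table_dependents()) – might run along the following lines:

rem     Assumption: an empty interim table test_tab_interim has been
rem     created with the same structure as test_tab, loaded with 18
rem     "full" rows, given its Hakan factor with
rem     "alter table test_tab_interim minimize records_per_block",
rem     and then truncated - all before the redefinition starts.

begin
        dbms_redefinition.start_redef_table(
                uname      => user,
                orig_table => 'TEST_TAB',
                int_table  => 'TEST_TAB_INTERIM'
        );

        dbms_redefinition.finish_redef_table(
                uname      => user,
                orig_table => 'TEST_TAB',
                int_table  => 'TEST_TAB_INTERIM'
        );
end;
/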

If we find that the online redefinition mechanism generates too much undo and redo we’ll have to use the blocking method – but then we have to do some table renaming and worry about PL/SQL packages becoming invalid, and foreign key constraints, synonyms, views etc. being associated with the wrong table.

So even though we can sketch out an outline strategy there are still plenty of details to worry about around the edges. To a large degree this is because Oracle has not yet made the Hakan factor a “proper” property of a table that you can explicitly set in a “move” or “create table” operation. There is a function embedded in the executable (kkdxFixTableHAKAN) that looks as if it should set the Hakan factor, and there is presumably some piece of code that sets the Hakan factor when you execute a call to “create table for exchange”; it would be nice if there were an API that was visible to DBAs.
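
(For anyone who hasn’t met it, the “create table for exchange” syntax appeared in 12.2 and looks like the sketch below – table names invented; it creates a table whose column properties, including the internal ones, match the named partitioned table so that a partition exchange won’t fail:)

create table test_tab_xchg
for exchange with table test_tab_part
/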

Summary

If you have a table where rows grow significantly over their lifetime, you ought to ensure that you’ve set a suitable pctfree for the table. But if you anticipate copying or moving the table at any time then there’s no way to pick a pctfree that is good for all stages of the data’s lifetime.

There is a feature that you can impose on the data to avoid the problems of extreme changes in row length, and it’s fairly straightforward to impose on a single table; but there is no API available to manipulate the feature directly, and if you don’t anticipate the need during the initial design stage then applying the feature after the event can be an irritating and resource-intensive operation.

Footnote

For those not familiar with it, the Hakan Factor was introduced by Oracle to allow a little extra efficiency in the compression and use of bitmap indexes. If Oracle has information about the largest number of rows that can appear in any block in a table it can minimise the number of bits needed per block (space saving) and avoid having to expand and compare unnecessarily long sequences of zero bits when comparing entries across bitmap indexes. Given their intended use it should come as no surprise that you can’t call “minimize records_per_block” for a table that has an existing bitmap index.

November 21, 2022

Row_number() sorts

Filed under: Oracle,Troubleshooting,Tuning,Upgrades — Jonathan Lewis @ 5:47 pm GMT Nov 21,2022

An email on the Oracle-L list server a few days ago described a performance problem that had appeared after an upgrade from 11.2.0.4 to 19c (19.15). A long running statement (insert as select, running parallel 16) that had run to completion in 11g using about 20GB of temporary space (with 50GB read and written) had failed after running for a couple of hours in 19c and consuming 2.5 TB of temporary space even when the 11g execution plan was recreated through an SQL Profile.

When I took a look at the SQL Monitor report for 19c it turned out that a large fraction of the work done was in an operation called WINDOW CHILD PUSHED RANK which was there to deal with a predicate:

row_number() over(partition by t.ds_no, t.c_nbr order by c.cpcl_nbr desc) = 1

Checking the successful 11g execution, this operation had taken an input rowsource of 7 billion rows and produced an output rowsource of 70 million rows.

Checking the SQL Monitor report for the failed executions in 19c the “pure” 19c plan had reported 7 billion input rows, 6GB memory allocated and 1TB of temp space at the same point, the plan with the 11g profile had reported 10 billion input rows, but the operation had not yet reported any output rows despite reporting 9GB as the maximum memory allocation and 1TB as the maximum temp space usage. (Differences in row counts were probably due to the report being run for different dates.)

So, the question to the list server was: “is this a bug in 19c?”

Modelling

It’s a little unfortunate that I couldn’t model the problem in 19c at the time because my 19c VM kept crashing; but I built a very simple model to allow me to emulate the window sort and row_number() predicate in an 11g instance, then re-played the model in an instance of 21c.

For the model data I took 50 copies of the first 50,000 rows from view all_objects to produce a table of 2,500,000 rows covering 35,700 blocks and 279 MB (55,000 blocks / 430 MB in 21c); then I ran the query below and reported its execution plan with a basic call to dbms_xplan.display_cursor():

select
        /*+ dynamic_sampling(0) */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

-------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       | 29491 (100)|          |
|   1 |  SORT GROUP BY             |      |     8 |   184 |       | 29491   (9)| 00:02:28 |
|   2 |   VIEW                     |      |  2500K|    54M|       | 28532   (6)| 00:02:23 |
|*  3 |    VIEW                    |      |  2500K|   112M|       | 28532   (6)| 00:02:23 |
|*  4 |     WINDOW SORT PUSHED RANK|      |  2500K|    95M|   124M| 28532   (6)| 00:02:23 |
|   5 |      TABLE ACCESS FULL     | T1   |  2500K|    95M|       |  4821   (8)| 00:00:25 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY
              INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Oracle 21c produced the same execution plan – though the row estimate for the VIEW operations (numbers 2 and 3) was a more realistic 46,236 (num_distinct recorded for object_name) compared to the unchanged 2,500,000 from 11g. (Of course it should have been operation 4 that showed the first drop in cardinality.)

With my first build, the timings weren’t what I expected: in 21c the query completed in 3.3 seconds, in 11g it took 11.7 seconds. Most of the difference was due to a large (55MB) spill to temp space that appeared in 11g but not in 21c. This would have been because my 11g session wasn’t allowed a large enough PGA, so I set the workarea_size_policy to manual and the sort_area_size to 100M, which looks as if it should have been enough to cover the 11g requirement – it wasn’t, and I had to grow the sort_area_size to 190 MB before the 11g operation completed in memory, allocating roughly 155MB. By comparison 21c reported an increase of only 19MB of PGA to run the query, claiming that it needed only 4.7MB to handle the critical operation.
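
For reference, the manual workarea settings referred to are just the following pair of session parameters (the 199,229,440 figure is the 190MB that finally allowed the 11g run to complete in memory; the same commands appear, commented out, in the script further down):

alter session set workarea_size_policy = manual;
alter session set sort_area_size = 199229440;          -- 190MB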

For comparison purposes here are the two run-time execution plans, with rowsource execution stats (which messed the timing up a little) and the column projection information;

Results for 11g

-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:03.96 |   35513 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:03.96 |   35513 |  3072 |  3072 | 2048  (0)|
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:04.07 |   35513 |       |       |          |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:03.93 |   35513 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|   1454K|00:00:08.82 |   35513 |   189M|  4615K|  168M (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:10.85 |   35513 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) "OWNER"[VARCHAR2,30], MAX("OBJECT_NAME")[30]
   2 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30]
   3 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "ORANK"[NUMBER,22]
   4 - (#keys=2) "OBJECT_NAME"[VARCHAR2,30], INTERNAL_FUNCTION("OBJECT_TYPE")[19], "OWNER"[VARCHAR2,30], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "OBJECT_TYPE"[VARCHAR2,19]

It’s an interesting oddity, and possibly a clue about the excess memory and temp space, that the A-Rows column for the Window Sort operation reports 1,454K rows output when it surely ought to be the final 45,982 at that point. It’s possible to imagine a couple of strategies that Oracle might be following to do the window sort that would result in the excess volume appearing, and I’ll leave it to the readers to investigate that.

Results for 21c

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:02.98 |   54755 |  54750 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:02.98 |   54755 |  54750 |  5120 |  5120 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |  5297K|   950K| 4708K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.28 |   54755 |  54750 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1; rowset=256) "OWNER"[VARCHAR2,128], MAX("OBJECT_NAME")[128]
   2 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128]
   3 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "ORANK"[NUMBER,22]
   4 - (#keys=2; rowset=256) "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23], "OWNER"[VARCHAR2,128], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23]

In this case we see the A-rows from the Window Sort meeting our expectations – but that may be a beneficial side effect of the operation completing in memory.

Given the dramatically different demands for memory for a query that ought to do the same thing in both versions it looks as if 21c may be doing something clever that 11g doesn’t do, or maybe doesn’t do very well, or maybe tries to do but has a bug that isn’t dramatic enough to be obvious unless you’re looking closely.

Modelling

Here’s a script that I used to build the test data, with scope for a few variations in testing. You’ll notice that the “create table” includes an “order by” clause that is close to the sorting requirement of the over() clause that appears in the query. The results I’ve shown so far were for data that didn’t have this clause in place.

rem
rem     Script:         analytic_sort_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested
rem             21.3.0.0
rem             19.11.0.0
rem             12.2.0.1
rem             11.2.0.4
rem

create table t1 nologging 
as
select 
        ao.*
from
        (select * from all_objects where rownum <= 50000) ao,
        (select rownum from dual connect by rownum <= 50)
order by
        object_name, object_type -- desc
/

--
--      Stats collection to get histograms
--

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 254'
        );
end;
/

--
-- reconnect here to maximise visibility of PGA allocation
--

connect xxxxxxxx/xxxxxxxx

set linesize 180
set trimspool on
set tab off

-- alter session set workarea_size_policy = manual;
-- alter session set sort_area_size = 199229440;

alter session set events '10046 trace name context forever, level 8';
-- alter session set statistics_level = all;
-- alter session set "_rowsource_execution_statistics"= true;

spool analytic_sort_2

select
        /*  monitoring */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

select * from table(dbms_xplan.display_cursor(format=>'cost bytes allstats last projection'));

alter session set events '10046 trace name context off';
alter session set "_rowsource_execution_statistics"= false;
alter session set statistics_level = typical;
alter session set workarea_size_policy = auto;

spool off

The results I’m going to comment on now are the ones I got after running the script with the order by clause in place, then reconnecting and flushing the shared pool before repeating the second half of the script (i.e. without recreating the table).

In 11g, going back to the automatic workarea sizing the session used 37MB of memory and then spilled (only) 3MB to temp. The run time was approximately 3 seconds – which is a good match for the “unsorted” 21c run time. As with the original tests, the value reported in A-rows is larger than we would expect (in this case suspiciously close to twice the correct values – but that’s more likely to be a coincidence than a clue). Interestingly, when I switched to the manual workarea_size_policy and set the sort_area_size to 190MB Oracle said “that’s the optimum memory” and used nearly all of it to complete in memory – for any value less than that (even down to 5MB) Oracle spilled just 3 MB to disk in a one-pass operation. So it looks as if Oracle “knows” it doesn’t need to sort the whole data set, but still uses as much memory as is available to do something before it starts to get clever.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:01.76 |   35523 |   2145 |    331 |       |       |          |         |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:01.76 |   35523 |   2145 |    331 |  2048 |  2048 | 2048  (0)|         |
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:02.00 |   35523 |   2145 |    331 |       |       |          |         |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:01.83 |   35523 |   2145 |    331 |       |       |          |         |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|  57171 |00:00:02.10 |   35523 |   2145 |    331 |  2979K|   768K|   37M (1)|    3072 |
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:11.84 |   35513 |   1814 |      0 |       |       |          |         |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In 21c there’s essentially no difference between the sorted and unsorted tests, which suggests that with my data the session had been able to apply its optimisation strategy at the earliest possible moment rather than waiting until it had no alternative but to spill to disc.

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:00.98 |   54753 |  54748 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:00.98 |   54753 |  54748 |  4096 |  4096 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |  5155K|   940K| 4582K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.42 |   54753 |  54748 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Bug description: possibly

Given the way that 11g reports a very small spill to disc, which stays fairly constant in size no matter how large or small the available PGA allocation is, when the input data is sorted to help the over() clause, and given how large the spill to disc can become when the data is not sorted, I feel that Oracle has an optimisation that discards input rows early in the analytic window sort. But we also have some evidence of a flaw in the code in versions prior to 21c that means Oracle fails to re-use memory that becomes available from rows that have been discarded.

This means the OP’s problem may have been just bad luck in terms of available memory and (relatively) tiny variations in demands for space between the 11g and 19c instances perhaps due to differences in the quantity or distribution of data.

Although the impact was dramatic in this case, a query that is supposed to return 70 million rows (irrespective of how many it starts with) is an extreme case, and one that deserves a strong justification and a significant investment in time spent on finding cunning optimisation strategies.

So maybe this is a bug that doesn’t usually get noticed that will go away on an upgrade to 21c; and maybe there’s a backport and patch already available if you can find a bug number in the 21c patch release notes.

Strategy

I’ve said in the past that if you’re using analytic functions you ought to minimise the size of the data you’re processing before you apply the analytic part. Another step that can help is to make sure you’ve got the data into a (fairly well) sorted order before you reach the analytic part.

In the case of versions of Oracle prior to 21c, it also seems to make sense (if you can arrange it) to minimise the amount of memory the session is allowed to use for a sort operation, as this will reduce the CPU used by the session and avoid grabbing excess redundant memory that could be used more effectively by other sessions.

Addendum

Just before publishing I found a way of keeping my 19.11.0.0 instance alive long enough to run the tests, then also ran them on an instance of 12.2.0.1. Both versions showed the same pattern of doing a large allocation of memory and large spill to disc when the data was not sorted, and a large allocation of memory but a small spill to disc when the data was sorted.

As a little sanity check I also exported the 19c data and imported it to 21c in case it was a simple variation in the data that allowed 21c to operate more efficiently than 19c. The change in data made no difference to the way in which 21c handled it: in both cases it called for a small allocation of memory with no spill to disc.
