Oracle Scratchpad

February 1, 2023

Exchange Partition

Filed under: Indexing,Infrastructure,Oracle,Partitioning,unfinished — Jonathan Lewis @ 3:04 pm GMT Feb 1,2023

A question appeared recently in a comment on a blog note I wrote about the new (in 12c) deferred global index maintenance that allowed the execution of a drop partition to return very quickly. The question was in two parts:

  • Why is an exchange partition so slow by comparison?
  • Is there any way to make it quick?

It occurred to me that it might be possible to do something for some cases by taking advantage of the partial indexing feature that also appeared in 12c. It wouldn’t eliminate the need for index maintenance, of course, and you clearly couldn’t avoid the insert maintenance for the incoming partition completely … but you might be able to ensure that it happens only at a time you find convenient.

There’s a little more detail in my reply to the comment, but this note is a place holder for the article I’ll write if I ever find time to test the idea.

July 18, 2022

drop t/s bug

Filed under: Bugs,LOBs,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 9:17 am BST Jul 18,2022

A recent thread on the MOS database admin forum (needs an account) demonstrated a number of little issues with bugs, debugging, complex syntax, and the never-ending list of “not quite complete” code that shows up when features (in this case LOBs and partitioning) collide.

It’s a silly little thing, but one to remind you that you always have to ask “What have I forgotten?”, “Does my test suite include every case I need to test?”

In this case we had been given a model where, after creating a composite partitioned table with a LOB column using a statement that referenced 3 different tablespaces, the OP had

  1. moved one component of this complex structure to a fourth tablespace
  2. dropped the tablespace that had been defined as the holder of the moved component
  3. renamed the fourth tablespace to match the name of the dropped tablespace
  4. called dbms_metadata.get_ddl() to generate a new table definition.

The result of this was that the generated statement included a reference to a tablespace with the unexpected name of “_$deleted$51$0”.

Sidenote: The name is an example of the value stored in ts$.name when you drop a tablespace. The row in ts$ is not deleted (or marked for deletion); instead ts$.online$ is set to 3 and ts$.name is set to reflect the tablespace number with a name of the format “_$deleted${ts$.ts#}$0”.
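
A quick sanity check (as SYS) – something like the following should list any dropped tablespaces still recorded in ts$:

select  ts#, name, online$
from    sys.ts$
where   online$ = 3
;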

Here, with some cosmetic changes to the “create table” statement and with a preamble to create enough tablespaces and quotas, is the model supplied by the OP (if you’re thinking of running it make sure you read it carefully first):

rem
rem     Script:         drop_ts_pt_bug.sql
rem     Author:         Jean-François56 / Jonathan Lewis
rem     Dated:          July 2022
rem
rem     Last tested 
rem             19.11.0.0
rem             11.2.0.4
rem 

connect sys/sys as sysdba

prompt  =================================================
prompt  Clearing the stage - some steps may report errors
prompt  =================================================

drop table test_user.test purge;

drop tablespace jf1 including contents and datafiles;
drop tablespace jf2 including contents and datafiles;
drop tablespace jf3 including contents and datafiles;
drop tablespace jf4 including contents and datafiles;

create tablespace jf1 datafile size 100m;
create tablespace jf2 datafile size 100m;
create tablespace jf3 datafile size 100m;
create tablespace jf4 datafile size 100m;

alter user test_user quota unlimited on jf1;
alter user test_user quota unlimited on jf2;
alter user test_user quota unlimited on jf3;
alter user test_user quota unlimited on jf4;

prompt  =======================
prompt  Connecting to test user
prompt  =======================

connect test_user/test

drop table test purge;

create table test(
       idarchive                number(10,0),
       data                     blob,
       partition_date           date,
       customer                 number(10,0),
       prefix_archive_key       varchar2(5)
)
partition by range (partition_date)
subpartition by list (customer)
(
partition p1 
        values less than (to_date('2008-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian'))
        tablespace jf1
        lob (data) store as basicfile (tablespace jf2 chunk 16384)
        (
        subpartition sp values (1) 
                tablespace jf3
                lob (data) store as basicfile (tablespace jf2)
        )
);

alter table test move subpartition sp lob(data) store as (tablespace jf4);


connect sys/sys as sysdba

drop tablespace jf2 including contents and datafiles;
alter tablespace jf4 rename to jf2;


connect test_user/test

set long 20000
set longchunksize 20000

set linesize 132
column text_line format a128

begin
        dbms_metadata.set_transform_param(dbms_metadata.session_transform,'PRETTY',             true);
        dbms_metadata.set_transform_param(dbms_metadata.session_transform,'TABLESPACE',         true);
        dbms_metadata.set_transform_param(dbms_metadata.session_transform,'SEGMENT_ATTRIBUTES', true);
        dbms_metadata.set_transform_param(dbms_metadata.session_transform,'STORAGE',            false);
end;
/

select dbms_metadata.get_ddl('TABLE','TEST') text_line from dual
/


The first thing I did after reading the posting was a quick search on MOS, using the four search terms: drop tablespace rename deleted. It was a lucky choice because on the first page of results from the Knowledge Base I found:

Renaming a Tablespace to An Already Dropped one Changes The Tablespace Name To "_$DELETED" (Doc ID 1937848.1)

Conveniently the notes in this document said: “This is caused by bug 18136584”, and that bug (and the problem) is labelled as “The bug is fixed in 12.2”. Unfortunately the OP was running 11.2.0.4, but that’s okay because the note also said: “Backport is feasible. However, A simple workaround is available.” So I thought I’d create the test case and check the workaround – which is why I’ve got the script.

It just so happened that I prepared, debugged and ran the script (without the workaround) on 19.11 before bothering to start up a VM with 11.2.0.4 – and I got the following output from my call to dbms_metadata.get_ddl():

  CREATE TABLE "TEST_USER"."TEST"
   (	"IDARCHIVE" NUMBER(10,0),
	"DATA" BLOB,
	"PARTITION_DATE" DATE,
	"CUSTOMER" NUMBER(10,0),
	"PREFIX_ARCHIVE_KEY" VARCHAR2(5)
   ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
  TABLESPACE "TEST_8K_ASSM"
 LOB ("DATA") STORE AS SECUREFILE (
  ENABLE STORAGE IN ROW CHUNK 8192
  NOCACHE LOGGING  NOCOMPRESS  KEEP_DUPLICATES )
  PARTITION BY RANGE ("PARTITION_DATE")
  SUBPARTITION BY LIST ("CUSTOMER")
 (PARTITION "P1"  VALUES LESS THAN (TO_DATE(' 2008-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255
  TABLESPACE "JF1"
 LOB ("DATA") STORE AS BASICFILE (
  TABLESPACE "_$deleted$36$0" ENABLE STORAGE IN ROW CHUNK 16384 RETENTION
  NOCACHE LOGGING )
 ( SUBPARTITION "SP"  VALUES (1) SEGMENT CREATION IMMEDIATE
 LOB ("DATA") STORE AS BASICFILE (
  TABLESPACE "JF2" )
  TABLESPACE "JF3"
 NOCOMPRESS ) )

If you check the TABLESPACE clause for the partition-level LOB (the “_$deleted$36$0” line) you’ll see that even in 19.11 you can end up with a generated statement that references “deleted” tablespaces – and here’s a funny little side effect (cut-n-paste from SQL*Plus – but your deletion number will probably be different):

SQL> drop table test;
drop table test
           *
ERROR at line 1:
ORA-00959: tablespace '_$deleted$36$0' does not exist

SQL> drop table test purge;

Table dropped.

So (part of) the bug is still present in 19.11; and on the plus side that means I can examine the workaround to see what it is and how it works. This is what the note says: “Execute below command once you finished with dropping and renaming the tablespace”:

alter table <table_name> modify default attributes tablespace <old_tablespace_name>;

That’s not actually going to do anything for us – but it’s an important clue to how we might be able to fix things; it also suggests why the bug “fixed” in 12.2 isn’t quite fixed in 19c – someone missed a bit of the code path: maybe the bit about LOBs, or maybe the bit about composite partitioned tables, or maybe (very precisely) the bit about partition-level default values for LOBs in composite partitioned tables. (And who knows what might be missing if we start looking at index-organized tables, nested table, and other complicated structures.)

Taking the clue from the suggested workaround, here are three possible fixes to try:

alter table test modify default attributes tablespace jf2;

alter table test modify default attributes                  lob(data) (tablespace jf2);

alter table test modify default attributes for partition p1 lob(data) (tablespace jf2);

The third of these options is the one that “works” – and the word is in quote marks because all I mean is that the generated SQL uses JF2 as the tablespace name rather than the “deleted” tablespace name – I make no guarantee about how future behaviour might vary from past behaviour after this change, and it’s likely to depend on exactly how you’re expecting to add and move partitions and subpartitions anyway.

Follow-up

The problem (which means, possibly, the omitted code path) comes from the need for handling default storage clauses. When we’re handling composite partitioning the only segments that come into existence are subpartition segments – but you can still specify physical (and logical) storage information at the table and partition level with the inference (possibly not stated explicitly) that any table-level metadata should be the default metadata for the partition level and any partition-level metadata should be the default metadata for subpartitions. What does this mean in terms of our original table creation script and the subsequent call to dbms_metadata?

You’ll notice that I’ve highlighted lines 8 – 11 in the output above from dbms_metadata.

  • The table-level TABLESPACE clause references test_8k_assm: I didn’t include a default tablespace at the table level for the table segments, but that’s the tablespace that happened to be my default tablespace when I ran the script.
  • The table-level LOB clause defines a default LOB storage clause with no specified tablespace and using securefiles (which is the default LOB storage for 19c). Again I didn’t specify anything about a table-level LOB in my original definition.
  • The rest of the generated definition has, apart from the “deleted” tablespace, reproduced my original definition – including the 16KB declaration of chunk size for the partition and the lack of specified chunksize for the subpartition.

So questions to think about:

  • what chunk size is / would be used in the subpartition – is it silently picking up the value specified for the partition, or is it silently picking up the default for the table, or is it simply using the “absolute” default of 1 block?
  • what happens if I execute a simple “add subpartition” on the existing p1 partition? Where will the subpartition be stored and what will its storage details look like?
  • What will I see if I execute a simple “add partition” to add a new partition to the table? Will I also get a physical subpartition and, if so, where will it be and what will its storage clause look like?
  • What would the dbms_metadata output have looked like if I had had a table-level LOB definition that specified tablespace jf2?
  • What side effects might appear if I extended the definition to interval partitioning, with automatic list subpartitions, and inserted a row that needed a new partition and subpartition?!

Underlying all these detailed questions, of course, is the specification for the maintenance work that the DBA is expected to handle, viz:

  • what is the defined strategy for adding new partitions and subpartitions to the table?
  • what is the strategy for aging out old partitions and subpartitions?
  • are there any plans for grouping partitions into “age-related” tablespaces?
  • are tablespaces going to be renamed to transport them to another database?

It’s possible that the anomaly in this note only showed up because the OP was experimenting with options, and maybe the ultimate production code will be based on a strategy that means the anomaly will never appear. It’s possible that the anomaly is already in the production system but only became visible when someone decided to think about archiving out old partitions and the archival code started raising errors. Playing around with models to discover what happens is time well spent; and modelling the full production life cycle before going live is a critical activity.

Some answers

To find out what we’ve actually got from the original create table statement we can query the views (user/all/dba/cdb):

USER_TABLES
USER_PART_TABLES
USER_TAB_PARTITIONS
USER_TAB_SUBPARTITIONS

USER_LOBS
USER_PART_LOBS
USER_LOB_PARTITIONS
USER_LOB_SUBPARTITIONS

USER_SEGMENTS
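
For the details reported below, the relevant queries might look like this (a sketch, restricted to the columns this note cares about and the table from the script above):

select  partition_name, chunk, tablespace_name
from    user_lob_partitions
where   table_name = 'TEST'
;

select  lob_partition_name, subpartition_name, chunk, tablespace_name
from    user_lob_subpartitions
where   table_name = 'TEST'
;

select  column_name, def_chunk, def_tablespace_name
from    user_part_lobs
where   table_name = 'TEST'
;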

For the chunk sizes we find: user_lobs.chunk = 8192, user_lob_partitions.chunk = 2 (blocks?) and user_lob_subpartitions.chunk = 16,384 (bytes?). We also have user_part_lobs.def_chunk = 1 (block?). So the explicit p1 partition chunk size has cascaded down to its subpartition.

Side note: who needs consistency! You might also want to remember that when you call the dbms_space.space_usage() procedure that appeared in 11g to report the space used in a (secure file) LOB, the “blocks” output parameters actually return chunks and (for CLOBs) the “bytes” output parameters actually return character counts.
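
To illustrate that inconsistency, here’s a sketch of a call to the securefile overload of dbms_space.space_usage() – the segment name is hypothetical (you’d take it from user_lobs.segment_name) and the comments record the behaviour described above:

declare
        l_segment_size_blocks   number;
        l_segment_size_bytes    number;
        l_used_blocks           number;
        l_used_bytes            number;
        l_expired_blocks        number;
        l_expired_bytes         number;
        l_unexpired_blocks      number;
        l_unexpired_bytes       number;
begin
        dbms_space.space_usage(
                segment_owner       => user,
                segment_name        => 'MY_LOB_SEGMENT',        -- hypothetical: see user_lobs.segment_name
                segment_type        => 'LOB',
                segment_size_blocks => l_segment_size_blocks,
                segment_size_bytes  => l_segment_size_bytes,
                used_blocks         => l_used_blocks,           -- actually reported in chunks
                used_bytes          => l_used_bytes,            -- for CLOBs, actually a character count
                expired_blocks      => l_expired_blocks,
                expired_bytes       => l_expired_bytes,
                unexpired_blocks    => l_unexpired_blocks,
                unexpired_bytes     => l_unexpired_bytes
        );

        dbms_output.put_line('used_blocks (chunks): '         || l_used_blocks);
        dbms_output.put_line('used_bytes (chars for CLOBs): ' || l_used_bytes);
end;
/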

For the tablespaces we find: user_lobs.tablespace_name = TEST_8K_ASSM, user_lob_partitions.tablespace_name = _$deleted$36$0 and user_lob_subpartitions.tablespace_name = JF2. We also find that user_part_lobs.def_tablespace_name is null.

What happens if I try to add a subpartition to the existing p1 partition without first applying the fix:

SQL> alter table test modify partition p1 add subpartition sp2 values(2);
alter table test modify partition p1 add subpartition sp2 values(2)
            *
ERROR at line 1:
ORA-00959: tablespace '_$deleted$36$0' does not exist

Now try again but apply the fix before adding the subpartition:

SQL> alter table test modify default attributes for partition p1 lob(data) (tablespace jf2);

Table altered.

SQL> alter table test modify partition p1 add subpartition sp2 values(2);

Table altered.

Checking the data dictionary for the effects of changing the default attribute we find that user_lob_partitions.tablespace_name is now JF2, which has then been taken on by the new subpartition.

What about adding a new partition:

SQL> alter table test add partition p2 values less than (to_date('01-Jan-2010'));

Table altered.

SQL> select partition_name, chunk , tablespace_name from user_lob_partitions order by partition_name;

PARTITION_NAME              CHUNK TABLESPACE_NAME
---------------------- ---------- ------------------------------
P1                              2 JF2
P2                              1

SQL> select lob_partition_name, subpartition_name, chunk , tablespace_name from user_lob_subpartitions order by 1,2;

LOB_PARTITION_NAME   SUBPARTITION_NAME           CHUNK TABLESPACE_NAME
-------------------- ---------------------- ---------- ------------------------------
SYS_LOB_P20609       SP                          16384 JF2
SYS_LOB_P20609       SP2                         16384 JF2
SYS_LOB_P20622       SYS_SUBP20621                8192 TEST_8K_ASSM

The new partition has no tablespace_name, but it has automatically generated a subpartition (values (default)), which has climbed the tree to the table-level to set the tablespace for the LOB, and that had defaulted to the tablespace of the table itself, which was the user default tablespace of TEST_8K_ASSM. Maybe we should have modified the “default attributes lob(data)” at some point so that the user_part_lobs.def_tablespace_name was not null.
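
If we had wanted to do that, the command (the second of the three possible fixes listed earlier) and a quick check might look like this:

alter table test modify default attributes lob(data) (tablespace jf2);

select  column_name, def_tablespace_name
from    user_part_lobs
where   table_name = 'TEST'
;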

I’ll finish with just one more comment – you’ve seen how messy things can get and how much detail could be overlooked when handling marginally complex composite partitioned table. Do you really think that interval partitioning and automatic list partitioning are really going to mean you don’t have to worry about partition maintenance code? Possibly. If you plan to have one huge tablespace for all the bits and never have to worry about backing up and restoring that tablespace you will be able to forget about all the housekeeping code, but realistically you’ll need to know how to check and change the metadata and rename, move or otherwise manipulate segments so make sure you know what’s going to happen so that you don’t have to work it out when everyone’s running around in panic mode.

November 11, 2020

Indexing partitions

Filed under: Indexing,Oracle,Partitioning,Statistics — Jonathan Lewis @ 11:35 am GMT Nov 11,2020

This is one of those notes on the “thinking about the data / indexes” theme that I like to write; it’s a draft I wrote over two and a half years ago that I’ve just rediscovered and polished slightly and refers to a “recent” question that came up on the Oracle developer Forum. It’s not intended as “The Answer” to the question, it’s just a list of ideas and observations you’d want to think about if you had to address the problem:

There is a table t1 that is range partitioned by date, each partition represents one day, and the partitioning column c3_dt is constrained to be “date-only” (i.e. the time component is always 00:00:00) so there’s only one value per partition for the column. The table is a transaction table, and currently a single partition holds about 200 million rows. [Ed: And there seem to be a little over 200 partitions, with some variations in rows per partition.]

There are two local indexes on the table: idx1 is a unique index defined on (c1, c2, c3_dt) and idx2 is non-unique, defined on (c1, c2).

Could we drop one of these two indexes, and if so which one ?

There are three immediate considerations here:

  • One index is a prefix (i.e. same columns in the same order) of the other so, from a purely mechanical perspective, we could drop the shorter index. We would, however, have to be a little cautious because the optimizer might not want to use the longer index in places where it would use the shorter because (a) the index would physically be larger and (b) in the general case the clustering_factor of the shorter index is likely to be smaller than the clustering_factor of a longer index with a matching prefix, and the estimated cost of using an index is based largely on the leaf_block count and the clustering_factor.
  • The c3_dt column adds no value in terms of precision for accessing the table; any predicate that allows the optimizer to apply partition elimination will have the same effect whether or not the column is part of the index. The only queries that could have better performance because of the extra column would be queries involving only the columns in the index – even then you’d have to ensure that at least one column in the index was declared not null (and that might have been the c3_dt column). On the down-side the date column is adding 8 bytes per row (7 data bytes, 1 length byte) to the size of the index. So it would be nicer to drop the longer index if possible. On the up-side, since the date column is always the same value within a partition its presence at the end of the index should mean that the clustering_factor of idx1 should match that of idx2 (with, perhaps, some small variation due to random sampling).
  • You can’t drop the unique index unless it has been created “by accident”. It seems reasonably likely that the index is supporting a unique (or primary key) constraint. For a local index to be unique (or, more accurately, for a unique/primary key constraint to be supported by a local index) the constraint/index has to contain the partition key, and c3_dt is the partitioning column

Bottom line, then, is that you (probably) can’t drop idx1 because you need the uniqueness that comes with the c3_dt column but it should be possible (and would be sensible and probably fairly safe) to drop idx2 – the two-column index. Of course it would be sensible to make the index invisible for a while to see if any plans that were using idx2 refused to use idx1 instead; and it’s probably a good idea to create (or be prepared to create) column group statistics on the (c1, c2) combination so that the optimizer doesn’t lose the information about the number of distinct combinations.
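
As a sketch (using the index and table names from the question), the safety measures just described might look like this:

-- hide the index from the optimizer before committing to the drop
alter index idx2 invisible;

-- preserve the information about distinct (c1, c2) combinations
select dbms_stats.create_extended_stats(user, 'T1', '(C1,C2)') from dual;

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for columns (c1,c2) size 1'
        );
end;
/

-- and only when you're confident nothing has regressed:
drop index idx2;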

It’s a pity that the index that has to be kept is the one that has a “useless” column at the end, adding 8 bytes per row (totalling roughly 1.6GB per partition) to the size of the index – however the fact that the columns holds the same value for every row in the partition offers an interesting possibility. If you re-arrange the column order to (c3_dt, c1, c2) and compress the index on the first column, would all the queries that currently use the index still be happy to use it with an “index skip scan”? Before pursuing this idea, of course, you would want to check that the leading c1 column wasn’t there to protect a foreign key constraint from the “foreign key locking” problem.

The OP supplied some statistics about the three columns, which prompt a few more thoughts:

 
Partition level stats:- 
column_name data_type num_distinct                density
----------- --------- ------------   --------------------
C1          VARCHAR2          8754    0.00127388535031847 
c2          NUMBER         1398016    7.1529939571507E-7 
C3_dt       DATE                 1    2.43469886548297E-9 

Global statistics:-  
column_name data_type num_distinct                density 
----------- --------- ------------   --------------------
c1          VARCHAR2       1597649    6.25919710775020E-7 
c2          NUMBER         1996800    5.00801282051282E-7 
C3_dt       DATE               211    0.004739336492891 

Looking at these numbers we can make the following observations:

  • C3_DT: global num_distinct = 211 suggests 211 partitions and, at the global level, we see 1/num_distinct = density so there’s no global histogram. But at the partition level we see the density = 2.4e-9, which suggests there’s a (silly) histogram and there are 205 million rows in the partition (reverse engineering the density = 1/(2*num_rows) formula for “non-existent values” – see the arithmetic sketch after this list). We can’t be sure without doing some experimentation what effect this inconsistency might have on the optimizer’s choice of access path when we have to handle range-based (cross-partition) queries.
  • C1: num_distinct = 8754 for the current partition and 1,597,649 for the table, and the global figure happens to be very close to 1,847,094 ( = 211 * 8754): so perhaps there’s very little (or NO) overlap in c1 values between partitions – and that’s probably going to mess up the optimizer’s estimates for queries that cross partitions anyway. It’s also an indication that whichever index we keep could be compressed on the first column, and if we move to an index (c3_dt, c1, c2) the index could be compressed on the first two columns.
  • C2: has similar values for num_distinct at both the partition and the global level, which is nice – any anomalies introduced by variation between partitions are likely to be relatively small and may be nearly harmless. I often say that the optimizer likes boring data and can do a good job with it – c2 looks like it might be a nice boring column; though it’s possible that it’s hiding a nasty secret in the low / high values that we haven’t been told.
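
The reverse engineering mentioned in the C3_DT bullet is just the following arithmetic – if density = 1/(2 * num_rows) then num_rows = 1/(2 * density):

select  round(1/(2 * 2.43469886548297e-9)) partition_num_rows
from    dual
;

-- roughly 205 million, matching the estimate quoted above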

As I said at the outset, I’m not trying to answer the question “which index should I drop?”, I’m supplying some more questions that need answers before the original question can be answered sensibly. Along the way I’ve also demonstrated the way in which we can extract further information by looking at a few numbers and doing a little arithmetic.

 

 

October 10, 2020

Interval Oddity

Filed under: CBO,Execution plans,Oracle,Partitioning — Jonathan Lewis @ 2:51 pm BST Oct 10,2020

Interval partitioning is a popular strategy for partitioning date-based data. It’s an enhanced variant of range partitioning that allows you to define a starting partition and an interval that should be used to derive the high values for all subsequent partitions – and Oracle doesn’t even have to create intervening partitions if you insert data that goes far beyond the current partition, it automatically creates exactly the right partition (with the correct high_value and correctly inferred lower boundary) for the incoming data and behaves as if the intervening partitions will become available when they’re needed at some later point in time. So no need for DBAs to work out actual partition high_values, no need to ensure that all the partitions you need out into the future have been pre-created, no more batch processes crashing with Oracle error ORA-14400: inserted partition key does not map to any partition.

But there’s a surprising difference between traditional range partitioning and the newer interval partitioning that will increase CPU usage in some cases and may (though I haven’t yet investigated this in sufficient detail to create an example) lead to variations in execution plans.

To demonstrate the difference I’m going to create two tables with the same structure and content then run the same query against them and show you the resulting execution plans. Here’s the code to create and populate the tables:

rem
rem     Script:         interval_or_range.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Oct 2020
rem
rem     Last tested 
rem             19.3.0.0
rem


create table t_interval(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
interval (numtoyminterval(1,'MONTH'))
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy'))
        )
;

create table t_range(
        order_date      date not null,
        order_id        number(10,0) not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range (order_date) 
subpartition by hash (order_id) subpartitions 4
        (
                partition start_p1 values less than (to_date('01-Jan-2020','dd-mon-yyyy')),
                partition start_p2 values less than (to_date('01-Feb-2020','dd-mon-yyyy')),
                partition start_p3 values less than (to_date('01-Mar-2020','dd-mon-yyyy')),
                partition start_p4 values less than (to_date('01-Apr-2020','dd-mon-yyyy')),
                partition start_p5 values less than (to_date('01-May-2020','dd-mon-yyyy')),
                partition start_p6 values less than (to_date('01-Jun-2020','dd-mon-yyyy')),
                partition start_p7 values less than (to_date('01-Jul-2020','dd-mon-yyyy')),
                partition start_p8 values less than (to_date('01-Aug-2020','dd-mon-yyyy')),
                partition start_p9 values less than (to_date('01-Sep-2020','dd-mon-yyyy'))
        )
;

insert into t_range select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue  
;

insert into t_interval select
        to_date('01-Jan-2020','dd-mon-yyyy') + rownum,
        rownum,
        lpad(rownum,10,'0'),
        lpad('x',100,'x')
from
        all_objects
where
        rownum <= 240 -- > comment to avoid wordpress format issue
;

commit;

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_RANGE',
                method_opt  => 'for all columns size 1'
        );

        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T_INTERVAL',
                method_opt  => 'for all columns size 1'
        );
end;
/

I’ve used composite partitioned tables in this example but the same anomaly appears with simple partitioning – which you can test by commenting out the “subpartion by ..” lines. The two tables have partitions defined to hold a month’s data. I’ve inserted a few rows into most of the partitions, and ensured that I haven’t accidentally attempted to insert data that falls outside the legal range of the table with the predefined partitions .

To show that the tables are nominally identical here’s the list of partitions with their high values and number of rows:

column table_name       format a15
column partition_name   format a15
column high_value       format a80

break on table_name skip 1

select
        table_name, partition_name, num_rows, high_value
from
        user_tab_partitions
where
        table_name in ('T_INTERVAL','T_RANGE')
order by
        table_name, partition_name
;


TABLE_NAME      PARTITION_NAME    NUM_ROWS HIGH_VALUE
--------------- --------------- ---------- --------------------------------------------------------------------------------
T_INTERVAL      START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10722              30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10723              29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10724              31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10725              30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10726              31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10727              30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10728              31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                SYS_P10729              28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA

T_RANGE         START_P1                 0 TO_DATE(' 2020-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P2                30 TO_DATE(' 2020-02-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P3                29 TO_DATE(' 2020-03-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P4                31 TO_DATE(' 2020-04-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P5                30 TO_DATE(' 2020-05-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P6                31 TO_DATE(' 2020-06-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P7                30 TO_DATE(' 2020-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P8                31 TO_DATE(' 2020-08-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA
                START_P9                28 TO_DATE(' 2020-09-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIA


18 rows selected.

With this setup we can now run a simple query against the two tables using a where clause that is an exact match for the partition definition for the July data:

set serveroutput off

select 
        count(*) 
from 
        t_range 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

select 
        count(*) 
from 
        t_interval 
where   order_date >= date '2020-07-01'  -- to_date('01-Jul-2020','dd-mon-yyyy')
and     order_date <  date '2020-08-01'  -- to_date('01-Aug-2020','dd-mon-yyyy')
/

select * from table(dbms_xplan.display_cursor);

I’ve left in two versions of the date predicates – the “ANSI” style, and a strictly formatted “to_char()” style that is still the approached used most commonly in Oracle systems. The effect on the execution plans is the same for both variations of the date declaration, and here are the two execution plans – reported from an instance of 19.3.0.0:

---------------------------------------------------------------------------------------------------
| Id  | Operation               | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |         |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |         |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|         |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|   3 |    TABLE ACCESS FULL    | T_RANGE |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
---------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |            |       |       |   130 (100)|          |       |       |
|   1 |  SORT AGGREGATE         |            |     1 |     8 |            |          |       |       |
|   2 |   PARTITION RANGE SINGLE|            |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
|*  3 |    TABLE ACCESS FULL    | T_INTERVAL |    31 |   248 |   130   (3)| 00:00:01 |     8 |     8 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter(("ORDER_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))

It’s quite possible that you won’t notice the difference these plans, even though I’ve put them directly one below the other; and it’s very likely that most people wouldn’t think about the difference if they didn’t have the two plans available at the same time to compare.

If you can’t spot the difference it’s because I’ve deliberately arranged them in a way that might fool you into not looking carefully enough.

If you’ve spotted the difference it’s probably because you’ve examined the Predicate Information section. There isn’t one for the t_range example but there is one for the t_interval example – and it was a little sneaky of me to make it easy for you to assume that the one set of predicates I showed was common to the two plans.

Since the plans have been pulled from memory (v$sql_plan) it seems likely that they are truthful and the query based on the interval partitioning is actually checking every row in the partition against two date values. In my example that’s not going to make much difference to performance, but in a datawarehouse with millions of rows per partition there’s scope for a noticeable increase in CPU between the two queries – especially since the block reads are likely to be direct path, bypassing much of the usual (CPU-intensive) buffer cache activity.

Interestingly when I checked the 10053 trace files for the two queries the CPU cost for the pair was identical. Normally you expect to see at least a little CPU cost (even if it’s only tens of units out of millions) for checking a predicate. This raises the question – does the optimizer allow a cost for the t_range table for an event that isn’t going to happen, or is it reporting an event that isn’t going to happen for the t_interval table. [Edit: easy enough to check by doing the test on a simple heap table that clones the data from that one partition, and checking the calculated cost with and without the predicate]
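
A sketch of the check suggested in the edit – clone the July data into a simple heap table (using the “partition for” extension to pick the partition), then compare the costed plans with and without the predicates:

create table t_heap as
select  *
from    t_interval partition for (date '2020-07-15')
;

execute dbms_stats.gather_table_stats(user,'t_heap',method_opt=>'for all columns size 1')

explain plan for
select  count(*)
from    t_heap
where   order_date >= date '2020-07-01'
and     order_date <  date '2020-08-01'
;

select * from table(dbms_xplan.display);

explain plan for
select  count(*)
from    t_heap
;

select * from table(dbms_xplan.display);

-- compare the Cost columns of the two plans to see whether
-- the predicate check contributes any CPU cost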

There’s a further ramification to this anomaly, relating to the question on the Oracle-L list server that prompted the investigation. What happens on Exadata where the predicates can be offloaded to storage?

The original question wasn’t actually about the appearance (or not) of the predicates, it was about an unexpected cardinality estimate for a query involving two equi-partitioned tables, so this note has gone completely off-topic from the question; but another strange detail about the predicates showed up when I suggested the creation of a column group on the join columns. A (redundant) predicate that had been offloaded to storage stopped being offloaded; here, from the original posting with the original tables, are the two different Predicate Information sections that appeared – the first without the column group, the second after the column group had been created and its stats collected:

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - storage("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<=TO_DATE(' 2020-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("ORDER_DETL"."ORDR_RQST_NUMBER"="ORDER_REQST"."ORDR_RQST_NUMBER" AND
              "ORDER_DETL"."ORDR_RQST_DATE"="ORDER_REQST"."ORDR_RQST_DATE")
   6 - storage("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("ORDER_REQST"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_REQST"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   8 - filter("ORDER_DETL"."ORDR_RQST_DATE">=TO_DATE(' 2020-07-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "ORDER_DETL"."ORDR_RQST_DATE"<TO_DATE(' 2020-08-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Notice how the storage() predicate that appears at operation 8 in the first set has disappeared from the second even though operation 6 manages to retain its storage() predicate throughout.

I’ve modelled a further example of odd behaviour using two pairs of tables – one pair using range/hash partitioning the other interval/hash partitioning. I won’t reproduce it here but the problem of redundant predicates appearing and then disappearing gets even stranger.

I haven’t yet produced an example where the unexpected predicate behaviour has affected the costs or cardinalities – but I’ve only spent a couple of hours playing around with well-formed examples: it’s possible that in badly formed examples (e.g. with statistical inconsistencies) the side effect could mean that two notionally identical queries produce different numbers and different plans because they end up with different predicates in the final transformed query.

Footnote

Following an email from David Kurtz, it occurred to me that I should have made it clear that the disappearance of predicates on the partition key is expected behaviour when the predicates are clearly synchronised with the partition boundaries. This is why it’s a good idea to avoid the between operator when dealing with partitions – the content of a partition is “greater than or equal to” the high_value of the previous partition and “strictly less than” its own high_value whereas between expands to “greater than or equal to and less than or equal to”.
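
To spell out the difference with the tables from this note – the first pair of predicates lines up exactly with the July partition definition, the between version does not:

-- half-open range: matches the partition boundaries exactly,
-- so the predicates can be eliminated
select  count(*)
from    t_range
where   order_date >= date '2020-07-01'
and     order_date <  date '2020-08-01'
;

-- between expands to ">= and <=", so the upper bound no longer matches
-- the partition high_value and the predicate has to be retained
select  count(*)
from    t_range
where   order_date between date '2020-07-01' and date '2020-07-31'
;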

The predicate behaviour for the interval partitioning is the oddity, the behaviour for the “normal” range partitioning is expected.

July 15, 2020

Fetch First vs. Rownum

Filed under: Execution plans,Oracle,Partitioning,Performance,Problem Solving,Tuning — Jonathan Lewis @ 10:11 am BST Jul 15,2020

I’ve pointed out fairly frequently that if you’re running Standard Edition but would like to take advantage of a few features of the Partitioning option then you might be able to do something appropriate with Partition Views (but I’ve just discovered while searching my blog for a suitable item to link to that I haven’t published any of my PV notes on the blog).

I’ve also pointed out that while 12c allows you to use “fetch first N rows” instead of “where rownum <= N” there’s a hidden threat to using the feature because “fetch first N” turns into a hidden row_number() over() analytic function.

Today’s note is a combination of these two topics, prompted by a request to solve a particular coding issue that has appeared a few times on the Oracle Developer Forum and is likely to be recognisable to a number of people.

I’ll start with a very simple model demonstrateing the simplest use of “fetch first N”:


rem
rem     Script:         fetch_first_union.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jul 2020
rem     Purpose:        
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table t_odd
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'O'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum - 1                  id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

alter table t_odd modify(flag not null, class not null, id not null);
alter table t_odd add constraint to_chk_odd check (flag = 'O');

create index to_i1 on t_odd(class, id);

With this data set I want to write a query that selects rows for class A where id > 9500, ordered by id – but I only want the first two rows. Here’s a very simple query that gets the result I want, followed by the execution plan from 12.2.0.1 (the A-Rows and E-Rows from 19.3 are slightly different):

set serveroutput off
set linesize 180

alter session set statistics_level = all;

select  /*+ index(t_odd (class, id)) */
        *
from
        t_odd
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost'));

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |      1 |        |    95 (100)|      2 |00:00:00.01 |       6 |
|*  1 |  VIEW                         |       |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       6 |
|*  2 |   WINDOW NOSORT STOPKEY       |       |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       6 |
|   3 |    TABLE ACCESS BY INDEX ROWID| T_ODD |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       6 |
|*  4 |     INDEX RANGE SCAN          | TO_I1 |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       3 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "T_ODD"."CLASS","T_ODD"."ID")<=2)
   4 - access("CLASS"='A' AND "ID">9500)

The plan – with its stats – shows us that we’ve done an index range scan of an index which will return the data in exactly the order we want, and the “fetch first 2 rows” has been translated into the row_number() over() that we expected; but to our great joy the “window nosort stopkey” makes the processing stop very early because Oracle recognises that the base data is arriving in the right order so it isn’t necessary to fetch all of it and sort it. The A-Rows column confirms this interpretation of what has happened.

You might notice, by the way, that the optimizer has costed the query as if it were fetching all the rows even though it “knows” that it’s going to fetch only the first two rows. That’s why I had to include the index hint to make the optimizer use the obvious index – a popular alternative is to use the /*+ first_rows(N) */ hint where N matches, or is similar to, the number of rows you want to fetch. If I had omitted the hint the optimizer would have done a full tablescan and then applied a “window sort pushed rank” operation to sort and limit the result to 2 rows.
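
For example, swapping the index hint for the alternative just mentioned (a sketch – I haven’t shown its plan here):

select  /*+ first_rows(2) */
        *
from
        t_odd
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch
        first 2 rows only
;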

So now we come to the real problem: the user has a “current” table and an identical “history” table, and would like to replace the table reference with a reference to a union all view for their clients on Standard Edition, or to a partitioned table for clients running Enterprise Edition – and they don’t really want to do any other code changes. So let’s see what happens when we model the union all. I started with a table called t_odd that held only odd values for id, so I’m going to add a table called t_even that holds only even values for id.

create table t_even
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'E'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum                      id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

alter table t_even modify(flag not null, class not null, id not null);
alter table t_even add constraint te_chk_even check (flag = 'E');

create index te_i1 on t_even(class, id);

create or replace view v_bare 
as
select * from t_odd
union all
select * from t_even
/

select
        /*+ 
                index(vw.t_odd  (class, id)) 
                index(vw.t_even (class, id)) 
        */
        *
from
        v_bare vw
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;

As you can see t_even is an identically structured table with similar data, and I’ve created a union all view on top of the two tables, changing the query to reference the view rather than referencing a table. Thanks to the costing anomaly (combined with the small size of the tables) I’ve had to supply a couple of “global” hints to tell the optimizer to use the indexes to access the two tables. So how well does the optimizer do its job when we have a union all view?


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |        |      1 |        |   192 (100)|      2 |00:00:00.01 |     190 |       |       |          |
|*  1 |  VIEW                                   |        |      1 |      2 |   192   (2)|      2 |00:00:00.01 |     190 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK               |        |      1 |    404 |   192   (2)|      2 |00:00:00.01 |     190 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                                 | V_BARE |      1 |    404 |   191   (1)|    404 |00:00:00.01 |     190 |       |       |          |
|   4 |     UNION-ALL                           |        |      1 |        |            |    404 |00:00:00.01 |     190 |       |       |          |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_ODD  |      1 |    202 |    95   (0)|    202 |00:00:00.01 |      95 |       |       |          |
|*  6 |       INDEX RANGE SCAN                  | TO_I1  |      1 |    202 |     2   (0)|    202 |00:00:00.01 |       2 |       |       |          |
|   7 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_EVEN |      1 |    202 |    95   (0)|    202 |00:00:00.01 |      95 |       |       |          |
|*  8 |       INDEX RANGE SCAN                  | TE_I1  |      1 |    202 |     2   (0)|    202 |00:00:00.01 |       2 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "VW"."CLASS","VW"."ID")<=2)
   6 - access("CLASS"='A' AND "ID">9500)
   8 - access("CLASS"='A' AND "ID">9500)

Answer: Bad luck, the optimizer isn’t smart enough to find a cheap way through this query. It’s fetched all the relevant data from the two tables before applying the window sort (which it does with some efficiency – the pushed rank) to produce the right answer. As you can see from the A-Rows column, though, it’s had to acquire a couple of hundred rows from each table before getting down to the 2 rows we wanted.

Partitioned Tables

So let’s try to solve the problem by buying into the partitioning option and creating a list-partitioned table with two partitions, one flagged for current data and one flagged for history data – or ‘O’dd and ‘E’ven data as I’ve created them in my model.



create table t_pt (
        flag,
        class,
        id,
        v1,
        padding
)
partition by list (flag) (
        partition pO values('O'),
        partition pE values('E')
)
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'O'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum - 1                  id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;

insert into t_pt
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        'E'                             flag,
        chr(65 + mod(rownum,26))        class,
        2 * rownum                      id,
        lpad(2 * rownum,10,'0')         v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4   -- > comment to avoid WordPress format issue
;


create index tp_i1 on t_pt(class, id) local;
alter table t_pt modify (flag not null, class not null, id not null);

execute dbms_stats.gather_table_stats(user,'t_pt',method_opt=>'for all columns size 1', cascade=>true, granularity=>'ALL')

Note particularly that I have created a local index on this partitioned table – so there’s a very close correspondence between the two tables in the previous example and the two partitions in this example. Here’s the plan when I query the partitioned table for the first two rows:


select  /*+ index(t_pt (class, id)) */
        *
from
        t_pt
where
        class = 'A'
and     id > 9500
order by
        class, id
fetch 
        first 2 rows only
;


--------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |       |      1 |        |   190 (100)|      2 |00:00:00.01 |     189 |       |       |          |
|*  1 |  VIEW                                        |       |      1 |      2 |   190   (2)|      2 |00:00:00.01 |     189 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK                    |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |     189 |  2048 |  2048 | 2048  (0)|
|   3 |    PARTITION LIST ALL                        |       |      1 |    404 |   189   (1)|    404 |00:00:00.01 |     189 |       |       |          |
|   4 |     TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T_PT  |      2 |    404 |   189   (1)|    404 |00:00:00.01 |     189 |       |       |          |
|*  5 |      INDEX RANGE SCAN                        | TP_I1 |      2 |    404 |     4   (0)|    404 |00:00:00.01 |       4 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "T_PT"."CLASS","T_PT"."ID")<=2)
   5 - access("CLASS"='A' AND "ID">9500)

The optimizer has let us down again. The plan shows us that we have to acquire all the relevant data from the two partitions before applying the row_number() analytic function and discarding all but the two rows we want. (Again we can check the A-Rows column to see that we have started by fetching a total of 404 rows from the table.)

But what happens if we fall back to the good old-fashioned (non-standard) rownum method:


select
        *
from    (
        select  /*+ index(t_pt (class, id)) */
                *
        from
                t_pt
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        )
where
        rownum <= 2
;

----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                      | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                               |       |      1 |        |   190 (100)|      2 |00:00:00.01 |       6 |       |       |          |
|*  1 |  COUNT STOPKEY                                 |       |      1 |        |            |      2 |00:00:00.01 |       6 |       |       |          |
|   2 |   VIEW                                         |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |       6 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY                       |       |      1 |    404 |   190   (2)|      2 |00:00:00.01 |       6 |  2048 |  2048 | 2048  (0)|
|   4 |     PARTITION LIST ALL                         |       |      1 |    404 |   189   (1)|      4 |00:00:00.01 |       6 |       |       |          |
|*  5 |      COUNT STOPKEY                             |       |      2 |        |            |      4 |00:00:00.01 |       6 |       |       |          |
|   6 |       TABLE ACCESS BY LOCAL INDEX ROWID BATCHED| T_PT  |      2 |    404 |   189   (1)|      4 |00:00:00.01 |       6 |       |       |          |
|*  7 |        INDEX RANGE SCAN                        | TP_I1 |      2 |    404 |     4   (0)|      4 |00:00:00.01 |       4 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=2)
   3 - filter(ROWNUM<=2)
   5 - filter(ROWNUM<=2)
   7 - access("CLASS"='A' AND "ID">9500)

Oracle really knows how to use rownum well – notice how there is a count stopkey operation as a child to the partition list all operation, and that’s where our rownum <= 2 predicate is first applied. For each partition Oracle finds “the first two rows” and after it has collected two rows from every partition it sorts them (again with a stopkey) to find the top two in that subset. Check the A-Rows column – we selected a total of 4 rows from the table (2 per partition) and reduced that to 2 rows at operation 3.

Conclusion

There are some patterns of processing where partitioned tables can be a lot friendlier to coders than partition views; if you do have to stick with Standard Edition you can usually get what you want but the coding investment may be significantly higher. Even with partitioned tables, though, there are some “old-fashioned” Oracle methods that do a much nicer job than some of the new-fangled “ANSI” mechanisms.

Footnote

Part of the problem presented here revolves around the desire to keep a pattern of SQL generation that already exists, doing nothing more than replacing a table name with a view (or partitioned table) name.

As we’ve seen, if you start with a simple heap table and try to replace it with a partitioned table you have to use the rownum mechanism rather than the fetch first N rows mechanism.

If you’re running with Standard Edition you can’t do anything simple to replace a table name with the name of a union all view; you’d have to change your code generator to apply all the predicates twice (once for each table) and then apply the rownum predicate or fetch first directive again outside the union all. In other words you have to emulate exactly what Oracle EE manages to do with partitioned tables and rownum.

select
        flag, class, id, v1
from
        (
        select
                /*+ index(t_odd (class, id)) */
                flag, class, id, v1
        from
                t_odd
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        fetch
                first 2 rows only
        )
        union all
        (
        select
                /*+ index(t_even (class, id)) */
                flag, class, id, v1
        from
                t_even
        where
                class = 'A'
        and     id > 9500
        order by
                class, id
        fetch
                first 2 rows only
        )
order by
        class, id
fetch 
        first 2 rows only
;

-----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |        |      1 |        |   192 (100)|      2 |00:00:00.01 |       8 |       |       |          |
|*  1 |  VIEW                              |        |      1 |      2 |   192   (2)|      2 |00:00:00.01 |       8 |       |       |          |
|*  2 |   WINDOW SORT PUSHED RANK          |        |      1 |      4 |   192   (2)|      2 |00:00:00.01 |       8 |  2048 |  2048 | 2048  (0)|
|   3 |    VIEW                            |        |      1 |      4 |   191   (1)|      4 |00:00:00.01 |       8 |       |       |          |
|   4 |     UNION-ALL                      |        |      1 |        |            |      4 |00:00:00.01 |       8 |       |       |          |
|   5 |      VIEW                          |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|*  6 |       VIEW                         |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|*  7 |        WINDOW NOSORT STOPKEY       |        |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       4 | 73728 | 73728 |          |
|   8 |         TABLE ACCESS BY INDEX ROWID| T_ODD  |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       4 |       |       |          |
|*  9 |          INDEX RANGE SCAN          | TO_I1  |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       2 |       |       |          |
|* 10 |      VIEW                          |        |      1 |      2 |    95   (0)|      2 |00:00:00.01 |       4 |       |       |          |
|* 11 |       WINDOW NOSORT STOPKEY        |        |      1 |    202 |    95   (0)|      2 |00:00:00.01 |       4 | 73728 | 73728 |          |
|  12 |        TABLE ACCESS BY INDEX ROWID | T_EVEN |      1 |    202 |    95   (0)|      3 |00:00:00.01 |       4 |       |       |          |
|* 13 |         INDEX RANGE SCAN           | TE_I1  |      1 |    202 |     2   (0)|      3 |00:00:00.01 |       2 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("from$_subquery$_007"."rowlimit_$$_rownumber"<=2)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "from$_subquery$_006"."CLASS","from$_subquery$_006"."ID")<=2)
   6 - filter("from$_subquery$_003"."rowlimit_$$_rownumber"<=2)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "CLASS","ID")<=2)
   9 - access("CLASS"='A' AND "ID">9500)
  10 - filter("from$_subquery$_005"."rowlimit_$$_rownumber"<=2)
  11 - filter(ROW_NUMBER() OVER ( ORDER BY "CLASS","ID")<=2)
  13 - access("CLASS"='A' AND "ID">9500)


As you can see, the E-Rows column still predicts a lot of work, but the A-Rows column tells us the work was kept to the minimum we want.

 

February 18, 2020

Interval Partition(s)

Filed under: Oracle,Parallel Execution,Partitioning — Jonathan Lewis @ 1:45 pm GMT Feb 18,2020

A quirky little feature of interval partitioning showed up on Twitter today – a parallel insert that would only use a single PX slave to do the inserting. With 1.1 billion rows and the option for running parallel 32 this made the loading process rather slower than it ought to have been.

Fortunately it’s quite easy to model (and work around) the oddity. So here’s a small data set and an empty partitioned table to work with:


rem
rem     Script:         pt_int_load_anomaly.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem

create table t1 
nologging 
as
select 
        ao.* 
from 
        all_Objects ao, 
        (select rownum id from dual connect by level <= 20)
;

create table pt1
partition  by range (object_id) interval (1000000) (
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

I’ve created a table by copying all_objects 20 times which, for my little sandbox, has given me a total of about 1.2M rows. Then I’ve created an empty interval-partitioned clone of all_objects, with the first partition defined to hold all rows where the object_id is less than 1 (and there’s no object in the database that could match that criterion). I’ve defined the interval to be 1,000,000 and since the highest object_id in my database is about 90,000 the first partition that gets added to this table will be able to hold all the data from t1.
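Before running the insert it’s worth a trivial sanity check (my addition to the original script) to confirm those two claims – roughly 1.2M rows in t1, with every object_id far below the 1,000,000 interval boundary:

select  count(*), max(object_id)
from    t1
;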

So now we try to do a parallel insert from t1 into pt1, and check the execution plan and parallel execution statistics:


set serveroutput off

insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select * from t1;

select * from table(dbms_xplan.display_cursor);

start pq_tqstat

Note how I’ve used the hint /*+ enable_parallel_dml */ (possibly a 12c hint back-ported to 11.2.0.4) rather than using an “alter session” – it’s just a little convenience to be able to embed the directive in the SQL. The pq_tqstat script is one I published some time ago to report the contents of the session-specific dynamic performance view v$pq_tqstat immediately after running a parallel statement.
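If you don’t have a copy of that script, here’s a minimal sketch of the same idea – my reconstruction, not the original – where the row_share and data_share percentages are assumed to be simple ratio_to_report() calculations over each (dfo_number, tq_id, server_type) group:

break on dfo_number skip 1 on tq_id on server_type

select
        dfo_number, tq_id, server_type, instance, process,
        num_rows, bytes,
        round(100 * ratio_to_report(num_rows) over (
                partition by dfo_number, tq_id, server_type
        ), 2)   row_share,
        round(100 * ratio_to_report(bytes) over (
                partition by dfo_number, tq_id, server_type
        ), 2)   data_share,
        waits, timeouts, avg_latency
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, instance, process
;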

Here’s the plan:


SQL_ID  25hub68pf1z1s, child number 0
-------------------------------------
insert /*+ append enable_parallel_dml parallel(6) */ into pt1 select *
from t1

Plan hash value: 2888707464

-------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                            |          |       |       |   631 (100)|          |        |      |            |
|   1 |  PX COORDINATOR                             |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                       | :TQ10001 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    LOAD AS SELECT (HIGH WATER MARK BROKERED)| PT1      |       |       |            |          |  Q1,01 | PCWP |            |
|   4 |     OPTIMIZER STATISTICS GATHERING          |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                             |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,01 | PCWP |            |
|   6 |       PX SEND RANDOM LOCAL                  | :TQ10000 |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | P->P | RANDOM LOCA|
|   7 |        PX BLOCK ITERATOR                    |          |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWC |            |
|*  8 |         TABLE ACCESS FULL                   | T1       |  1235K|   159M|   631  (10)| 00:00:01 |  Q1,00 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   8 - access(:Z>=:Z AND :Z<=:Z)

Note
-----
   - Degree of Parallelism is 6 because of hint

The most important detail of this plan is that the PX slaves do the load as select (operation 3), then send a message to the query coordinator (PX send QC, operation 2) to tell it about the data load. They do not send their data to the QC for the QC to do the load.

So the plan says we will be doing parallel DML, but here’s what v$pq_tqstat tells us:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                215880   34785363     17.47      16.86         16          0           0
                                             1 P007                202561   34436325     16.39      16.69         17          0           0
                                             1 P008                207519   34564496     16.79      16.75         17          0           0
                                             1 P009                208408   34594770     16.86      16.77         17          0           0
                                             1 P00A                198915   33529627     16.10      16.25         16          0           0
                                             1 P00B                202537   34430603     16.39      16.69         16          0           0
                      Consumer               1 P000                     0        144      0.00       0.00         51         47           0
                                             1 P001                     0        144      0.00       0.00         51         47           0
                                             1 P002               1235820  206340464    100.00     100.00         75         47           0
                                             1 P003                     0        144      0.00       0.00         51         47           0
                                             1 P004                     0        144      0.00       0.00       1138       1134           0
                                             1 P005                     0        144      0.00       0.00       1137       1133           0

                    1 Producer               1 P000                     0         24      0.00       5.91         51         42           0
                                             1 P001                     0         24      0.00       5.91         50         41           0
                                             1 P002                     2        286    100.00      70.44         58         14           0
                                             1 P003                     0         24      0.00       5.91         51         43           0
                                             1 P004                     0         24      0.00       5.91         51         42           0
                                             1 P005                     0         24      0.00       5.91         51         43           0
                      Consumer               1 QC                       2        406    100.00     100.00        311        179           0

19 rows selected.

The query did run parallel 6 as hinted – and 6 PX slaves scanned the t1 table; but they all sent all their data to one PX slave in the second slave set and that one PX slave did all the inserts. The plan was parallel, but the execution was effectively serial. (You’ll note there is something a little odd about the waits and timeouts for P004 and P005 when they are acting as consumers. I may worry about that later, but it could be a host-based side effect of running parallel 6 on a VM with 4 CPUs.)

The serialization leads to two questions:

  1. What went wrong?
  2. How do we work around this and make the insert “truly” parallel?

My answer to (1) is “I don’t know – but I’ll look at it if necessary” combined with the guess that it’s something to do with the table having only one partition at the outset, and that this has an unexpected side effect on the randomising function for the PX distribution.

My answer to (2) is “if I’m right about (1), why not try pre-defining two partitions, and I’ll even let both of them stay empty”.

So here’s my new definition for pt1:


create table pt1
partition  by range (object_id) interval (1000000) (
        partition p0 values less than (0),
        partition p1 values less than (1)
)
as
select  * 
from    all_Objects
where   rownum = 0
/

Re-running the test with the completely redundant (and permanently empty) p0 partition, the plan doesn’t change but the results from v$pq_tqstat change dramatically:


DFO_NUMBER      TQ_ID SERVER_TYPE     INSTANCE PROCESS           NUM_ROWS      BYTES ROW_SHARE DATA_SHARE      WAITS   TIMEOUTS AVG_LATENCY
---------- ---------- --------------- -------- --------------- ---------- ---------- --------- ---------- ---------- ---------- -----------
         1          0 Producer               1 P006                207897   34581153     16.82      16.76         23          4           0
                                             1 P007                215669   34786429     17.45      16.86         30          5           0
                                             1 P008                221474   36749626     17.92      17.81         28          5           0
                                             1 P009                204959   34497164     16.58      16.72         22          2           0
                                             1 P00A                177755   30141002     14.38      14.61         21          0           0
                                             1 P00B                208066   35585810     16.84      17.25         25          2           0
                      Consumer               1 P000                213129   35612973     17.25      17.26         82         57           0
                                             1 P001                200516   33570586     16.23      16.27         84         55           0
                                             1 P002                203395   33950449     16.46      16.45         83         56           0
                                             1 P003                205458   34235575     16.63      16.59         82         54           0
                                             1 P004                204111   33999932     16.52      16.48        581        555           0
                                             1 P005                209211   34971669     16.93      16.95        580        553           0

                    1 Producer               1 P000                     2        286     16.67      16.67        422        149           0
                                             1 P001                     2        286     16.67      16.67        398        130           0
                                             1 P002                     2        286     16.67      16.67        405        128           0
                                             1 P003                     2        286     16.67      16.67        437        161           0
                                             1 P004                     2        286     16.67      16.67        406        116           0
                                             1 P005                     2        286     16.67      16.67        440        148           0
                      Consumer               1 QC                      12       1716    100.00     100.00        242        111           0



19 rows selected.

Every consumer receives and inserts roughly 200,000 rows – it’s a totally fair parallel DML. Timings are pretty irrelevant for such a small data set but the execution time did drop from 7 seconds to 4 seconds when parallelism was working “properly”.
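As a footnote to the workaround – and this is an untested suggestion on my part rather than something from the original test – once the load is complete the permanently empty p0 partition has done its job, so in principle you could tidy it away, leaving p1 as the single range partition that has to sit below the interval partitions:

alter table pt1 drop partition p0;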

I’ve tested this script on 12.2.0.1 and 19.3.0.0 – the same anomaly appears in both versions though it might be worth noting that the strange skew in the waits and timeouts doesn’t appear in 19.3.0.0.

February 14, 2020

Char problems

Filed under: Bugs,Oracle,Partitioning,Upgrades — Jonathan Lewis @ 3:25 pm GMT Feb 14,2020

The semantics of comparing character columns of different types can lead to some confusion, so before I get into the main body of this note here’s a little test based on a table with one row:


create table t1(c2 char(2), c3 char(3), vc2 varchar2(2), vc3 varchar2(3));

insert into t1 values ('XX','XX','XX','XX');
commit;

select count(*) c2_c3   from t1 where c2 = c3;
select count(*) c2_vc3  from t1 where c2 = vc3;
select count(*) c3_vc2  from t1 where c3 = vc2;
select count(*) c3_vc3  from t1 where c3 = vc3;

I’ve inserted one row, using the same value for every single column; then I’ve been counting the row(s) where various pairs of columns match. Which (if any) of the four queries return the value 1 and which (if any) return the value zero ?

To help you, here’s a quote from the relevant Oracle manual about character comparison semantics:

Blank-Padded and Nonpadded Comparison Semantics

With blank-padded semantics, if the two values have different lengths, then Oracle first adds blanks to the end of the shorter one so their lengths are equal. Oracle then compares the values character by character up to the first character that differs. The value with the greater character in the first differing position is considered greater. If two values have no differing characters, then they are considered equal. This rule means that two values are equal if they differ only in the number of trailing blanks. Oracle uses blank-padded comparison semantics only when both values in the comparison are either expressions of data type CHAR, NCHAR, text literals, or values returned by the USER function.

With nonpadded semantics, Oracle compares two values character by character up to the first character that differs. The value with the greater character in that position is considered greater. If two values of different length are identical up to the end of the shorter one, then the longer value is considered greater. If two values of equal length have no differing characters, then the values are considered equal. Oracle uses nonpadded comparison semantics whenever one or both values in the comparison have the data type VARCHAR2 or NVARCHAR2.

The first two queries return 1, the second two return zero.

  1. Query 1: c2 is blank padded to match c3 in length before the comparison, so the values are ‘XX {space}’
  2. Query 2: c2 is not padded, so the compared values are both ‘XX’
  3. Query 3: c3 is three characters long, vc2 is only 2 characters long and does not get padded to match c3
  4. Query 4: c3 is three characters long, vc3 is only 2 characters long and does not get padded to match c3
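If you want to see the stored values behind these results, a quick (if crude) check – a sketch I’ve added, not part of the original test – is to dump() the columns; the char(3) column should show a trailing space (character 32) that the varchar2 columns don’t have:

select
        dump(c2)        c2_dump,
        dump(c3)        c3_dump,
        dump(vc2)       vc2_dump,
        dump(vc3)       vc3_dump
from
        t1
;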

One interesting by-product of this example is this:

  • c3 = c2 and c2 = vc3 but c3 != vc3     whatever happened to transitive closure!

So we come to the point of the article, which is this:

Be very careful about using char() (or nchar) types in your tables – especially if you’re thinking of using columns of type [n]char() in join predicates (or predicates that don’t start life as join predicates but become join predicates through transitive closure).

Here’s an interesting bug that has appeared (I think) as a side effect of the need for blank-padded semantics. We start with two tables that I’ll be joining with a hash join – one of them will be a small table that will be used as the “build” table, the other will be (faked to look like) a large table that will be used as the “probe” table.


rem
rem     Script:         bloom_prune_char_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Jan 2020
rem
rem     Last tested 
rem             19.3.0.0
rem             12.2.0.1
rem

create table test_probe(status char(3)) partition by list(status) (partition st_1 values('00','BI'));

create table test_build(status char(2)); 

insert into test_build values('00');
insert into test_probe values('00');
insert into test_build values('BI');
insert into test_probe values('BI');

commit;
 
prompt  =====================================
prompt  Fake large table stats for test_probe
prompt  =====================================

exec dbms_stats.set_table_stats(null,'test_probe',numrows=>2000000);

spool bloom_prune_char_bug
set linesize 156
set pagesize 60

set serveroutput off

select  /*+ 
                gather_plan_statistics 
        */
        * 
from 
        test_build b,
        test_probe a 
where 
        a.status = b.status
;

select * from table(dbms_xplan.display_cursor(null,null,'projection partition allstats last'))
/


The two tables have a pair of matching rows – so the query should return two rows. But it doesn’t – it returns no rows, and the clue about why not is in the execution plan (which I’ve pulled from memory with lots of extra bits and pieces). Here’s the output from running this script (from the query onwards) on an instance of 12.2.0.1:


no rows selected


PLAN_TABLE_OUTPUT
-----------------------------------------------------------------
SQL_ID  2295z4p6m4557, child number 0
-------------------------------------
select /*+   gather_plan_statistics  */  * from  test_build b,
test_probe a where  a.status = b.status

Plan hash value: 177769189

--------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name       | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |            |      1 |        |       |       |      0 |00:00:00.01 |       7 |       |       |          |
|*  1 |  HASH JOIN               |            |      1 |   2000K|       |       |      0 |00:00:00.01 |       7 |  2078K|  2078K|  766K (0)|
|   2 |   PART JOIN FILTER CREATE| :BF0000    |      1 |      2 |       |       |      2 |00:00:00.01 |       7 |       |       |          |
|   3 |    TABLE ACCESS FULL     | TEST_BUILD |      1 |      2 |       |       |      2 |00:00:00.01 |       7 |       |       |          |
|   4 |   PARTITION LIST SINGLE  |            |      1 |   2000K|KEY(AP)|KEY(AP)|      0 |00:00:00.01 |       0 |       |       |          |
|   5 |    TABLE ACCESS FULL     | TEST_PROBE |      0 |   2000K|     1 |     1 |      0 |00:00:00.01 |       0 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("A"."STATUS"="B"."STATUS")

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=1) "B"."STATUS"[CHARACTER,2], "A"."STATUS"[CHARACTER,3]
   2 - INTERNAL_FUNCTION("B"."STATUS")[2], INTERNAL_FUNCTION("B"."STATUS")[2], "B"."STATUS"[CHARACTER,2]
   3 - "B"."STATUS"[CHARACTER,2]
   4 - (rowset=256) "A"."STATUS"[CHARACTER,3]
   5 - (rowset=256) "A"."STATUS"[CHARACTER,3]

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

The optimizer has used a Bloom filter to do partition pruning, and while operation 4 reports a “partition list single” operation using “and pruning” (the KEY(AP) in the Pstart/Pstop columns), operation 5 reports zero starts. This is because the Bloom filter has been used to determine that there are no relevant partitions!

Looking down at the (rarely examined) projection information we can see why – operation 2 (the “part join filter create”) has a strange “Internal Function” in its projection, and references B.STATUS as character[2]. It looks as if the Bloom filter that identifies partitions has been built using a char(2) as the input to its hashing function – which is bad news when the resulting filter is used to check the hash values generated from the partition definition, which is hashing a char(3).

If my thoughts about the mismatch in how the Bloom filters for the build and probe tables are built are correct, then a test that would help to confirm the hypothesis would be to disable Bloom filter pruning – which you can only do by setting a hidden parameter (possibly in a hint or SQL Patch):

select 
        /*+ 
                gather_plan_statistics 
                opt_param('_bloom_pruning_enabled','false') 
        */  
        * 
from 
        test_build b,
        test_probe a 
where
        a.status = b.status;

select * from table(dbms_xplan.display_cursor(null,null,'projection partition allstats last'))
/


ST STA
-- ---
00 00
BI BI

2 rows selected.


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------
SQL_ID  9whuurpkm3wpw, child number 0
-------------------------------------
select  /*+   gather_plan_statistics
opt_param('_bloom_pruning_enabled','false')   subquery_pruning(a)  */
* from  test_build b,  test_probe a where  a.status = b.status

Plan hash value: 787868928

------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name       | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |            |      1 |        |       |       |      2 |00:00:00.01 |      45 |       |       |          |
|*  1 |  HASH JOIN             |            |      1 |   2000K|       |       |      2 |00:00:00.01 |      45 |  2078K|  2078K|  866K (0)|
|   2 |   TABLE ACCESS FULL    | TEST_BUILD |      1 |      2 |       |       |      2 |00:00:00.01 |       7 |       |       |          |
|   3 |   PARTITION LIST SINGLE|            |      1 |   2000K|     1 |     1 |      2 |00:00:00.01 |      38 |       |       |          |
|   4 |    TABLE ACCESS FULL   | TEST_PROBE |      1 |   2000K|     1 |     1 |      2 |00:00:00.01 |      38 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access("A"."STATUS"="B"."STATUS")

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - (#keys=1) "B"."STATUS"[CHARACTER,2], "A"."STATUS"[CHARACTER,3]
   2 - (rowset=256) "B"."STATUS"[CHARACTER,2]
   3 - (rowset=256) "A"."STATUS"[CHARACTER,3]
   4 - (rowset=256) "A"."STATUS"[CHARACTER,3]

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

Adding the hint opt_param(‘_bloom_pruning_enabled’,’false’) to the query we get the right results and, of course, we can see that there is no operation in the execution plan to generate and use the Bloom filter that is probably causing the problem.

Conclusion

If you are going to use char() types in your tables, and if you are going to compare columns of type char() make sure that the columns are defined to be exactly the same length – or that you include an explicit cast() to guarantee that the shorter column appears to be the same length as the longer column.
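As a sketch of that second option – my addition, so treat it as illustrative rather than a tested fix for this particular bug – casting the shorter build column up to char(3) means both sides of the join predicate are char(3), so blank-padded semantics apply consistently:

select  /*+ gather_plan_statistics */
        *
from
        test_build b,
        test_probe a
where
        a.status = cast(b.status as char(3))
;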

Footnote

This bug appeared in my MOS “hot topics” email a couple of days ago as

Bug 27661222: WRONG RESULTS WITH PARTITION PRUNING COMPARING CHAR COLUMNS OF DIFFERENT LENGTH

Reading the bug note, the problem is described as a bug in “AND pruning” with a workaround of setting the hidden parameter “_and_pruning_enabled” to false (possibly through the opt_param() hint). I suspect that the underlying problem may be the Bloom filter itself and that disabling Bloom filter pruning for the query may be a slightly less aggressive workaround.
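For reference, here’s a sketch of that documented workaround in statement-level form (the hinted version is my assumption of how you’d apply the parameter to a single query):

select
        /*+
                gather_plan_statistics
                opt_param('_and_pruning_enabled','false')
        */
        *
from
        test_build b,
        test_probe a
where
        a.status = b.status
;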

The bug is reported as fixed in 20.1 – but you don’t need to upgrade just yet because, apart from the workarounds, there are various patches available back to 19.x and 12.2.

The sample script above is basically the example in the bug note with a few minor changes.

Update (March 2020)

Hot on the heels of the previous bug report, another one showed up today that related to problems with char() types and comparisons. The title of the bug isn’t an accurate description of the problem: Bug 29002488 : WRONG RESULT WITH A QUERY USING UNION ALL. However the note includes a test case which started life as a join to a union all.

The bug is reported as fixed in 20.1, but there are patches and backports into various versions of 12.2, 18c and 19c.

Update (March 2020 again)

Another related bug has shown up in my “hot topics” email: Bug 28910498 : WRONG RESULT WITH JOIN TABLES WITH CHAR COLUMNS OF DIFFERENT LENGTH. Again fixed in 20.1; apparently introduced in 12.2.0.1.

In this case the workaround is to set optimizer_features_enable back to 12.1.0.2 or 11.2.0.4, or to add the hint /*+ no_push_pred */ to the (invisible) query. The last workaround suggests, however, that the problem may be older than 12.2.0.1, and that setting OFE simply disables an enhancement to predicate pushing in 12.2 that allows the problem to become visible for the user’s query.
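For completeness, the session-level form of the first workaround is simply (a sketch):

alter session set optimizer_features_enable = '11.2.0.4';

and there is a matching optimizer_features_enable() hint if you need to restrict the change to a single statement.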


August 9, 2019

Split Partition

Filed under: Oracle,Partitioning,Performance,Tuning — Jonathan Lewis @ 1:02 pm BST Aug 9,2019

This is a little case study on “pre-emptive trouble-shooting”, based on a recent question on the Oracle database forum asking about splitting a range-based partition into two at a value above the current highest value recorded in the maxvalue partition.

The general principle for splitting (range-based) partitions is that if the split point is above the highest value currently stored in the partition Oracle will recognise that it can simply rename the existing partition and create a new, empty partition leaving all the indexes (including the global and globally partitioned indexes) in a valid state. There are, however, three little wrinkles to this particular request:

  • first is that the question relates to a system running 10g
  • second is that there is a LOB column in the table
  • third is that there is a requirement to have the new (higher value) partition(s) in a different tablespace

It’s quite possible that 10g won’t have all the partition maintenance capabilities of newer versions; if anything is going to go wrong then doing something with LOBs is always a highly dependable point of failure; and since all the examples in the manuals tend to be very simple, maybe any attempt to introduce complications like tablespace specifications will cause problems.

So, before you risk doing the job in production, what are you going to test?

In Oracle terms we want to check the following

  • Will Oracle have silently copied/rebuilt some segments rather than simply renaming old segments and creating new, empty segments?
  • Will the segments end up where we want them?
  • Will all the indexes stay valid?

To get things going, the OP had supplied a framework for the table, told us about two indexes, and given us two SQL statements that might be suitable candidates to do the split, stating that he (or she) had tested them and they both worked. Here’s the SQL (with a few tweaks) that creates the table and indexes. I’ve also added some data – inserting one row into each partition.

rem
rem     Script:         split_pt_lob.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2019
rem
rem     Last tested 
rem             12.2.0.1
rem             10.2.0.5
rem

define m_old_ts = 'test_8k'
define m_new_ts = 'assm_2'

drop table part_tab purge;

create table part_tab(
  pt_id            NUMBER,
  pt_name          VARCHAR2(30),
  pt_date          DATE default SYSDATE,
  pt_lob           CLOB,
  pt_status        VARCHAR2(2)
)
tablespace &m_old_ts
lob(pt_lob) store as (tablespace &m_old_ts)
partition by range (pt_date)
(
  partition PRT1 values less than (TO_DATE('2012-01-01', 'YYYY-MM-DD')),
  partition PRT2 values less than (TO_DATE('2014-09-01', 'YYYY-MM-DD')),
  partition PRT_MAX values less than (MAXVALUE)
)
/

alter table part_tab
add constraint pt_pk primary key(pt_id)
/

create index pt_i1 on part_tab(pt_date, pt_name) local
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    1,'one',to_date('01-Jan-2011'),rpad('x',4000),'X'
)
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    2,'two',to_date('01-Jan-2013'),rpad('x',4000),'X'
)
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    3,'three',to_date('01-Jan-2015'),rpad('x',4000),'X'
)
/

commit;

execute dbms_stats.gather_table_stats(null,'part_tab',cascade=>true,granularity=>'ALL')

We were told that

The table has
– Primary Key on pt_id column with unique index (1 Different table has FK constraint that refers to this PK)
– Composite index on pt_date and pt_name columns

This is why I’ve added a primary key constraint (which will generate a global index) and created an index on (pt_date,pt_name) – which I’ve created as a local index since it contains the partitioning column.

The description of the requirement was:

  • The Task is to split partition PRT_MAX to a different tablespace
  • New partition(s) won’t have data at the moment of creation

And the two “tested” strategies were:

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts,
        PARTITION PRT_MAX tablespace &m_new_ts
);

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts LOB (pt_lob) store as (TABLESPACE &m_old_ts), 
        PARTITION PRT_MAX tablespace &m_new_ts LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
;
 

If we’re going to test these strategies properly we will need queries similar to the following:


break on object_name skip 1
select object_name, subobject_name, object_id, data_object_id  from user_objects order by object_name, subobject_name;

break on index_name skip 1
select index_name, status from user_indexes;
select index_name, partition_name, status from user_ind_partitions order by index_name, partition_name;

break on segment_name skip 1
select segment_name, partition_name, tablespace_name from user_segments order by segment_name, partition_name;

  • First – what are the object_id and data_object_id for each object before and after the split? Did Oracle create new “data objects” while doing the split or has it simply changed the name of an existing data (physical) object?
  • Secondly – are there any indexes or index partitions that are no longer valid?
  • Finally – which tablespaces do the physical objects reside in?

From a test run of the first, simpler, split statement here are the before and after results for the object_id and data_object_id, followed by the post-split results for index and segment details:


Before Split
============

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT_MAX                     23679          23679
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT_MAX                     23692          23692
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P254                 23687          23687

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P251                23683          23683
                                                             23680          23680

After split
===========

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT3                        23693          23679
                                 PRT_MAX                     23679          23694
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT3                        23700          23692
                                 PRT_MAX                     23699          23699
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P257                 23697          23687
                                 SYS_IL_P258                 23698          23698

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P255                23695          23683
                                 SYS_LOB_P256                23696          23696
                                                             23680          23680


INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023676C00004$$         N/A


INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023676C00004$$         SYS_IL_P252            USABLE
                                 SYS_IL_P253            USABLE
                                 SYS_IL_P257            USABLE
                                 SYS_IL_P258            USABLE


SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023676C00004$$  SYS_IL_P252            TEST_8K
                          SYS_IL_P253            TEST_8K
                          SYS_IL_P257            TEST_8K
                          SYS_IL_P258            TEST_8K

SYS_LOB0000023676C00004$$ SYS_LOB_P249           TEST_8K
                          SYS_LOB_P250           TEST_8K
                          SYS_LOB_P255           TEST_8K
                          SYS_LOB_P256           TEST_8K

Before the split, partition PRT_MAX – with 4 segments (table, index, LOB, LOBINDEX) – has object_id = data_object_id, with the values: 23679 (table), 23692 (index), 23683 (LOB), 23687 (LOBINDEX); and after the split these reappear as the data_object_id values for partition PRT3 (though the new object_id values are larger than the data_object_id values) – so we infer that Oracle has simply renamed the various PRT_MAX objects to PRT3 and created new, empty PRT_MAX objects.

We can also see that all the indexes (including the global primary key index) have remained valid. We also note that the data_object_id of the primary key index has not changed, so Oracle didn’t have to rebuild it to ensure that it stayed valid.

There is a problem, though: the LOB and LOBINDEX segments for the new PRT_MAX partition are not in the desired target tablespace. So we need to check the effects of the second version of the split command where we add the specification of the LOB tablespaces. This is what we get – after rerunning the entire test script from scratch:


Before Split
============

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT_MAX                     23729          23729
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT_MAX                     23742          23742
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P274                 23737          23737

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P271                23733          23733
                                                             23730          23730


After Split
===========

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT3                        23743          23743
                                 PRT_MAX                     23729          23744
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT3                        23750          23750
                                 PRT_MAX                     23749          23749
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P277                 23747          23747
                                 SYS_IL_P278                 23748          23748

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P275                23745          23745
                                 SYS_LOB_P276                23746          23746
                                                             23730          23730

INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            UNUSABLE
SYS_IL0000023726C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   UNUSABLE
                                 PRT_MAX                USABLE

SYS_IL0000023726C00004$$         SYS_IL_P272            USABLE
                                 SYS_IL_P273            USABLE
                                 SYS_IL_P277            USABLE
                                 SYS_IL_P278            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023726C00004$$  SYS_IL_P272            TEST_8K
                          SYS_IL_P273            TEST_8K
                          SYS_IL_P277            TEST_8K
                          SYS_IL_P278            ASSM_2

SYS_LOB0000023726C00004$$ SYS_LOB_P269           TEST_8K
                          SYS_LOB_P270           TEST_8K
                          SYS_LOB_P275           TEST_8K
                          SYS_LOB_P276           ASSM_2


Before looking at the more complex details the first thing that leaps out to hit the eye is the word UNUSABLE – which appears for the status of the (global) primary key index and the PRT3 index subpartition. The (empty) PRT_MAX LOB and LOBINDEX partitions are where we wanted them but by specifying the location we seem to have broken two index segments that will need to be rebuilt.
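(If you did end up in this state the repairs would be along the following lines – a sketch using the names from this test; on a production-sized partition, of course, these could be expensive operations:)

alter index pt_pk rebuild;
alter index pt_i1 rebuild partition prt3;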

It gets worse, because if we check the data_object_id of the original PRT_MAX partition (23729) and its matching index partition (23742) we see that they don’t correspond to the (new) PRT3 data_object_id values which are 23743 and 23750 respectively – the data has been physically copied from one data object to another completely unnecessarily; moreover the same applies to the LOB and LOBINDEX segments – the data object ids for the PRT_MAX LOB and LOBINDEX partitions were 23733 and 23737, while the new PRT3 data object ids are 23745 and 23747.

If you did a test with only a tiny data set you might not notice the implicit threat that these changes in data_object_id tell you about – you’re going to be copying the whole LOB segment when you don’t need to.

Happy Ending (maybe)

A quick check with 12.2 suggested that Oracle had got much better at detecting that it didn’t need to copy LOB data and invalidate indexes with the second form of the code; but the OP was on 10g – so that’s not much help. However, it was the thought that Oracle might misbehave when you specified tablespaces that made me run up this test – in particular I had wondered if specifying a tablespace for the partition that would end up holding all the existing data might trigger an accident, so here’s a third variant of the split statement I tested, with the results on the indexes, segments, and data objects. Note that I specify the tablespace only for the new (empty) partition:


alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
    PARTITION PRT3,
    PARTITION PRT_MAX tablespace &m_new_ts  LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
/

Before Split
============

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT_MAX                     23754          23754
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT_MAX                     23767          23767
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P284                 23762          23762

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P281                23758          23758
                                                             23755          23755

After Split
===========

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT3                        23768          23754
                                 PRT_MAX                     23754          23769
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT3                        23775          23767
                                 PRT_MAX                     23774          23774
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P287                 23772          23762
                                 SYS_IL_P288                 23773          23773

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P285                23770          23758
                                 SYS_LOB_P286                23771          23771
                                                             23755          23755
INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023751C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023751C00004$$         SYS_IL_P282            USABLE
                                 SYS_IL_P283            USABLE
                                 SYS_IL_P287            USABLE
                                 SYS_IL_P288            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023751C00004$$  SYS_IL_P282            TEST_8K
                          SYS_IL_P283            TEST_8K
                          SYS_IL_P287            TEST_8K
                          SYS_IL_P288            ASSM_2

SYS_LOB0000023751C00004$$ SYS_LOB_P279           TEST_8K
                          SYS_LOB_P280           TEST_8K
                          SYS_LOB_P285           TEST_8K
                          SYS_LOB_P286           ASSM_2

All the indexes and index partitions stay valid; the new empty segments all end up in the target tablespace, and all the data object ids for the old PRT_MAX partitions became the data object ids for the new PRT3 partitions. Everything we want, and no physical rebuilds of any data sets.

Moral:

When you’re testing, especially when you’re doing a small test while anticipating a big data set, don’t rely on the clock; check the data dictionary (and trace files, if necessary) carefully to find out what activity actually took place.
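As an example of the sort of dictionary check I mean, here’s a minimal sketch against user_objects using the object names from this test. Run it before and after the maintenance operation: a data_object_id that survives the operation tells you the segment was reused rather than rebuilt.


select
        object_name, subobject_name, object_id, data_object_id
from
        user_objects
where
        object_name in ('PART_TAB', 'PT_PK', 'PT_I1')
or      object_name like 'SYS_LOB%'
or      object_name like 'SYS_IL%'
order by
        object_name, subobject_name
;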

Footnote:

It’s possible that there are ways to fiddle around with the various default attributes of the partitioned table to get the same effect – but since 12.2 is much better behaved anyway there’s no point in me spending more time looking for alternative solutions to a 10g problem.

 

May 27, 2019

Re-partitioning 2

Filed under: 12c,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 8:20 pm BST May 27,2019

Last week I wrote a note about turning a range-partitioned table into a range/list composite partitioned table using features included in 12.2 of Oracle. But my example was really just an outline of the method and bypassed a number of the little extra problems you’re likely to see in a real-world system, so in this note I’m going to bring in an issue that you might run into – and which I’ve seen appearing a number of times: ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION.

It’s often the case that a system has a partitioned table that’s been around for a long time, and over its lifetime it may have had (real or virtual) columns added, made invisible, dropped, or marked unused. As a result you may find that the apparent definition of the table is not the same as the real definition of the table – and that’s why Oracle has given us (in 12c) the option to “create table for exchange”.
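If you want to see that drift for yourself, here’s a hedged little demo (the table name demo_tab is invented for the illustration): mark a column unused and it lingers as a hidden column that a simple “create table as select” would not reproduce.


create table demo_tab (c1 number, c2 number, c3 number);
alter table demo_tab set unused column c2;

select  column_name, internal_column_id, hidden_column
from    user_tab_cols
where   table_name = 'DEMO_TAB'
order by
        internal_column_id
;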

If you have a support account you might like to read a MOS note giving one example of a problem with creating an exchange table prior to this new feature: ORA-14097 At Exchange Partition After Adding Column With Default Value (Doc ID 1334763.1). I’ve created a little model by cloning the code from that note.


rem
rem     Script:         pt_exchange_problem.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table mtab (pcol number)
partition by list (pcol) (
        partition p1 values (1),
        partition p2 values (2)
);

alter table mtab add col2 number default 0 not null;

prompt  ========================================
prompt  Traditional creation method => ORA-14097
prompt  ========================================

create table mtab_p2 as select * from mtab where 1=0;
alter table mtab exchange partition P2 with table mtab_p2;

prompt  ===================
prompt  Create for exchange
prompt  ===================

drop table mtab_p2 purge;
create table mtab_p2 for exchange with table mtab;
alter table mtab exchange partition P2 with table mtab_p2;

Here’s the output from running this on an instance of 18.3:

Table created.

Table altered.

========================================
Traditional creation method => ORA-14097
========================================

Table created.

alter table mtab exchange partition P2 with table mtab_p2
*
ERROR at line 1:
ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION

===================
Create for exchange
===================

Table dropped.


Table created.


Table altered.

So we don’t have to worry about problems creating an exchange table in Oracle 12c or later. But we do still have a problem if we’re trying to convert our range-partitioned table into a range/list composite partitioned table using the “double-exchange” method. In my simple example I used a “create table” statement to create an empty table that we could exchange into; but without another special version of a “create table” command I won’t be able to create a composite partitioned table that is compatible with the simple table that I want to use as my intermediate table.

Here’s the solution to that problem – first in a thumbnail sketch:

  • create a table for exchange (call it table C)
  • alter table C modify to change it to a composite partitioned table with one subpartition per partition
  • create a table for exchange (call it table E)
  • Use table E to exchange partitions from the original table to the (now-partitioned) table C
  • Split each partition of table C into the specific subpartitions required

And now some code to work through the details – first the code to create and populate the partitioned table.


rem
rem     Script:         pt_comp_from_pt_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

drop table t purge;
drop table pt_range purge;
drop table pt_range_list purge;

-- @@setup

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum-1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
	rownum <= 600 -- > comment to avoid WordPress format issue
;

commit;

Then some code to create the beginnings of the target composite partitioned table. We create a simple heap table “for exchange”, then modify it to be a composite partitioned table with a named starting partition and high_value, a template defining a single subpartition and, as a variant on the example from last week, interval partitioning.


prompt	==========================================
prompt	First nice feature - "create for exchange"
prompt	==========================================

create table pt_range_list for exchange with table pt_range;

prompt	============================================
prompt	Now alter the table to composite partitioned
prompt	============================================

alter table pt_range_list modify
partition by range(id) interval (200)
subpartition by list (grp) 
subpartition template (
        subpartition p_def      values(default)
)
(
	partition p200 values less than (200)
)
;

If you want to do the conversion from range partitioning to interval partitioning you will have to check very carefully that your original table will be able to convert safely – which means you’ll need to check that the “high_value” values for the partitions are properly spaced to match the interval you’ve defined and (as a special requirement for the conversion) there are no omissions from the current list of high values. If your original table doesn’t match these requirements exactly you may end up trying to exchange data into a partition where it doesn’t belong; for example, if my original table had partitions with high values of 200, 600, 800 then there may be values in the 200-399 range currently stored in the original “600” range partition which shouldn’t go into the new “600” interval partition. You may find you have to split (and/or merge) a few partitions in your range-partitioned table before you can do the main conversion.
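As a starting point for that check, here’s a simple sketch listing the partition boundaries. (high_value is a LONG column, so a “set long” is needed in SQL*Plus.)


set long 4000

select  partition_position, partition_name, high_value
from    user_tab_partitions
where   table_name = 'PT_RANGE'
order by
        partition_position
;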

Now we create the table that we’ll actually use for the exchange and go through each exchange in turn. Because I’ve got an explicitly named starting partition the first exchange takes only two steps – exchange out, exchange in. But because I’m using interval partitioning in the composite partitioned table I’m doing a “lock partition” before the second exchange on all the other partitions, as this will bring the required target partition into existence. I’m also using the “[sub]partition for()” syntax to identify the pairs of [sub]partitions – this isn’t necessary for the original range-partitioned table, of course, but it’s the only way I can identify the generated subpartitions that will appear in the composite partitioned table.


create table t for exchange with table pt_range;

prompt	=======================================================================
prompt	Double exchange to move a partition to become a composite subpartition
prompt	Could drive this programmatically by picking one row from each partition
prompt	=======================================================================

alter table pt_range exchange partition p200 with table t;
alter table pt_range_list exchange subpartition p200_p_def with table t;

alter table pt_range exchange partition for (399) with table t;
lock  table pt_range_list partition for (399) in exclusive mode;
alter table pt_range_list exchange subpartition for (399,'0') with table t;

alter table pt_range exchange partition for (599) with table t;
lock  table pt_range_list partition for (599) in exclusive mode;
alter table pt_range_list exchange subpartition for (599,'0') with table t;

prompt	=====================================
prompt	Show that we've got the data in place
prompt	=====================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

Now that the data is in the target table we can step through each partition in turn, splitting its single subpartition into the four subpartitions that we want. To cater for the future, though, I’ve first modified the subpartition template so that any future partitions will have four subpartitions (though the naming convention won’t be applied, of course: Oracle will use system-generated names for all new partitions and subpartitions).


prompt  ================================================
prompt  Change the subpartition template to what we want
prompt  ================================================

alter table pt_range_list
set subpartition template(
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
;

prompt  ====================================================
prompt  Second nice feature - multiple splits in one command
prompt  Again, first split is fixed name.
prompt  We could do this online after allowing the users in
prompt  ====================================================

alter table pt_range_list split subpartition p200_p_def
        into (
                subpartition p200_p_0 values(0),
                subpartition p200_p_1 values(1),
                subpartition p200_p_2 values(2),
                subpartition p200_p_def
        )
;

alter table pt_range_list split subpartition for (399,'0')
        into (
                subpartition p400_p_0 values(0),
                subpartition p400_p_1 values(1),
                subpartition p400_p_2 values(2),
                subpartition p400_p_def
        )
;

alter table pt_range_list split subpartition for (599,'0')
        into (
                subpartition p600_p_0 values(0),
                subpartition p600_p_1 values(1),
                subpartition p600_p_2 values(2),
                subpartition p600_p_def
        )
;

Finally a little demonstration that we can’t add an explicitly named partition to the interval partitioned table; then we insert a row to generate the partition and show that it has 4 subpartitions.

Then we rename everything (though that’s a fairly pointless exercise).


prompt  ==============================================================
prompt  Could try adding a partition to show it uses the new template
prompt  But that's not allowed for interval partitions: "ORA-14760:"
prompt  ADD PARTITION is not permitted on Interval partitioned objects
prompt  So insert a value that would go into the next (800) partition
prompt  ==============================================================

alter table pt_range_list add partition p800 values less than (800);

insert into pt_range_list (
        id, grp, small_vc, padding
)
values ( 
        799, '0', lpad(799,10,'0'), rpad('x',100,'x')
)
;

commit;

prompt  ===================================================
prompt  Template naming is not used for the subpartitions,
prompt  so we have to use the "subpartition for()" strategy 
prompt  ===================================================

alter table pt_range_list rename subpartition for (799,'0') to p800_p_0;
alter table pt_range_list rename subpartition for (799,'1') to p800_p_1;
alter table pt_range_list rename subpartition for (799,'2') to p800_p_2;
alter table pt_range_list rename subpartition for (799,'3') to p800_p_def;

prompt  ==============================================
prompt  Might as well clean up the partition names too
prompt  ==============================================

alter table pt_range_list rename partition for (399) to p400;
alter table pt_range_list rename partition for (599) to p600;
alter table pt_range_list rename partition for (799) to p800;

prompt  =======================================
prompt  Finish off by listing the subpartitions 
prompt  =======================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

It’s worth pointing out that you could do the exchanges (and the splitting and renaming at the same time) through some sort of simple PL/SQL loop – looping through the named partitions in the original table and using a row from the first exchange to drive the lock and second exchange (and splitting and renaming). For example, something like the following, which doesn’t have any of the error-trapping and defensive mechanisms you’d want to use on a production system:



declare
        m_pt_val number;
begin
        for r in (select partition_name from user_tab_partitions where table_name = 'PT_RANGE' order by partition_position) 
        loop
                execute immediate
                        'alter table pt_range exchange partition ' || r.partition_name ||
                        ' with table t';
        
                select id into m_pt_val from t where rownum = 1;
        
                execute immediate 
                        'lock table pt_range_list partition for (' || m_pt_val || ') in exclusive mode';
        
                execute immediate
                        'alter table pt_range_list exchange subpartition for (' || m_pt_val || ',''0'')' ||
                        ' with table t';
        
        end loop;
end;
/

If you do go for a programmed loop you have to be really careful to consider what could go wrong at each step of the loop and how your program is going to report (and possibly attempt to recover) the situation. This is definitely a case where you don’t want code with “when others then null” appearing anywhere, and don’t be tempted to include code to truncate the exchange table.

 

May 24, 2019

Re-partitioning – 18

Filed under: 18c,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 12:50 pm BST May 24,2019

In yesterday’s note on the options for converting a range-partitioned table into a composite range/list partitioned table I mentioned that you could do this online with a single command in 18c, so here’s some demonstration code to support that claim:


rem
rem     Script:         pt_comp_from_pt_18.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum - 1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
        rownum <= 600 -- > comment to avoid WordPress format issue
;

commit;

alter table pt_range modify
partition by range(id) interval (200)
subpartition by list (grp) 
subpartition template (
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
(
        partition p200 values less than (200)
)
-- online
;

execute dbms_stats.gather_table_stats(null, 'pt_range', granularity=>'all')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE'
order by
        partition_name, subpartition_name
;

Run this (with or without the online option) and you’ll (probably) see the Oracle error “ORA-00604: error occurred at recursive SQL level 1” with one of two underlying errors:

    ORA-01950: no privileges on tablespace 'SYSTEM'
    ORA-01536: space quota exceeded for tablespace 'SYSTEM'

So what’s gone wrong? It ought to work.
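For anyone who wants to reproduce the diagnosis, here’s a minimal sketch of enabling extended SQL trace before repeating the command (the tracefile identifier is just an invented label; you need the alter session privilege):


alter session set tracefile_identifier = 'pt_modify';
alter session set events '10046 trace name context forever, level 8';

-- repeat the failing "alter table pt_range modify ..." statement here

alter session set events '10046 trace name context off';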

After enabling a 10046 trace I repeated the “alter table” command then scanned the trace file for the text “err=1950” (that being the error I’d received on my first attempt) and scanned backwards for the “PARSING IN CURSOR” line with a matching cursor id:


ERROR #139721552722200:err=1950 tim=26541227462

PARSING IN CURSOR #139721552722200 len=182 dep=1 uid=104 oct=1 lid=0 tim=26541224560 hv=2451601965 ad='7f1377267890' sqlid='0wsjfgk920yjd'
create table  "TEST_USER"."SYS_RMTAB$$_H124028"  ( src_rowid rowid not null , tgt_rowid rowid not null) 
    segment creation immediate nologging 
    tablespace  "SYSTEM"  
    rowid_mapping_table
END OF STMT

The code is trying to create a “rowid_mapping_table” in the system tablespace and I have no quota for the tablespace. (The 124028 in the table name relates to the object_id of the table I was trying to modify, by the way.)

The source of the error offered a big clue about a possible workaround, so I gave myself a quota (unlimited) on the system tablespace (alter user test_user quota unlimited on system) and that made it possible for the restructuring to take place. It’s not really an appropriate workaround for a production system though – especially if you’re using the online option and the table is subject to a lot of change. (Note – this “rowid_mapping_table” and a “journal” table are created even if you haven’t selected the online option.)
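Putting the workaround (and its awkward near-reversal, discussed in the footnotes below) into a couple of statements – a sketch to be run by a suitably privileged user, not something to leave in place on production:


alter user test_user quota unlimited on system;

-- re-run the "alter table pt_range modify ..." command as test_user, then:

alter user test_user quota 0 on system;         -- the nearest thing to revoking the quota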

Footnotes

  • The problem has been fixed in 19c (tested on LiveSQL)  and is listed on MoS as Bug 27580976 : INTERNAL RECURSIVE MAPPING TABLE MISTAKENLY PLACE IN SYSTEM FOR ONLINE OPS. There are no patches for 18c at present.
  • After I’d finished testing the quota workaround I tried to deprive myself of the quota on the system tablespace. I may have missed something in the manuals but it looks like the only way to do this is to give myself a zero quota (or, as I have done occasionally in the past, drop user cascade) because there’s no option for “quota denied” or “revoke quota”. This is why you may get one of two different messages after the ORA-00604. If you’ve never had a quota on the system tablespace you’ll get the “ORA-01950: no privileges” message; if you’ve had a quota at some time in the past and then had it set to zero you’ll get the “ORA-01536: space quota exceeded” message.

 

May 23, 2019

Re-partitioning

Filed under: 12c,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 11:45 am BST May 23,2019

I wrote a short note a little while ago demonstrating how flexible Oracle 12.2 can be about physically rebuilding a table online to introduce or change the partitioning while discarding data, and so on. But what do you do (as a recent question on ODC asked) if you want to upgrade a customer’s database to meet the requirements of a new release of your application by changing a partitioned table into a composite partitioned table but don’t have enough room to do an online rebuild – which could require two copies of the data to exist at the same time?

If you’ve got the down time (and not necessarily a lot is needed) you can fall back on “traditional methods” with some 12c enhancements. Let’s start with a range partitioned table:


rem
rem     Script:         pt_comp_from_pt.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum-1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
        rownum <= 600
;

commit;

So we’ve got a range-partitioned table with three partitions and some data in each partition. Let’s pretend we want to change this to range/list with the grp column as the subpartition key, allowing explicit use of values 0,1,2 and a bucket subpartition for anything else. First we create an empty version of the table with a suitable subpartition template, and a simple heap table to be used as an exchange medium:


create table pt_range_list (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id)
subpartition by list (grp)
subpartition template (
        subpartition p_def      values(default)
)
(
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

prompt  ===============================================
prompt  First nice 12.2 feature - "create for exchange"
prompt  ===============================================

create table t for exchange with table pt_range;

You’ll notice that our subpartition template identifies just a single subpartition that takes default values – i.e. anything for which no explicit subpartition has been identified. This means we have a one-to-one correspondence between the data segments of the original table and the copy table. So now we go through a tedious loop (which we could code up with a PL/SQL “execute immediate” approach) to do a double-exchange for each partition in turn. (Any PL/SQL code is left as an exercise to the interested reader.)


alter table pt_range exchange partition p200 with table t;
alter table pt_range_list exchange subpartition p200_p_def with table t;

alter table pt_range exchange partition p400 with table t;
alter table pt_range_list exchange subpartition p400_p_def with table t;

alter table pt_range exchange partition p600 with table t;
alter table pt_range_list exchange subpartition p600_p_def with table t;

prompt  =====================================
prompt  Show that we've got the data in place
prompt  =====================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows
from    user_tab_subpartitions
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;


PARTITION_NAME         SUBPARTITION_NAME        NUM_ROWS
---------------------- ---------------------- ----------
P200                   P200_P_DEF                    200

P400                   P400_P_DEF                    200

P600                   P600_P_DEF                    200


3 rows selected.

We now have to split the newly arrived subpartitions into the 4 pieces we want – but before we do that let’s make sure that any new partitions automatically have the correct subpartitions by changing the subpartition template:


alter table pt_range_list
set subpartition template(
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
;

prompt  =========================================================
prompt  Second nice 12.2 feature - multiple splits in one command
prompt  We could do this online after allowing the users back on.
prompt  =========================================================

alter table pt_range_list split subpartition p200_p_def
        into (
                subpartition p200_p_0 values(0),
                subpartition p200_p_1 values(1),
                subpartition p200_p_2 values(2),
                subpartition p200_p_def
        )
;

alter table pt_range_list split subpartition p400_p_def
        into (
                subpartition p400_p_0 values(0),
                subpartition p400_p_1 values(1),
                subpartition p400_p_2 values(2),
                subpartition p400_p_def
        )
;

alter table pt_range_list split subpartition p600_p_def
        into (
                subpartition p600_p_0 values(0),
                subpartition p600_p_1 values(1),
                subpartition p600_p_2 values(2),
                subpartition p600_p_def
        )
;

Now, just to check that everything is behaving, let’s add a new partition, and check to see what partitions and subpartitions we end up with:


alter table pt_range_list add partition p800 values less than (800);

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

select  partition_name, subpartition_name, num_rows
from    user_tab_subpartitions
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

PARTITION_NAME         SUBPARTITION_NAME        NUM_ROWS
---------------------- ---------------------- ----------
P200                   P200_P_0                      100
                       P200_P_1                      100
                       P200_P_2                        0
                       P200_P_DEF                      0

P400                   P400_P_0                      100
                       P400_P_1                      100
                       P400_P_2                        0
                       P400_P_DEF                      0

P600                   P600_P_0                      100
                       P600_P_1                      100
                       P600_P_2                        0
                       P600_P_DEF                      0

P800                   P800_P_0                        0
                       P800_P_1                        0
                       P800_P_2                        0
                       P800_P_DEF                      0


16 rows selected.

And as a final note – if we decide we want to put it all back we could merge four subpartitions down to one subpartition with a single command – then loop through every partition in turn:


alter table pt_range_list
        merge subpartitions  p200_p_0, p200_p_1, p200_p_2, p200_p_def
        into  subpartition  p200_p_def
;
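
And a quick sketch of driving that merge through every partition – it assumes the pXXX_p_N naming convention used above, and has none of the error handling you’d want in production:


begin
        for r in (
                select  partition_name
                from    user_tab_partitions
                where   table_name = 'PT_RANGE_LIST'
                order by
                        partition_position
        ) loop
                execute immediate
                        'alter table pt_range_list merge subpartitions ' ||
                        r.partition_name || '_p_0, '   ||
                        r.partition_name || '_p_1, '   ||
                        r.partition_name || '_p_2, '   ||
                        r.partition_name || '_p_def '  ||
                        'into subpartition ' || r.partition_name || '_p_def';
        end loop;
end;
/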

And now I feel like I’m turning into Tim Hall – writing potentially useful demonstrations instead of trying to baffle people with rocket science. But I hope to get over that quite soon. Of course I have left out some important “real-world” details – particularly how you choose to handle indexes while doing the double-exchange. My view would be to take the local indexes with you on the exchange, bypass the global indexes on the exchange out, and be choosy about which global indexes to maintain on the exchange back in; but it all depends on how much downtime you have, how many indexes there are, and the state they’re likely to start or end in.

As ever it’s possible to start with a simple idea like this, then discover there are real-world complications that have to be dealt with. So there’s another article in the pipeline to handle a slightly more complex case. I’ll also be publishing a short note about the easy way of getting the job done from 18c onwards – if you’ve got the spare resources.

 

March 13, 2019

Hash Partitions

Filed under: Oracle,Partitioning,Statistics — Jonathan Lewis @ 1:13 pm GMT Mar 13,2019

Here’s an important thought if you’ve got any large tables which are purely hash partitioned. As a general guideline you should not need partition level stats on those tables. The principle of hash partitioned tables is that the rows are distributed uniformly and randomly based on the hash key so, with the assumption that the number of different hash keys is “large” compared to the number of partitions, any one partition should look the same as any other partition.

Consider, as a thought experiment (and as a warning), a table of product_deliveries which is hash partitioned by product_id with ca. 65,000 distinct products that have been hashed across 64 partitions. (Hash partitioning should always use a power of 2 for the partition count if you want the number of rows per partition to be roughly the same across all partitions – if you don’t pick a power of two then some of the partitions will be roughly twice the size of others.)
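As an aside, if you want to check the spread on a real system a simple sketch (after gathering stats, and using the hypothetical table name from the thought experiment) is to compare row counts across partitions:


select  partition_position, partition_name, num_rows
from    user_tab_partitions
where   table_name = 'PRODUCT_DELIVERIES'
order by
        partition_position
;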

Consider a query for “deliveries to Basingstoke” – in the absence of a histogram on the delivery location the optimizer will produce a cardinality estimate that is:

  • total rows in table / number of distinct delivery locations in table

Now consider a query for: “deliveries of product X to Basingstoke” – again in the absence of histograms. The optimizer could have two ways of calculating this cardinality:

  • total rows in table / (number of distinct products in table * number of distinct delivery locations in table)
  • total rows in relevant partition / (number of distinct products in relevant partition * number of distinct delivery locations in relevant partition)

But given the intent of hash partitioning to distribute data evenly we can make three further observations:

  1. the number of rows in any one partition should be very similar to the number of rows in the table divided by the number of partitions
  2. the number of distinct products in any one partition should be very similar to the number of products in the table divided by the number of partitions
  3. the number of distinct locations in any one partition should be very similar to the number of distinct locations in the whole table.

The second condition holds because product is the partition key, the third holds because location is not the partition key.

So we can rewrite the second, partition-oriented, formula as:

  • (total rows in table / number of partitions) / ((number of distinct products in table / number of partitions) * number of distinct locations in table)

which, re-arranging parentheses and cancelling common factors, reduces to:

  • total rows in table / (number of distinct products in table * number of distinct locations in table)

which matches the first formula. (Q.E.D.) In the absence of any statistics on hash partitions the optimizer can (ought to be able to) produce reasonable cardinality estimates based purely on table-level stats.
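A quick worked example with invented numbers: 640,000 rows, 65,000 distinct products, 100 distinct delivery locations, 64 partitions. The table-level arithmetic gives 640,000 / (65,000 * 100), which is roughly 0.098 rows; the partition-level arithmetic gives (640,000 / 64) / ((65,000 / 64) * 100) = 10,000 / 101,562.5, which is also roughly 0.098 rows. Same answer, as the algebra says it should be.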

In fact if you look back into the history of partitioning this observation is implicit in the early days of composite partitioning when the only option was for range/hash composite partitions – the optimizer never used sub-partition stats to calculate costs or cardinality it used only partition-level statistics. (And it was several years before the optimizer caught up to the fact that (e.g.) range/list composite partitioning might actually need to do arithmetic based on subpartition stats.)

I did say that the example was also a warning. Hash partitioning is “expected” to have a large number of distinct key values compared to the number of partitions. (If you don’t meet this requirement then possibly you should be using list partitioning). There’s also a “uniformity” assumption built into the arithmetic (both the basic arithmetic and the hand-waving discussion I produced above). Just imagine that your company supplies a handful of products that for some strange reason are incredibly popular in Basingstoke. If this is the case then the assumption that “all partitions look alike” is weakened, and you would have to consider the possibility that you might need a workaround to address the poor cardinality estimates that the variation could produce.

A pattern of this type has two generic effects on the optimizer, of course. First is the simple skew in the data – to have a significant impact the number of rows for the problem products would have to be much larger than average, which suggests the need for a suitably crafted histogram; secondly there’s an implied correlation between a few products and Basingstoke, so you might even end up creating a column group and manually coding a histogram on it to capture the correlation.
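If you did go down the column group route, here’s a sketch of the simpler, gather-based mechanics (table and column names are the hypothetical ones from the thought experiment; hand-coding the histogram with dbms_stats.set_column_stats() would be the fully manual alternative):


declare
        m_extension     varchar2(40);
begin
        -- create the column group extension
        m_extension := dbms_stats.create_extended_stats(
                ownname         => user,
                tabname         => 'product_deliveries',
                extension       => '(product_id, delivery_location)'
        );

        -- gather a histogram on the column group
        dbms_stats.gather_table_stats(
                ownname         => user,
                tabname         => 'product_deliveries',
                method_opt      => 'for columns (product_id, delivery_location) size 254'
        );
end;
/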

 

March 6, 2019

12c Snapshots

Filed under: 12c,Oracle,Partitioning,Performance — Jonathan Lewis @ 10:35 am GMT Mar 6,2019

I published a note a few years ago about using the 12c “with function” mechanism for writing simple SQL statements to take deltas of dynamic performance views. The example I supplied was for v$event_histogram but I’ve just been prompted by a question on ODC to supply a couple more – v$session_event and v$sesstat (joined to v$statname) – so that you can use one session to get an idea of the work done and time spent by another session. The first script reports wait time:


rem
rem     Program:        12c_with_function_2.sql
rem     Dated:          July 2013
rem     Author:         J.P.Lewis
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session WAIT time
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 121, 127'

set timing on
set sqlterminator off

set linesize 180

break on sid skip 1

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        sum(total_waits),
        sum(total_timeouts), 
        sum(time_waited), 
        event
from    (
        select
                sid, event_id, 
                -total_waits total_waits, 
                -total_timeouts total_timeouts, 
                -time_waited time_waited, 
                -time_waited_micro time_waited_micro, 
                event
        from    v$session_event
        where   sid in ( &m_sid_list )
        union all
        select
                null, null, null, null, null, wait_row(&m_snap_time, 0), null
        from    dual
        union all
        select
                sid, event_id, total_waits, total_timeouts, time_waited, time_waited_micro, event
        from    v$session_event
        where   sid in ( &m_sid_list )
        )
where
        time_waited_micro != 0
group by
        sid, event_id, event
having
        sum(time_waited) != 0
order by
        sid, sum(time_waited) desc
/


And this one reports session activity:

rem
rem     Program:        12c_with_function_3.sql
rem     Dated:          July 2013
rem     Author:         J.P.Lewis
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session stats
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 13, 357'


set timing on
set sqlterminator off

set linesize 180

break on sid skip 1
column name format a64

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        name,
        sum(value)
from    (
        select
                ss.sid, 
                ss.statistic#,
                sn.name,
                -ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        union all
        select
                null, null, null, wait_row(&m_snap_time, 0)
        from    dual
        union all
        select
                ss.sid, ss.statistic#, sn.name, ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        )
where
        value != 0
group by
        sid, statistic#, name
having
        sum(value) != 0
order by
        sid, statistic#
/


You’ll notice that I’ve used dbms_lock.sleep() in my wait function – and the session running the SQL can be granted the execute privilege on the package through a role to make this work – but if you’re running Oracle 18 then you’ve probably noticed that the sleep() function and procedure have been copied to the dbms_session package.
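In concrete terms – the role name here is purely hypothetical:


-- pre-18c: allow the monitoring sessions to call dbms_lock.sleep() via a role
grant execute on sys.dbms_lock to monitoring_role;

-- from 18c onwards the function body could simply be:
--         dbms_session.sleep(i_secs);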

 

June 1, 2018

Index Bouncy Scan 4

Filed under: 12c,Execution plans,Index skip scan,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 9:19 am BST Jun 1,2018

There’s always another hurdle to overcome. After I’d finished writing up the “index bouncy scan” as an efficient probing mechanism to find the combinations of the first two columns (both declared not null) of a very large index, a follow-up question appeared almost immediately: “What if it’s a partitioned index?”

The difficulty introduced by “typical” partitioned indexes is that the smallest value of the leading column might appear in any of the partitions and the combination of that value and the smallest value for the second column might not appear in all the partitions where that first smallest value appears.

Consider a table of 10 partitions and a locally partitioned index on (val1, val2) where neither column is the partition key. The smallest value of val1 – call it k1 – may appear only in partitions 4, 7, 8, 9, 10; the lowest combination of (val1, val2) – call it (k1, k2) – may appear only in partitions 8 and 10. In a global (or globally partitioned) index the pair (k1, k2) would be at the low (leftmost) end of the index, but to find the pair in a locally partitioned index we have to probe the leftmost end of 10 separate index partitions – and once we’ve done that each “bounce” requires us to probe 10 index partitions for the first (val1, val2) pair where val1 = k1 and val2 is just greater than k2, or val1 is just greater than k1 and val2 is the minimum for that value of val1. The more partitions we have the greater the number of index partitions we have to probe at each step, and the more likely it is that we ought to switch to a brute force index fast full scan with aggregate.

Here’s the starting point for solving the problem (maybe) – I’ll create a simple partitioned table, and use the “bouncy scan” code from the earlier posting with the table and column names adjusted accordingly:


rem
rem     Script:         bouncy_index_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10)  -- > comment to avoid WordPress format issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

prompt  ==================================================
prompt  Make some rows in the last partition have a status
prompt  that won't be found in the first partition.
prompt  ==================================================

column namespace format 99999999
column partition_name new_value m_part

select  partition_name
from    user_tab_partitions
where   table_name = 'PT1'
order by
        partition_position
;

update pt1 partition (&m_part) set status = 'MISSING' where rownum <= 10;

select
        dbms_mview.pmarker(rowid), status, namespace
from    pt1
where   status = 'MISSING'
;

I’ve created a hash partitioned copy of view all_objects, duplicating it 10 times, and created a local index on the columns (status, namespace). My data has two values for status, ‘VALID’ and ‘INVALID’, and there are about 10 values for the namespace. I’ve then updated a few rows in the last partition, giving them a status value that is between the two current values – this is just one little test case to help me check that my code is going to catch all values even if they don’t appear in the first table partition.

Here’s the query from the earlier posting – and it does get the right results – followed by the execution plan:


alter session set statistics_level = all;

set serveroutput off
set linesize 180
set pagesize 60

prompt  =============================================================
prompt  Original Query, showing expensive access for driving minimums
prompt  =============================================================

with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select
                        /*+ index(pt1) no_index_ffs(pt1) */
                        status, namespace,
                        row_number() over(order by status, namespace) rn
                from    pt1
        )
        where
                rn = 1
        union all
        select
                v1.status, v1.namespace
        from    bounce1,
                lateral (
                              select  /*+ index(pt1) no_index_ffs(pt1) no_decorrelate */
                                      pt1.status, pt1.namespace
                              from    pt1
                              where   pt1.status > bounce1.status
                              and     rownum = 1
                ) v1
        where   bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last outline'));


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        | 16378 (100)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 | 16378   (4)|       |       |     10 |00:00:00.58 |    1869 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 | 16377   (4)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     12 |00:00:00.58 |    1869 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |  8157   (4)|       |       |      2 |00:00:00.58 |    1747 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      2 |00:00:00.58 |    1747 |  1024 |  1024 | 2048  (0)|
|*  6 |       VIEW                                   |                 |      1 |      1 |  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |                 |      1 |    617K|  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |  2048 |  2048 | 2048  (0)|
|   8 |         PARTITION HASH ALL                   |                 |      1 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.34 |    1732 |       |       |          |
|   9 |          INDEX FULL SCAN                     | PT1_I1          |      4 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.15 |    1732 |       |       |          |
|  10 |       NESTED LOOPS                           |                 |      2 |      1 |  4110   (4)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  11 |        RECURSIVE WITH PUMP                   |                 |      2 |        |            |       |       |      2 |00:00:00.01 |       0 |       |       |          |
|  12 |        VIEW                                  | VW_LAT_1BBF5C63 |      2 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|* 13 |         COUNT STOPKEY                        |                 |      2 |        |            |       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  14 |          PARTITION HASH ALL                  |                 |      2 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|* 15 |           INDEX RANGE SCAN                   | PT1_I1          |      5 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|  16 |     SORT AGGREGATE                           |                 |     10 |      1 |            |       |       |     10 |00:00:00.01 |     122 |       |       |          |
|  17 |      PARTITION HASH ALL                      |                 |     10 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  18 |       FIRST ROW                              |                 |     40 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     122 |       |       |          |
|* 19 |        INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     40 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  20 |     RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     10 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
   4 - filter("BOUNCE1"."STATUS" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "STATUS","NAMESPACE") .le. 1)     -- changed "less than or equal to" symbol to .le.
  13 - filter(ROWNUM=1)
  15 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  19 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

In terms of time the query doesn’t seem to have done too badly – but I’m only using a small data set and we can see from the numbers that we haven’t produced an efficient plan. Operations 8 and 9 tell us that we’ve done an index full scan on every single partition before passing the data up for a window sort operation. That’s clearly a bad thing, but we did have an index() hint at that bit of code that worked very well for the simple (global) index so maybe we should have taken that out before testing (except it doesn’t help much to do so since Oracle still scans all 617K rows, changing to an index fast full scan).

Apart from that massive load the rest of the query looks quite efficient. We keep seeing “partition hash all” of course – whatever we do we tend to do it to 4 separate partitions one after the other – but everything else we do looks rather efficient. But there is another problem – and this is where the importance of inserting the rows with status = ‘MISSING’ shows up: this query didn’t find them! We have a predicate “rownum = 1” in the second half of the bounce1 recursive subquery and because we’re using a partitioned index we’ve managed to find a row that looks appropriate in an early partition when the row we really needed doesn’t appear until the last partition.

Let’s return to this problem later – first we want to check if the rest of the query will run efficiently and give us the right answer if we can find some way of getting the starting values; so let’s use a strategy we’ve used before – replace the bounce1 subquery with a union all select from dual:


with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select 'INVALID' status, 1 namespace from dual
                union all
                select 'MISSING', 4 from dual
                union all
                select 'VALID', 1 from dual
        )
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last partition outline'));

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |        |      1 |        |    76 (100)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   1 |  SORT ORDER BY                             |        |      1 |      6 |    76   (2)|       |       |     11 |00:00:00.01 |     132 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |        |      1 |      6 |    75   (0)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|        |      1 |        |            |       |       |     14 |00:00:00.01 |     132 |  1024 |  1024 |          |
|   4 |     VIEW                                   |        |      1 |      3 |     6   (0)|       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   5 |      UNION-ALL                             |        |      1 |        |            |       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   6 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   7 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |     SORT AGGREGATE                         |        |     11 |      1 |            |       |       |     11 |00:00:00.01 |     132 |       |       |          |
|  10 |      PARTITION HASH ALL                    |        |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  11 |       FIRST ROW                            |        |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     132 |       |       |          |
|* 12 |        INDEX RANGE SCAN (MIN/MAX)          | PT1_I1 |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  13 |     RECURSIVE WITH PUMP                    |        |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
  12 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

This gets us the right answer, very efficiently. There are only 11 rows in the result set and we have an average of 12 buffer visits per row – which is reasonable given that we (probably) have to probe 4 index partitions for every row. So that’s 11 rows * 4 probes * 3 buffer visits per probe – which seems just about optimal.

The next step is to figure out a way of getting the (three in our case) starting points while using a partitioned index. Here’s a query we can use for bounce1:


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
)
select * from bounce1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

It looks a little convoluted with all the inline select statements but they all do very small amounts of work and they’re only reading the index leaf blocks that you have to read. We know from yesterday’s post that Oracle can execute the scalar subqueries at lines 3 and 4 very efficiently; we can hope (and check) that the lateral() subqueries driven by the single values from the recursive row in bounce1 will operate just as efficiently – and here’s the plan:


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   166 (100)|       |       |      4 |00:00:00.01 |     132 |
|   1 |  VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   3 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   4 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   5 |      INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   6 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   7 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   8 |      FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  9 |       INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  11 |         PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  12 |          INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  13 |    FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  14 |    NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  15 |     NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  16 |      RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  17 |      VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |       SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  19 |        PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  20 |         FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 21 |          INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |     VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  23 |      SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  24 |       PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  25 |        FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 26 |         INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |           PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |            FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |             INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("STATUS"=)
  21 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  26 - access("PT1"."STATUS"=)
  30 - access("PT2"."STATUS">:B1)

Although we have done lots of individual probes into the index, they have all been very efficient, using a min/max access path and averaging about 3 buffer visits per probe. So we can now insert this new bounce1 subquery into the previous query in place of the “union all of dual” and check that the two pieces of the query cooperate.


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace from bounce1
        union all
        select  bounce2.status, (select min(t.namespace) namespace from pt1 t where t.namespace > bounce2.namespace and status=bounce2.status) namespace
        from    bounce2
        where   bounce2.status is not null
        and     bounce2.namespace is not null
)
select  *
from    bounce2
where   namespace is not null
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                 |      1 |        |   396 (100)|       |       |     11 |00:00:00.01 |     266 |
|*  1 |  VIEW                                       |                 |      1 |      4 |   396   (1)|       |       |     11 |00:00:00.01 |     266 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     15 |00:00:00.01 |     266 |
|   3 |    VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   5 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   8 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   9 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  12 |          SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  13 |           PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  14 |            INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  15 |      FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  16 |      NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  17 |       NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |        RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  19 |        VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  20 |         SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  21 |          PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |           FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 23 |            INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  24 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  25 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  26 |         PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 28 |           INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  29 |            SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  30 |             PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  31 |              FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 32 |               INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  33 |    SORT AGGREGATE                           |                 |     11 |      1 |            |       |       |     11 |00:00:00.01 |     134 |
|  34 |     PARTITION HASH ALL                      |                 |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  35 |      FIRST ROW                              |                 |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     134 |
|* 36 |       INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  37 |    RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NAMESPACE" IS NOT NULL)
  11 - access("STATUS"=)
  23 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  28 - access("PT1"."STATUS"=)
  32 - access("PT2"."STATUS">:B1)
  36 - access("STATUS"=:B1 AND "T"."NAMESPACE">:B2)

Job done. We’ve found the distinct set of pairs without having to scan the entire index. We’ve found 11 pairs at a total cost of 266 buffer gets. For comparative purposes, the query totalled 56 buffer visits when I recreated the table as a non-partitioned table (again updating a few rows to status = ‘MISSING’).
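If you want to repeat that comparison, a sketch along the following lines would do; the table and index names here are mine and the number of modified rows is purely illustrative:

create table t1 nologging as select * from pt1;

update t1 set status = 'MISSING' where rownum <= 5;
commit;

create index t1_i1 on t1(status, namespace) nologging;

execute dbms_stats.gather_table_stats(null,'t1',method_opt=>'for all columns size 1')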

It’s important to note that this query can only work this efficiently in 12.2 (and possibly in a suitably patched 11.2.0.4) because of the optimizer’s ability to use the min/max operation for queries like: “select max(col2) where col1 = (select max(…))”. When I ran the final query on 12.1.0.2 the execution plan changed around lines 11 and 28: where 12.2.0.1 could use the aggregate subquery to drive the min/max scan, 12.1.0.2 did a real range scan with an aggregate (which was extremely expensive at one point).
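To be clear about the shape of query I mean, here’s a minimal sketch against a hypothetical table tx with an index on (col1, col2):

select  max(col2)
from    tx
where   col1 = (select max(col1) from tx)
;

Here’s the relevant fragment of the 12.1.0.2 plan: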

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|  23 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |  1206   (2)|       |       |      3 |00:00:05.43 |    2414 |
|  24 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:05.43 |    2414 |
|  25 |         PARTITION HASH ALL                  |                 |      3 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:05.84 |    2414 |
|* 26 |          INDEX RANGE SCAN                   | PT1_I1          |     12 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:02.03 |    2414 |
|  27 |           SORT AGGREGATE                    |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |            PARTITION HASH ALL               |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |             FIRST ROW                       |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |              INDEX RANGE SCAN (MIN/MAX)     | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this makes a dramatic difference to the work Oracle has to do – in this case 2,414 buffer gets and 845K rows examined. As I said in yesterday’s post, there’s a patch for 11.2.0.4, so there could be a patch for 12.1.0.2 if you ask for it, but it looks like no-one has done so yet.

Footnote:

I could have used a lateral() view in the first half of bounce1 to reduce the reported number of probes of pt1_i1 in the plan – but it made the code extremely messy; I had to include a /*+ no_decorrelate */ hint in it, and it increased the number of buffer visits slightly because the optimizer seemed to lose the option for a min/max scan in this particular lateral join.
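For the curious, the rewrite would have looked something like the following; this is just a sketch of the idea (the version I actually tried was messier than this):

select  v1.status, v2.namespace
from    (
                select  min(status) status
                from    pt1
        )       v1,
        lateral(
                select  /*+ index(pt1) no_decorrelate */
                        min(pt1.namespace) namespace
                from    pt1
                where   pt1.status = v1.status
        )       v2
;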


May 31, 2018

Min/Max upgrade

Filed under: 12c,Indexing,Oracle,Partitioning,Performance,Upgrades — Jonathan Lewis @ 2:13 pm BST May 31,2018

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10) -- > comment to avoid format wordpress issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate: Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index holds the low values and the “right hand” end holds the high values. So, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.

Things get a little messy, though, when the index is locally partitioned, your query isn’t about the partition key, and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago it got a lot cleverer: it was enhanced to do a min/max scan on each partition in turn, getting one value per partition very efficiently, then aggregating across those values to find the global minimum.
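In effect the enhanced strategy gets the same result as a query like the following, but in a single operation; the hash partition names are system-generated, so the sys_p names below are purely illustrative:

select  min(status)
from    (
        select min(status) status from pt1 partition (sys_p101)
        union all
        select min(status) status from pt1 partition (sys_p102)
        union all
        select min(status) status from pt1 partition (sys_p103)
        union all
        select min(status) status from pt1 partition (sys_p104)
        )
;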

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery and has passed that single value up to the main query. The main query has then used a brute-force approach, searching the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and aggregating them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know that simple tables and partitioned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to handle queries of the “most recent occurrence” type.
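As a sketch of the kind of workaround I mean for versions without the enhancement: split the work into two steps in PL/SQL so that the second query sees a single known value (the variable names here are mine):

declare
        m_status        pt1.status%type;
        m_namespace     pt1.namespace%type;
begin
--      One cheap min/max probe per partition to find the lowest status
        select min(status) into m_status from pt1;

--      With a single known value the "index range scan (min/max)" path is available
        select  min(namespace)
        into    m_namespace
        from    pt1
        where   status = m_status;

        dbms_output.put_line(m_namespace);
end;
/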

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery”, which can be set in the startup file, at the system level, or in the session.
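For example, to switch the feature off and on again at the session level; a sketch only, and since “_fix_control” is a hidden parameter you should treat it as something to use under guidance from Oracle support:

-- disable the enhancement for the current session
alter session set "_fix_control" = '18915345:0';

-- restore the default behaviour
alter session set "_fix_control" = '18915345:1';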

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2. There are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – if you think you need it you can always try raising an SR.

