Oracle Scratchpad

August 9, 2019

Split Partition

Filed under: Oracle,Partitioning,Performance,Tuning — Jonathan Lewis @ 1:02 pm BST Aug 9,2019

This is a little case study on “pre-emptive trouble-shooting”, based on a recent question on the ODC database forum asking about splitting a range-based partition into two at a value above the current highest value recorded in a max_value partition.

The general principle for splitting (range-based) partitions is that if the split point is above the current high value Oracle will recognise that it can simply rename the existing partition and create a new, empty partition, leaving all the indexes (including the global and globally partitioned indexes) in a valid state. There are, however, three little wrinkles to this particular request:

  • first is that the question relates to a system running 10g
  • second is that there is a LOB column in the table
  • third is that the target is to have the new (higher value) partition(s) in a different tablespace

It’s quite possible that 10g won’t have all the capabilities of partition maintenance of newer versions, and if anything is going to go wrong LOBs are always a highly dependable point of failure, and since all the examples in the manuals tend to be very simple examples maybe any attempt to introduce complications like tablespace specification will cause problems.

So, before you risk doing the job in production, what are you going to test?

In Oracle terms we want to check the following

  • Will Oracle have silently copied/rebuilt some segments rather than simply renaming old segments and creating new, empty segments.
  • Will the segments end up where we want them
  • Will all the indexes stay valid

To get things going, the OP had supplied a framework for the table and told us about two indexes, and had then given us two possible SQL statements to do the split, stating they he (or she) had tested them and they both worked. Here’s the SQL (with a few tweaks) that creates the table and indexes. I’ve also added some data – inserting one row into each partition.

rem
rem     Script:         split_pt_lob.sql
rem     Author:         Jonathan Lewis
rem     Dated:          July 2019
rem
rem     Last tested 
rem             12.2.0.1
rem             10.2.0.5
rem

define m_old_ts = 'test_8k'
define m_new_ts = 'assm_2'

drop table part_tab purge;

create table part_tab(
  pt_id            NUMBER,
  pt_name          VARCHAR2(30),
  pt_date          DATE default SYSDATE,
  pt_lob           CLOB,
  pt_status        VARCHAR2(2)
)
tablespace &m_old_ts
lob(pt_lob) store as (tablespace &m_old_ts)
partition by range (pt_date)
(
  partition PRT1 values less than (TO_DATE('2012-01-01', 'YYYY-MM-DD')),
  partition PRT2 values less than (TO_DATE('2014-09-01', 'YYYY-MM-DD')),
  partition PRT_MAX values less than (MAXVALUE)
)
/

alter table part_tab
add constraint pt_pk primary key(pt_id)
/

create index pt_i1 on part_tab(pt_date, pt_name) local
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    1,'one',to_date('01-Jan-2011'),rpad('x',4000),'X'
)
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    2,'two',to_date('01-Jan-2013'),rpad('x',4000),'X'
)cascade=>trueee
/

insert into part_tab(
    pt_id, pt_name, pt_date, pt_lob, pt_status
)
values(
    3,'three',to_date('01-Jan-2015'),rpad('x',4000),'X'
)
/

commit;

execute dbms_stats.gather_table_stats(null,'part_tab',cascade=>true,granularity=>'ALL')

We were told that

The table has
– Primary Key on pt_id column with unique index (1 Different table has FK constraint that refers to this PK)
– Composite index on pt_date and pt_name columns

This is why I’ve added a primary key constraint (which will generate a global index) and created an index on (pt_date,pt_name) – which I’ve created as a local index since it contains the partitioning column.

The description of the requirement was:

  • The Task is to split partition(PRT_MAX) to a different tablespace
  • New partition(s) won’t have data at the moment of creation

And the two “tested” strategies were:

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts,
        PARTITION PRT_MAX tablespace &m_new_ts
);

alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
        PARTITION PRT3    tablespace &m_old_ts LOB (pt_lob) store as (TABLESPACE &m_old_ts), 
        PARTITION PRT_MAX tablespace &m_new_ts LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
;
 

If we’re going to test these strategies properly we will need queries similar to the following:


break on object_name skip 1
select object_name, subobject_name, object_id, data_object_id  from user_objects order by object_name, subobject_name;

break on index_name skip 1
select index_name, status from user_indexes;
select index_name, partition_name, status from user_ind_partitions order by index_name, partition_name;

break on segment_name skip 1
select segment_name, partition_name, tablespace_name from user_segments order by segment_name, partition_name;

First – what are the object_id and data_object_id for each object before and after the split. Have we created new “data objects” while splitting, or has an existing data (physical) object simply changed its name.

Secondly – are there any indexes or index partitions that are no longer valid

Finally – which tablespaces do physical objects reside in.

On a test run of the first, simpler, split statement here are the before and after results for the object_id and data_object_id, followed by the post-split results for index and segment details:


Before Split
============

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT_MAX                     23679          23679
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT_MAX                     23692          23692
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P254                 23687          23687

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P251                23683          23683
                                                             23680          23680

After split
===========

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23677          23677
                                 PRT2                        23678          23678
                                 PRT3                        23693          23679
                                 PRT_MAX                     23679          23694
                                                             23676

PT_I1                            PRT1                        23690          23690
                                 PRT2                        23691          23691
                                 PRT3                        23700          23692
                                 PRT_MAX                     23699          23699
                                                             23689

PT_PK                                                        23688          23688

SYS_IL0000023676C00004$$         SYS_IL_P252                 23685          23685
                                 SYS_IL_P253                 23686          23686
                                 SYS_IL_P257                 23697          23687
                                 SYS_IL_P258                 23698          23698

SYS_LOB0000023676C00004$$        SYS_LOB_P249                23681          23681
                                 SYS_LOB_P250                23682          23682
                                 SYS_LOB_P255                23695          23683
                                 SYS_LOB_P256                23696          23696
                                                             23680          23680


INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023676C00004$$         N/A


INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023676C00004$$         SYS_IL_P252            USABLE
                                 SYS_IL_P253            USABLE
                                 SYS_IL_P257            USABLE
                                 SYS_IL_P258            USABLE


SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023676C00004$$  SYS_IL_P252            TEST_8K
                          SYS_IL_P253            TEST_8K
                          SYS_IL_P257            TEST_8K
                          SYS_IL_P258            TEST_8K

SYS_LOB0000023676C00004$$ SYS_LOB_P249           TEST_8K
                          SYS_LOB_P250           TEST_8K
                          SYS_LOB_P255           TEST_8K
                          SYS_LOB_P256           TEST_8K

Before the split partition PRT_MAX – with 4 segments: table, index, LOB, LOBINDEX – has object_id = data_object_id, with the values: 23679 (table), 23692 (index), 23683 (LOB), 23687 (LOBINDEX); and after the split these reappear as the data_object_id values for partition PRT3 (though the object_id values are larger than the data_object_id values) – so we infer that Oracle has simply renamed the various PRT_MAX objects to PRT3 and created new, empty PRT_MAX objects.

We can also see that all the indexes (including the global primary key index) have remained valid. We also note that the data_object_id of the primary key index has not changed, so Oracle didn’t have to rebuild it to ensure that it stayed valid.

There is a problem, though, the LOB segment and LOBINDEX segments for the new PRT_MAX partition are not in the desired target tablespace. So we need to check the effects of the second version of the split command where we add the specification of the LOB tablespaces. This is what we get – after rerunning the entire test script from scratch:


OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT_MAX                     23729          23729
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT_MAX                     23742          23742
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P274                 23737          23737

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P271                23733          23733
                                                             23730          23730


OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23727          23727
                                 PRT2                        23728          23728
                                 PRT3                        23743          23743
                                 PRT_MAX                     23729          23744
                                                             23726

PT_I1                            PRT1                        23740          23740
                                 PRT2                        23741          23741
                                 PRT3                        23750          23750
                                 PRT_MAX                     23749          23749
                                                             23739

PT_PK                                                        23738          23738

SYS_IL0000023726C00004$$         SYS_IL_P272                 23735          23735
                                 SYS_IL_P273                 23736          23736
                                 SYS_IL_P277                 23747          23747
                                 SYS_IL_P278                 23748          23748

SYS_LOB0000023726C00004$$        SYS_LOB_P269                23731          23731
                                 SYS_LOB_P270                23732          23732
                                 SYS_LOB_P275                23745          23745
                                 SYS_LOB_P276                23746          23746
                                                             23730          23730

INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            UNUSABLE
SYS_IL0000023726C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   UNUSABLE
                                 PRT_MAX                USABLE

SYS_IL0000023726C00004$$         SYS_IL_P272            USABLE
                                 SYS_IL_P273            USABLE
                                 SYS_IL_P277            USABLE
                                 SYS_IL_P278            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023726C00004$$  SYS_IL_P272            TEST_8K
                          SYS_IL_P273            TEST_8K
                          SYS_IL_P277            TEST_8K
                          SYS_IL_P278            ASSM_2

SYS_LOB0000023726C00004$$ SYS_LOB_P269           TEST_8K
                          SYS_LOB_P270           TEST_8K
                          SYS_LOB_P275           TEST_8K
                          SYS_LOB_P276           ASSM_2


Before looking at the more complex details the first thing that leaps out to hit the eye is the word UNUSABLE – which appears for the status of the (global) primary key index and the PRT3 subpartition. The (empty) PRT_MAX LOB and LOBINDEX partitions are where we wanted them, but by specifying the location we seem to have broken two index segments that will need to be rebuilt.

It gets worse, because if we check the data_object_id of the original PRT_MAX partition (23729) and its matching index partition (23742) we see that they don’t correspond to the (new) PRT3 data_object_id values which are 23743 and 23750 respectively – the data has been physically copied from one data object to another completely unnecessarily; moreover the same applies to the LOB and LOBINDEX segments – the data object ids for the PRT_MAX LOB and LOBINDEX partitions were 23733 and 23737, the new PRT3 data object ids are 23746 and 23747.

If you did a test with only a tiny data set you might not notice the implicit threat that these changes in data_object_id tell you about – you’re going to be copying the whole LOB segment when you don’t need to.

Happy Ending (maybe)

A quick check with 12.2 suggested that Oracle had got much better at detecting that it didn’t need to copy LOB data and invalidate indexes with the second form of the code; but the OP was on 10g – so that’s not much help. However it was the thought that Oracle might misbehave when you specifyied tablespaces that made me run up this test – in particular I had wondered if specifying a tablespace for the partition that would end up holding the existing data might trigger an accident, so here’s a third variant of the split statement I tested, with the results on the indexes, segments, and data objects. Note that I specify the tablespace only for the new (empty) segments:


alter table part_tab split partition PRT_MAX at(TO_DATE('2019-08-01', 'YYYY-MM-DD')) into (
    PARTITION PRT3,
    PARTITION PRT_MAX tablespace &m_new_ts  LOB (pt_lob) store as (TABLESPACE &m_new_ts)
)
/

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT_MAX                     23754          23754
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT_MAX                     23767          23767
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P284                 23762          23762

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P281                23758          23758
                                                             23755          23755

OBJECT_NAME                      SUBOBJECT_NAME          OBJECT_ID DATA_OBJECT_ID
-------------------------------- ---------------------- ---------- --------------
PART_TAB                         PRT1                        23752          23752
                                 PRT2                        23753          23753
                                 PRT3                        23768          23754
                                 PRT_MAX                     23754          23769
                                                             23751

PT_I1                            PRT1                        23765          23765
                                 PRT2                        23766          23766
                                 PRT3                        23775          23767
                                 PRT_MAX                     23774          23774
                                                             23764

PT_PK                                                        23763          23763

SYS_IL0000023751C00004$$         SYS_IL_P282                 23760          23760
                                 SYS_IL_P283                 23761          23761
                                 SYS_IL_P287                 23772          23762
                                 SYS_IL_P288                 23773          23773

SYS_LOB0000023751C00004$$        SYS_LOB_P279                23756          23756
                                 SYS_LOB_P280                23757          23757
                                 SYS_LOB_P285                23770          23758
                                 SYS_LOB_P286                23771          23771
                                                             23755          23755
INDEX_NAME                       STATUS
-------------------------------- --------
PT_I1                            N/A
PT_PK                            VALID
SYS_IL0000023751C00004$$         N/A

INDEX_NAME                       PARTITION_NAME         STATUS
-------------------------------- ---------------------- --------
PT_I1                            PRT1                   USABLE
                                 PRT2                   USABLE
                                 PRT3                   USABLE
                                 PRT_MAX                USABLE

SYS_IL0000023751C00004$$         SYS_IL_P282            USABLE
                                 SYS_IL_P283            USABLE
                                 SYS_IL_P287            USABLE
                                 SYS_IL_P288            USABLE

SEGMENT_NAME              PARTITION_NAME         TABLESPACE_NAME
------------------------- ---------------------- ------------------------------
PART_TAB                  PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_I1                     PRT1                   TEST_8K
                          PRT2                   TEST_8K
                          PRT3                   TEST_8K
                          PRT_MAX                ASSM_2

PT_PK                                            TEST_8K

SYS_IL0000023751C00004$$  SYS_IL_P282            TEST_8K
                          SYS_IL_P283            TEST_8K
                          SYS_IL_P287            TEST_8K
                          SYS_IL_P288            ASSM_2

SYS_LOB0000023751C00004$$ SYS_LOB_P279           TEST_8K
                          SYS_LOB_P280           TEST_8K
                          SYS_LOB_P285           TEST_8K
                          SYS_LOB_P286           ASSM_2

All the index and index partitions stay valid; the new empty segments all end up in the target tablespace, and all the data object ids for the old PRT_MAX partitions becaome the data object ids for the new PRT3 partitions. Everything we want, and no physical rebuilds of any data sets.

Moral:

When you’re testing, especially when you’re doing a small test while anticipating a big data set, don’t rely on the clock; check the data dictionary (and trace files, if necessary) carefully to find out what activity actually took place.

Footnote:

It’s possible that there are ways to fiddle around with the various default attributes of the partitioned table to get the same effect – but since 12.2 is much better behaved anyway there’s no point in me spending more time looking for alternative solutions to a 10g problem.

 

May 27, 2019

Re-partitioning 2

Filed under: 12c,Infrastructure,Oracle,Partitioning,Uncategorized — Jonathan Lewis @ 8:20 pm BST May 27,2019

Last week I wrote a note about turning a range-partitioned table into a range/list composite partitioned table using features included in 12.2 of Oracle. But my example was really just an outline of the method and bypassed a number of the little extra problems you’re likely to see in a real-world system, so in this note I’m going to bring in an issue that you might run into – and which I’ve seen appearing a number of times: ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION.

It’s often the case that a system has a partitioned table that’s been around for a long time, and over its lifetime it may have had (real or virtual) columns added, made inivisble, dropped, or mark unused. As a result you may find that the apparent definition of the table is not the same as the real definition of the table – and that’s why Oracle has given us (in 12c) the option to “create table for exchange”.

You might like to read a MoS note giving you one example of a problem with creating an exchange table prior to this new feature. ORA-14097 At Exchange Partition After Adding Column With Default Value (Doc ID 1334763.1) I’ve created a little model by cloning the code from that note.


rem
rem     Script:         pt_exchange_problem.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table mtab (pcol number)
partition by list (pcol) (
        partition p1 values (1),
        partition p2 values (2)
);

alter table mtab add col2 number default 0 not null;

prompt  ========================================
prompt  Traditional creation method => ORA-14097
prompt  ========================================

create table mtab_p2 as select * from mtab where 1=0;
alter table mtab exchange partition P2 with table mtab_p2;

prompt  ===================
prompt  Create for exchange
prompt  ===================

drop table mtab_p2 purge;
create table mtab_p2 for exchange with table mtab;
alter table mtab exchange partition P2 with table mtab_p2;

[/sourcecode}


Here's the output from running this on an instance of 18.3


Table created.

Table altered.

========================================
Traditional creation method => ORA-14097
========================================

Table created.

alter table mtab exchange partition P2 with table mtab_p2
*
ERROR at line 1:
ORA-14097: column type or size mismatch in ALTER TABLE EXCHANGE PARTITION

===================
Create for exchange
===================

Table dropped.


Table created.


Table altered.

So we don’t have to worry about problems creating an exchange table in Oracle 12c or later. But we do still have a problem if we’re trying to convert our range-partitioned table into a range/list composite partitioned table by doing using the “double-exchange” method. In my simple example I used a “create table” statement to create an empty table that we could exchange into; but without another special version of a “create table” command I won’t be able to create a composite partitioned table that is compatible with the simple table that I want to use as my intermediate table.

Here’s the solution to that problem – first in a thumbnail sketch:

  • create a table for exchange (call it table C)
  • alter table C modify to change it to a composite partitioned table with one subpartition per partition
  • create a table for exchange (call it table E)
  • Use table E to exchange partitions from the original table to the (now-partitioned) table C
  • Split each partition of table C into the specific subpartitions required

And now some code to work through the details – first the code to create and populate the partitioned table.


rem
rem     Script:         pt_comp_from_pt_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

drop table t purge;
drop table pt_range purge;
drop table pt_range_list purge;

-- @@setup

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum-1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
	rownum <= 600 -- > comment to avoid WordPress format issue
;

commit;

Then some code to create the beginnings of the target composite partitioned table. We create a simple heap table “for exchange”, then modify it to be a composite partitioned table with a named starting partition and high_value and a template defining a single subpartition then, as a variant on the example from last week, specifying interval partitioning.


prompt	==========================================
prompt	First nice feature - "create for exchange"
prompt	==========================================

create table pt_range_list for exchange with table pt_range;

prompt	============================================
prompt	Now alter the table to composite partitioned
prompt	============================================

alter table pt_range_list modify
partition by range(id) interval (200)
subpartition by list (grp) 
subpartition template (
        subpartition p_def      values(default)
)
(
	partition p200 values less than (200)
)
;

If you want to do the conversion from range partitioning to interval partitioning you will have to check very carefully that your original table will be able to convert safely – which means you’ll need to check that the “high_value” values for the partitions are properly spaced to match the interval you’ve defined and (as a special requirement for the conversion) there are no omissions from the current list of high values. If your original table doesn’t match these requirement exactly you may end up trying to exchange data into a partition where it doesn’t belong; for example, if my original table had partitions with high value of 200, 600, 800 then there may be values in the 200-399 range currently stored in the original “600” range partition which shouldn’t go into the new “600” interval partition. You may find you have to split (and/or merge) a few partitions in your range-partitioned table before you can do the main conversion.

Now we create create the table that we’ll actually use for the exchange and go through each exchange in turn. Because I’ve got an explicitly named starting partition the first exchange takes only two steps – exchange out, exchange in. But because I’m using interval partitioning in the composite partitioned table I’m doing a “lock partition” before the second exchange on all the other partitions as this will bring the required target partition into existence. I’m also using the “[sub]partition for()” syntax to identify the pairs of [sub]partitions – this isn’t necessary for the original range-partitioned table, of course, but it’s the only way I can identify the generated subpartitions that will appear in the composite partitioned table.


create table t for exchange with table pt_range;

prompt	=======================================================================
prompt	Double exchange to move a partition to become a composite subpartition
prompt	Could drive this programatically by picking one row from each partition
prompt	=======================================================================

alter table pt_range exchange partition p200 with table t;
alter table pt_range_list exchange subpartition p200_p_def with table t;

alter table pt_range exchange partition for (399) with table t;
lock  table pt_range_list partition for (399) in exclusive mode;
alter table pt_range_list exchange subpartition for (399,'0') with table t;

alter table pt_range exchange partition for (599) with table t;
lock  table pt_range_list partition for (599) in exclusive mode;
alter table pt_range_list exchange subpartition for (599,'0') with table t;

prompt	=====================================
prompt	Show that we've got the data in place
prompt	=====================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

Now that the data is in the target table we can split each default subpartition into the four subpartitions that we want for each partition. To cater for the future, though, I’ve first modified the subpartition template so that each new partition will have four subpartitions (though the naming convention won’t be applied, of course, Oracle will generate system name for all new partitions and subpartitions).


prompt  ================================================
prompt  Change the subpartition template to what we want
prompt  ================================================

alter table pt_range_list
set subpartition template(
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
;

prompt  ====================================================
prompt  Second nice feature - multiple splits in one command
prompt  Again, first split is fixed name.
prompt  We could do this online after allowing the users in
prompt  ====================================================

alter table pt_range_list split subpartition p200_p_def
        into (
                subpartition p200_p_0 values(0),
                subpartition p200_p_1 values(1),
                subpartition p200_p_2 values(2),
                subpartition p200_p_def
        )
;

alter table pt_range_list split subpartition for (399,'0')
        into (
                subpartition p400_p_0 values(0),
                subpartition p400_p_1 values(1),
                subpartition p400_p_2 values(2),
                subpartition p400_p_def
        )
;

alter table pt_range_list split subpartition for (599,'0')
        into (
                subpartition p600_p_0 values(0),
                subpartition p600_p_1 values(1),
                subpartition p600_p_2 values(2),
                subpartition p600_p_def
        )
;

Finally a little demonstration that we can’t add an explicitly named partition to the interval partitioned table; then we insert a row to generate the partition and show that it has 4 subpartitions.

Finishing off we rename everything (though that’s a fairly pointless exercise).


prompt  ==============================================================
prompt  Could try adding a partition to show it uses the new template
prompt  But that's not allowed for interval partitions: "ORA-14760:"
prompt  ADD PARTITION is not permitted on Interval partitioned objects
prompt  So insert a value that would go into the next (800) partition
prompt  ==============================================================

alter table pt_range_list add partition p800 values less than (800);

insert into pt_range_list (
        id, grp, small_vc, padding
)
values ( 
        799, '0', lpad(799,10,'0'), rpad('x',100,'x')
)
;

commit;

prompt  ===================================================
prompt  Template naming is not used for the subpartitions,
prompt  so we have to use the "subpartition for()" strategy 
prompt  ===================================================

alter table pt_range_list rename subpartition for (799,'0') to p800_p_0;
alter table pt_range_list rename subpartition for (799,'1') to p800_p_1;
alter table pt_range_list rename subpartition for (799,'2') to p800_p_2;
alter table pt_range_list rename subpartition for (799,'3') to p800_p_def;

prompt  ==============================================
prompt  Might as well clean up the partition names too
prompt  ==============================================

alter table pt_range_list rename partition for (399) to p400;
alter table pt_range_list rename partition for (599) to p600;
alter table pt_range_list rename partition for (799) to p800;

prompt  =======================================
prompt  Finish off by listing the subpartitions 
prompt  =======================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

It’s worth pointing out that you could do the exchanges (and the splitting and renaming at the same time) through some sort of simple PL/SQL loop – looping through the named partitions in the original table and using a row from the first exchange to drive the lock and second exchange (and splitting and renaming). For exanple something like the following which doesn’t have any of the error-trapping and defensive mechanisms you’d want to use on a production system:



declare
        m_pt_val number;
begin
        for r in (select partition_name from user_tab_partitions where table_name = 'PT_RANGE' order by partition_position) 
        loop
                execute immediate
                        'alter table pt_range exchange partition ' || r.partition_name ||
                        ' with table t';
        
                select id into m_pt_val from t where rownum = 1;
        
                execute immediate 
                        'lock table pt_range_list partition for (' || m_pt_val || ') in exclusive mode';
        
                execute immediate
                        'alter table pt_range_list exchange subpartition  for (' || m_pt_val || ',0)' ||
                        ' with table t';
        
        end loop;
end;
/

If you do go for a programmed loop you have to be really careful to consider what could go wrong at each step of the loop and how your program is going to report (and possibly attempt to recover) the situation. This is definitely a case where you don’t want code with “when others then null” appearing anywhere, and don’t be tempted to include code to truncate the exchange table.

 

May 24, 2019

Re-partitioning – 18

Filed under: 18c,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 12:50 pm BST May 24,2019

In yesterday’s note on the options for converting a range-partioned table into a composite range/list parititioned table I mentioned that you could do this online with a single command in 18c, so here’s some demonstration code to demonstrate that claim:


rem
rem     Script:         pt_comp_from_pt_18.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum - 1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
        rownum <= 600 -- > comment to avoid WordPress format issue
;

commit;

alter table pt_range modify
partition by range(id) interval (200)
subpartition by list (grp) 
subpartition template (
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
(
        partition p200 values less than (200)
)
-- online
;

execute dbms_stats.gather_table_stats(null, 'pt_range', granularity=>'all')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows 
from    user_tab_subpartitions 
where   table_name = 'PT_RANGE'
order by
        partition_name, subpartition_name
;

Run this (with or without the online option) and you’ll (probably) see the Oracle error “ORA-00604: error occurred at recursive SQL level 1” with one of two underlying errors:

    ORA-01950: no privileges on tablespace 'SYSTEM'
    ORA-01536: space quota exceeded for tablespace 'SYSTEM'

So what’s gone wrong – it ought to work.

After enabling a 10046 trace I repeated the “alter table” command then scanned the trace file for the text “err=1950” (that being the error I’d received on my first attempt) and scanned backwards for the “PARSING IN CURSOR” line with a matching cursor id:


ERROR #139721552722200:err=1950 tim=26541227462

PARSING IN CURSOR #139721552722200 len=182 dep=1 uid=104 oct=1 lid=0 tim=26541224560 hv=2451601965 ad='7f1377267890' sqlid='0wsjfgk920yjd'
create table  "TEST_USER"."SYS_RMTAB$$_H124028"  ( src_rowid rowid not null , tgt_rowid rowid not null) 
    segment creation immediate nologging 
    tablespace  "SYSTEM"  
    rowid_mapping_table
END OF STMT

The code is trying to create a “rowid_mapping_table” in the system tablespace and I have no quota for the tablespace. (The 124028 in the table name relates to the object_id of the table I was trying to modify, by the way.)

The source of the error offered a big clue about a possible workaround sp I gave myself a quota (unlimited) on the system tablespace (alter user test_user quota unlimited on system) and that made it possible for the restructuring to take place. It’s not really an appropriate workaround for a production system though – especially if you’re using the online option and the table is subject to a lot of change.  (Note – this “rowid_mapping_table” and a “journal” table are created even if you haven’t selected the online option.)

Footnotes

  • The problem has been fixed in 19c (tested on LiveSQL)  and is listed on MoS as Bug 27580976 : INTERNAL RECURSIVE MAPPING TABLE MISTAKENLY PLACE IN SYSTEM FOR ONLINE OPS. There are no patches for 18c at present.
  • After I’d finished testing the quota workaround I tried to deprive myself of the quota on the system tablespace. I may have missed something in the manuals but it looks like the only way to do this is to give myself a zero quota (or, as I have done occasionally in the past, drop user cascade) because there’s no option for “quota denied” or “revoke quota” . This is why you may get one of two different messages after the ORA-00604. If you’ve never had a quota on the system tablespace you’ll get the “ORA-1950: no privileges” message, if you’ve had a quota at some time in the pasat and then had it set to zero’ you’ll get the “ORA-01536: space quota exceeded” message.

 

May 23, 2019

Re-partitioning

Filed under: 12c,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 11:45 am BST May 23,2019

I wrote a short note a little while ago demonstrating how flexible Oracle 12.2 can be about physically rebuilding a table online to introduce or change the partitioning while discarding data, and so on.  But what do you do (as a recent question on ODC asked) if you want to upgrade a customer’s database to meet the requirements of a new release of your application by changing a partitioned table into a composite partitioned table and don’t have enough room to do an online rebuild. Which could require two copies of the data to exist at the same time.)

If you’ve got the down time (and not necessarily a lot is needed) you can fall back on “traditional methods” with some 12c enhancements. Let’s start with a range partitioned table:


rem
rem     Script:         pt_comp_from_pt.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2019
rem

create table pt_range (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id) (
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

insert into pt_range
select
        rownum-1,
        mod(rownum,2),
        lpad(rownum,10,'0'),
        rpad('x',100,'x')
from
        all_objects
where
        rownum <= 600
;

commit;

So we’ve got a range-partitioned table with three partitions and some data in each partition. Let’s pretend we want to change this to range/list with the grp column as the subpartition key, allowing explicit use of values 0,1,2 and a bucket subpartition for anything else. First we create an empty version of the table with a suitable subpartition template, and a simple heap table to be used as an exchange medium:


create table pt_range_list (
        id              number(8,0)     not null,
        grp             varchar2(1)     not null,
        small_vc        varchar2(10),
        padding         varchar2(100)
)
partition by range(id)
subpartition by list (grp)
subpartition template (
        subpartition p_def      values(default)
)
(
        partition p200 values less than (200),
        partition p400 values less than (400),
        partition p600 values less than (600)
)
;

prompt  ===============================================
prompt  First nice 12.2 feature - "create for exchange"
prompt  ===============================================

create table t for exchange with table pt_range;

You’ll notice that our subpartition template identifies just a single subpartition that takes default values – i.e. anything for which no explicit subpartition has been identified. This means we have a one to one correspondance between the data segments of the original table and the copy table. So now we go through a tedious loop (which we could code up with a PL/SQL “execute immediate” approach) to do a double-exchange for each partition in turn. (Any PL/SQL code is left as an exercise to the interested reader.)


alter table pt_range exchange partition p200 with table t;
alter table pt_range_list exchange subpartition p200_p_def with table t;

alter table pt_range exchange partition p400 with table t;
alter table pt_range_list exchange subpartition p400_p_def with table t;

alter table pt_range exchange partition p600 with table t;
alter table pt_range_list exchange subpartition p600_p_def with table t;

prompt  =====================================
prompt  Show that we've got the data in place
prompt  =====================================

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

break on partition_name skip 1

select  partition_name, subpartition_name, num_rows
from    user_tab_subpartitions
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;


PARTITION_NAME         SUBPARTITION_NAME        NUM_ROWS
---------------------- ---------------------- ----------
P200                   P200_P_DEF                    200

P400                   P400_P_DEF                    200

P600                   P600_P_DEF                    200


3 rows selected.

We now have to split the newly arrived subpartitions into the 4 pieces we want – but before we do that let’s make sure that any new partitions automatically have the correct subpartitions by changing the subpartition template:


alter table pt_range_list
set subpartition template(
        subpartition p_0 values (0),
        subpartition p_1 values (1),
        subpartition p_2 values (2),
        subpartition p_def values (default)
)
;

prompt  =========================================================
prompt  Second nice 12.2 feature - multiple splits in one command
prompt  We could do this online after allowing the users back on.
prompt  =========================================================

alter table pt_range_list split subpartition p200_p_def
        into (
                subpartition p200_p_0 values(0),
                subpartition p200_p_1 values(1),
                subpartition p200_p_2 values(2),
                subpartition p200_p_def
        )
;

alter table pt_range_list split subpartition p400_p_def
        into (
                subpartition p400_p_0 values(0),
                subpartition p400_p_1 values(1),
                subpartition p400_p_2 values(2),
                subpartition p400_p_def
        )
;

alter table pt_range_list split subpartition p600_p_def
        into (
                subpartition p600_p_0 values(0),
                subpartition p600_p_1 values(1),
                subpartition p600_p_2 values(2),
                subpartition p600_p_def
        )
;

Now, just to check that everything is behaving, let’s add a new partition, and check to see what partitions and subpartitions we end up with:


alter table pt_range_list add partition p800 values less than (800);

execute dbms_stats.gather_table_stats(user,'pt_range_list',granularity=>'ALL')

select  partition_name, subpartition_name, num_rows
from    user_tab_subpartitions
where   table_name = 'PT_RANGE_LIST'
order by
        partition_name, subpartition_name
;

PARTITION_NAME         SUBPARTITION_NAME        NUM_ROWS
---------------------- ---------------------- ----------
P200                   P200_P_0                      100
                       P200_P_1                      100
                       P200_P_2                        0
                       P200_P_DEF                      0

P400                   P400_P_0                      100
                       P400_P_1                      100
                       P400_P_2                        0
                       P400_P_DEF                      0

P600                   P600_P_0                      100
                       P600_P_1                      100
                       P600_P_2                        0
                       P600_P_DEF                      0

P800                   P800_P_0                        0
                       P800_P_1                        0
                       P800_P_2                        0
                       P800_P_DEF                      0


16 rows selected.

And as a final note – if we decide we want to put it all back we could merge four subpartitions down to one subpartition with a single command – then loop through every partition in turn:


alter table pt_range_list
        merge subpartitions  p200_p_0, p200_p_1, p200_p_2, p200_p_def
        into  subpartition  p200_p_def
;

And now I feel like I’m turning into Tim Hall – writing potentially useful demonstrations instead of trying to baffle people with rocket science. But I hope to get over that quite soon. Of course I have left out some important “real-world” details – particularly how you choose to handle indexes while doing the double-exchange. My view would be to take the local indexes with you on the exchange, bypass the global indexes on the exchange out, and be choosy about which global indexes to maintain on the exchange back in; but it all depends on how much downtime you have, how many indexes there are, and the state they’re likely to start or end in.

As ever it’s possible to start with a simple idea like this, then discover there are real-world complications that have to be dealt with. So there’s another article in the pipeline to handle a slightly more complex case. I’ll also be publishing a short note about the easy way of getting the job done from 18c onwards – if you’ve got the spare resources.

 

March 13, 2019

Hash Partitions

Filed under: Oracle,Partitioning,Statistics — Jonathan Lewis @ 1:13 pm BST Mar 13,2019

Here’s an important thought if you’ve got any large tables which are purely hash partitioned. As a general guideline you should not need partition level stats on those tables. The principle of hash partitioned tables is that the rows are distributed uniformly and randomly based on the hash key so, with the assumption that the number of different hash keys is “large” compared to the number of partitions, any one partition should look the same as any other partition.

Consider, as a thought experiment (and as a warning), a table of product_deliveries which is hash partitioned by product_id with ca. 65,000 distinct products that have been hashed across 64 partitions. (Hash partitioning should always use a power of 2 for the partition count if you want the number of rows per partition to be roughly the same across all partitions – if you don’t pick a power of two then some of the partitions will be roughly twice the size of others.)

Consider a query for “deliveries to Basingstoke” – in the absence of a histogram on the delivery location the optimizer will produce a cardinality estimate that is:

  • total rows in table / number of distinct delivery locations in table

Now consider a query for: “deliveries of product X to Basingstoke” – again in the absence of histograms. The optimizer could have two ways of calculating this cardinality:

  • total rows in table / (number of distinct products in table * number of distinct delivery locations in table)
  • total rows in relevant partition / (number of distinct products in relevant partition * number of distinct delivery locations in relevant partition)

But given the intent of hash partitioning to distribute data evenly we can make three further observations:

  1. the number of rows in any one partition should be very similar to the number of rows in the table divided by the number of partitions
  2. the number of distinct products in any one partition should be very similar to the number of products in the table divided by the number of partitions
  3. the number of distinct locations in any one partition should be very similar to the number of distinct locations in the whole table.

The second condition holds because product is the partition key, the third holds because location is not the partition key.

So we can rewrite the second, partition-oriented, formula as:

  • (total rows in table / number of partitions) / ((number of distinct products in table / number of partitions) * number of distinct locations in table)

which, re-arranging parentheses and cancelling common factors, reduces to:

  • total rows in table / (number of distinct products in table * number of distinct locations in table)

which matches the first formula. (Q.E.D.) In the absence of any statistics on hash partitions the optimizer can (ought to be able to) produce reasonable cardinality estimates based purely on table-level stats.

In fact if you look back into the history of partitioning this observation is implicit in the early days of composite partitioning when the only option was for range/hash composite partitions – the optimizer never used sub-partition stats to calculate costs or cardinality it used only partition-level statistics. (And it was several years before the optimizer caught up to the fact that (e.g.) range/list composite partitioning might actually need to do arithmetic based on subpartition stats.)

I did say that the example was also a warning. Hash partitioning is “expected” to have a large number of distinct key values compared to the number of partitions. (If you don’t meet this requirement then possibly you should be using list partitioning). There’s also a “uniformity” assumption built into the arithmetic (both the basic arithmetic and the hand-waving discussion I produced above). Just imagine that your company supplies a handful of products that for some strange reason are incredibly popular  in Basingstoke. If this is the case then the assumption that “all partitions look alike” is weakened and you would have to consider the possibility that the variation would require you to produce a workaround to address problems of poor cardinality estimates that the variation might produce.

A pattern of this type has two generic effects on the optimizer, of course. First is the simple skew in the data – to have a significant impact the number of rows for the problem products would have to be much larger than average, which suggests the need for a suitably crafted histogram; secondly there’s an implied correlation between a few products and Basingstoke, so you might even end up creating a column group and manually coding a histogram on it to capture the correlation.

 

March 6, 2019

12c Snapshots

Filed under: 12c,Oracle,Partitioning,Performance — Jonathan Lewis @ 10:35 am BST Mar 6,2019

I published a note a few years ago about using the 12c “with function” mechanism for writing simple SQL statements to takes deltas of dynamic performance views. The example I supplied was for v$event_histogram but I’ve just been prompted by a question on ODC to supply a couple more – v$session_event and v$sesstat (joined to v$statname) so that you can use one session to get an idea of the work done and time spent by another session – the first script reports wait time:


rem
rem     Program:        12c_with_function_2.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session WAIT time
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 121, 127'

set timing on
set sqlterminator off

set linesize 180

break on sid skip 1

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        sum(total_waits),
        sum(total_timeouts), 
        sum(time_waited), 
        event
from    (
        select
                sid, event_id, 
                -total_waits total_waits, 
                -total_timeouts total_timeouts, 
                -time_waited time_waited, 
                -time_waited_micro time_waited_micro, 
                event
        from    v$session_event
        where   sid in ( &m_sid_list )
        union all
        select
                null, null, null, null, null, wait_row(&m_snap_time, 0), null
        from    dual
        union all
        select
                sid, event_id, total_waits, total_timeouts, time_waited, time_waited_micro, event
        from    v$session_event
        where   sid in ( &m_sid_list )
        )
where
        time_waited_micro != 0
group by
        sid, event_id, event
having
        sum(time_waited) != 0
order by
        sid, sum(time_waited) desc
/


And this one reports session activity:

rem
rem     Program:        12c_with_function_3.sql
rem     Dated:          July 2013
rem
rem     See also
rem     12c_with_function.sql
rem     https://jonathanlewis.wordpress.com/2013/06/30/12c-fun/
rem
rem     Notes:
rem             Reports session stats
rem             Modify the list of SIDs of interest
rem             Set the time in seconds
rem

define m_snap_time = 60
define m_sid_list  = '3, 4, 13, 357'


set timing on
set sqlterminator off

set linesize 180

break on sid skip 1
column name format a64

with
        function wait_row (
                i_secs  number, 
                i_return        number
        ) return number
        is
        begin
                dbms_lock.sleep(i_secs);
                return i_return;
        end;
select
        sid, 
        name,
        sum(value)
from    (
        select
                ss.sid, 
                ss.statistic#,
                sn.name,
                -ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        union all
        select
                null, null, null, wait_row(&m_snap_time, 0)
        from    dual
        union all
        select
                ss.sid, ss.statistic#, sn.name, ss.value value
        from
                v$sesstat       ss,
                v$statname      sn
        where   ss.sid in ( &m_sid_list )
        and     sn.statistic# = ss.statistic#
        )
where
        value != 0
group by
        sid, statistic#, name
having
        sum(value) != 0
order by
        sid, statistic#
/


You’ll notice that I’ve used dbms_lock.sleep() in my wait function – and the session running the SQL can be granted the execute privilege on the package through a role to make this work – but if you’re running Oracle 18 then you’ve probably noticed that the sleep() function and procedure have been copied to the dbms_session package.

 

June 1, 2018

Index Bouncy Scan 4

Filed under: 12c,Execution plans,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 9:19 am BST Jun 1,2018

There’s always another hurdle to overcome. After I’d finished writing up the “index bouncy scan” as an efficient probing mechanism to find the combinations of the first two columns (both declared not null) of a very large index a follow-up question appeared almost immediately: “what if it’s a partitioned index”.

The problem with “typical” partitioned indexes is that the smallest value of the leading column might appear in any of the partitions, and the combination of that value and the smallest value for the second column might not appear in all the partitions where the smallest value appears. Consider a table of 10 partitions and a locally partitioned index on (val1, val2) where neither column is the partition key. The smallest value of val1 – call it k1 may appear only in partitions 4, 7, 8, 9, 10; the lowest combination of (val1, val2) – call it (k1, k2) may appear only in partitions 8 and 10. In a global (or globally partitioned) index the pair (k1, k2) would be at the low (leftmost) end of the index, but to find the pair in a locally partitioned index we have to probe the leftmost end of 10 separate index partitions – and once we’ve done that each “bounce” requires us to probe 10 index partitions for the first (val1, val2) pair where val1 = k1 and val2 is just just greater than k2, or val1 is just greater than k1 and val2 is the minimum for that value of val1. The more partitions we have the greater the number of index partitions we have to probe at each step and the more likely it is that we ought to switch to a brute force index fast full scan with aggregate.

Here’s the starting point for solving the problem (maybe) – I’ll create a simple partitioned table, and use the “bouncy scan” code from the earlier posting with the table and column names adjusted accordingly:


rem
rem     Script:         bouncy_index_3.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Apr 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10) ; alter table pt1 modify(status not null); execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

prompt  ==================================================
prompt  Make some rows in the last partition have a status
prompt  that won't be found in the first partition.
prompt  ==================================================

column namespace format 99999999
column partition_name new_value m_part

select  partition_name
from    user_tab_partitions
where   table_name = 'PT1'
order by
        partition_position
;

update pt1 partition (&m_part) set status = 'MISSING' where rownum <= 10;

select
        dbms_mview.pmarker(rowid), status, namespace
from    pt1
where   status = 'MISSING'
;

I’ve created a hash partitioned copy of view all_objects, duplicating it 10 times and created a local index on the columns (status, namespace). My data has two values for status, ‘VALID’ and ‘INVALID’, and there are about 10 values for the namespace. I’ve then updated a few rows in the last partition, giving them a status value that is between the two current values – this is just one little test case to help me check that my code is going to catch all values even if they don’t appear in the first table partition.

Here’s the query from the earlier posting – and it does get the right results – followed by the execution plan:


alter session set statistics_level = all;

set serveroutput off
set linesize 180
set pagesize 60

prompt  =============================================================
prompt  Original Query, showing expensive access for driving minimums
prompt  =============================================================

with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select
                        /*+ index(pt1) no_index_ffs(pt1) */
                        status, namespace,
                        row_number() over(order by status, namespace) rn
                from    pt1
        )
        where
                rn = 1
        union all
        select
                v1.status, v1.namespace
        from    bounce1,
                lateral (
                              select  /*+ index(pt1) no_index_ffs(pt1) no_decorrelate */
                                      pt1.status, pt1.namespace
                              from    pt1
                              where   pt1.status > bounce1.status
                              and     rownum = 1
                ) v1
        where   bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last outline'));


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                    | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                             |                 |      1 |        | 16378 (100)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   1 |  SORT ORDER BY                               |                 |      1 |      4 | 16378   (4)|       |       |     10 |00:00:00.58 |    1869 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                       |                 |      1 |      4 | 16377   (4)|       |       |     10 |00:00:00.58 |    1869 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     12 |00:00:00.58 |    1869 |  1024 |  1024 |          |
|*  4 |     VIEW                                     |                 |      1 |      2 |  8157   (4)|       |       |      2 |00:00:00.58 |    1747 |       |       |          |
|   5 |      UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      2 |00:00:00.58 |    1747 |  1024 |  1024 | 2048  (0)|
|*  6 |       VIEW                                   |                 |      1 |      1 |  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |       |       |          |
|*  7 |        WINDOW SORT PUSHED RANK               |                 |      1 |    617K|  4047   (4)|       |       |      1 |00:00:00.58 |    1732 |  2048 |  2048 | 2048  (0)|
|   8 |         PARTITION HASH ALL                   |                 |      1 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.34 |    1732 |       |       |          |
|   9 |          INDEX FULL SCAN                     | PT1_I1          |      4 |    617K|  1759   (2)|     1 |     4 |    617K|00:00:00.15 |    1732 |       |       |          |
|  10 |       NESTED LOOPS                           |                 |      2 |      1 |  4110   (4)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  11 |        RECURSIVE WITH PUMP                   |                 |      2 |        |            |       |       |      2 |00:00:00.01 |       0 |       |       |          |
|  12 |        VIEW                                  | VW_LAT_1BBF5C63 |      2 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      15 |       |       |          |
|* 13 |         COUNT STOPKEY                        |                 |      2 |        |            |       |       |      1 |00:00:00.01 |      15 |       |       |          |
|  14 |          PARTITION HASH ALL                  |                 |      2 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|* 15 |           INDEX RANGE SCAN                   | PT1_I1          |      5 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      15 |       |       |          |
|  16 |     SORT AGGREGATE                           |                 |     10 |      1 |            |       |       |     10 |00:00:00.01 |     122 |       |       |          |
|  17 |      PARTITION HASH ALL                      |                 |     10 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  18 |       FIRST ROW                              |                 |     40 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     122 |       |       |          |
|* 19 |        INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     40 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     122 |       |       |          |
|  20 |     RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     10 |00:00:00.01 |       0 |       |       |          |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
   4 - filter("BOUNCE1"."STATUS" IS NOT NULL)
   6 - filter("RN"=1)
   7 - filter(ROW_NUMBER() OVER ( ORDER BY "STATUS","NAMESPACE")<=1) 13 - filter(ROWNUM=1) 15 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  19 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

In terms of time the query doesn’t seem to have done too badly – but I’m only using a small data set and we can see from the numbers that we haven’t produced an efficient plan. Operations 8 and 9 tell us that we’ve done an index full scan on every single partition before passing the data up for a window sort operation. That’s clearly a bad thing, but we did have an index() hint at that bit of code that worked very well for the simple (global) index so maybe we should have taken that out before testing (except it doesn’t help much to do so since Oracle still scans all 617K rows, changing to an index fast full scan).

Apart from that massive load the rest of the query looks quite efficient. We keep seeing “partition hash all” of course – whatever we do we tend to do it to 4 separate partitions one after the other – but everything else we do looks rather efficient. But there is another problem – and this is where the importance of inserting the rows with status = ‘MISSING’ shows up: this query didn’t find them! We have a predicate “rownum = 1” in the second half of the bounce1 recursive subquery and because we’re using a partitioned index we’ve managed to find a row that looks appropriate in an early partition when the row we really needed doesn’t appear until the last partition.

Let’s return to this problem later – first we want to check if the rest of the query will run efficiently and give us the right answer if we can find some way of getting the starting values; so let’s use a strategy we’ve used before – replace the bounce1 subquery with a union all select from dual:


with bounce1(status, namespace) as (
        select status, namespace
        from    (
                select 'INVALID' status, 1 namespace from dual
                union all
                select 'MISSING', 4 from dual
                union all
                select 'VALID', 1 from dual
        )
),
bounce2 (status, namespace)
as (
        select  status, namespace
        from    bounce1
        where   bounce1.status is not null
        union all
        select  bounce2.status, (select min(pt1.namespace) namespace from pt1 where pt1.status = bounce2.status and pt1.namespace > bounce2.namespace) namespace
        from    bounce2
        where   bounce2.namespace is not null
        and     bounce2.status is not null
)
select * from bounce2
where
        bounce2.namespace is not null
and     bounce2.status is not null      -- > redundant predicate
order by
        status, namespace
;

select * from table(dbms_xplan.display_cursor(null,null,'cost allstats last partition outline'));

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |        |      1 |        |    76 (100)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   1 |  SORT ORDER BY                             |        |      1 |      6 |    76   (2)|       |       |     11 |00:00:00.01 |     132 |  2048 |  2048 | 2048  (0)|
|*  2 |   VIEW                                     |        |      1 |      6 |    75   (0)|       |       |     11 |00:00:00.01 |     132 |       |       |          |
|   3 |    UNION ALL (RECURSIVE WITH) BREADTH FIRST|        |      1 |        |            |       |       |     14 |00:00:00.01 |     132 |  1024 |  1024 |          |
|   4 |     VIEW                                   |        |      1 |      3 |     6   (0)|       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   5 |      UNION-ALL                             |        |      1 |        |            |       |       |      3 |00:00:00.01 |       0 |       |       |          |
|   6 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   7 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   8 |       FAST DUAL                            |        |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |       |       |          |
|   9 |     SORT AGGREGATE                         |        |     11 |      1 |            |       |       |     11 |00:00:00.01 |     132 |       |       |          |
|  10 |      PARTITION HASH ALL                    |        |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  11 |       FIRST ROW                            |        |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     132 |       |       |          |
|* 12 |        INDEX RANGE SCAN (MIN/MAX)          | PT1_I1 |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     132 |       |       |          |
|  13 |     RECURSIVE WITH PUMP                    |        |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("BOUNCE2"."NAMESPACE" IS NOT NULL AND "BOUNCE2"."STATUS" IS NOT NULL))
  12 - access("PT1"."STATUS"=:B1 AND "PT1"."NAMESPACE">:B2)

This gets us the right answer, very efficiently. There are only 11 rows in the result set and we have an average 12 buffer visits per row – which is reasonble given that we (probably) have to probe 4 index partitions for every row. So that’s 11 * 4 * 3 buffer visits per probe – which seems just about optimal.

The next step is to figure out a way of getting the (three in our case) starting points while using a partitioned index. Here’s a query we can use for bounce1:


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
)
select * from bounce1
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

It looks a little convoluted with all the inline select statements, but they all do very small amounts of work and they’re only reading the index leaf blocks that you have to read. We know from yesterday’s post that Oracle can execute the scalar subqueries at lines 3 and 4 very efficiently; we can hope (and check) that the lateral() subqueries driven by the single values from the recursive row in bounce1 will operate just as efficiently – and here’s the plan:


----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                 | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                 |      1 |        |   166 (100)|       |       |      4 |00:00:00.01 |     132 |
|   1 |  VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   3 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   4 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   5 |      INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   6 |    SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   7 |     PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   8 |      FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  9 |       INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  11 |         PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  12 |          INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  13 |    FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  14 |    NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  15 |     NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  16 |      RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  17 |      VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |       SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  19 |        PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  20 |         FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 21 |          INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |     VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  23 |      SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  24 |       PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  25 |        FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 26 |         INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |           PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |            FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |             INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - access("STATUS"=)
  21 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  26 - access("PT1"."STATUS"=)
  30 - access("PT2"."STATUS">:B1)

Although we have done lots of individual probes into the index they have all been very efficient using a min/max access and an average of about 3 buffer visits per probe. So we can now insert this new bounce1 subquery into the previous query in place of the union all of dual and check that the two pieces of the query cooperate.


with bounce1(status, namespace) as (
        select
                (select min(status) from pt1) status,
                (select /*+ index(pt1) */ min(namespace) from pt1 where status = (select min(status) from pt1)) namespace
        from
                dual
        union all
        select
                v1.status, v2.namespace
        from    bounce1,
                lateral(
                        (select /*+ index(pt1) */ min(pt1.status) status from pt1 where pt1.status > bounce1.status)
                )       v1,
                lateral(
                        select /*+ index(pt1) */ min(pt1.namespace) namespace
                        from pt1
                        where pt1.status =  (select min(pt2.status) from pt1 pt2 where pt2.status > bounce1.status)
                )       v2
        where
                bounce1.status is not null
        and     bounce1.namespace is not null
),
bounce2 (status, namespace)
as (
        select  status, namespace from bounce1
        union all
        select  bounce2.status, (select min(t.namespace) namespace from pt1 t where t.namespace > bounce2.namespace and status=bounce2.status) namespace
        from    bounce2
        where   bounce2.status is not null
        and     bounce2.namespace is not null
)
select  *
from    bounce2
where   namespace is not null
;

select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition outline'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                            |                 |      1 |        |   396 (100)|       |       |     11 |00:00:00.01 |     266 |
|*  1 |  VIEW                                       |                 |      1 |      4 |   396   (1)|       |       |     11 |00:00:00.01 |     266 |
|   2 |   UNION ALL (RECURSIVE WITH) BREADTH FIRST  |                 |      1 |        |            |       |       |     15 |00:00:00.01 |     266 |
|   3 |    VIEW                                     |                 |      1 |      2 |   166   (0)|       |       |      4 |00:00:00.01 |     132 |
|   4 |     UNION ALL (RECURSIVE WITH) BREADTH FIRST|                 |      1 |        |            |       |       |      4 |00:00:00.01 |     132 |
|   5 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)            | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   8 |      SORT AGGREGATE                         |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   9 |       PARTITION HASH ALL                    |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  10 |        FIRST ROW                            |                 |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|* 11 |         INDEX RANGE SCAN (MIN/MAX)          | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|  12 |          SORT AGGREGATE                     |                 |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|  13 |           PARTITION HASH ALL                |                 |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  14 |            INDEX FULL SCAN (MIN/MAX)        | PT1_I1          |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|  15 |      FAST DUAL                              |                 |      1 |      1 |     2   (0)|       |       |      1 |00:00:00.01 |       0 |
|  16 |      NESTED LOOPS                           |                 |      4 |      1 |   146   (0)|       |       |      3 |00:00:00.01 |      96 |
|  17 |       NESTED LOOPS                          |                 |      4 |      1 |   137   (0)|       |       |      3 |00:00:00.01 |      36 |
|  18 |        RECURSIVE WITH PUMP                  |                 |      4 |        |            |       |       |      3 |00:00:00.01 |       0 |
|  19 |        VIEW                                 | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      36 |
|  20 |         SORT AGGREGATE                      |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  21 |          PARTITION HASH ALL                 |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  22 |           FIRST ROW                         |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 23 |            INDEX RANGE SCAN (MIN/MAX)       | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  24 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |     9   (0)|       |       |      3 |00:00:00.01 |      60 |
|  25 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      60 |
|  26 |         PARTITION HASH ALL                  |                 |      3 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  27 |          FIRST ROW                          |                 |     12 |      1 |     9   (0)|       |       |      5 |00:00:00.01 |      60 |
|* 28 |           INDEX RANGE SCAN (MIN/MAX)        | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      5 |00:00:00.01 |      60 |
|  29 |            SORT AGGREGATE                   |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  30 |             PARTITION HASH ALL              |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  31 |              FIRST ROW                      |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 32 |               INDEX RANGE SCAN (MIN/MAX)    | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  33 |    SORT AGGREGATE                           |                 |     11 |      1 |            |       |       |     11 |00:00:00.01 |     134 |
|  34 |     PARTITION HASH ALL                      |                 |     11 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  35 |      FIRST ROW                              |                 |     44 |      1 |     9   (0)|       |       |     27 |00:00:00.01 |     134 |
|* 36 |       INDEX RANGE SCAN (MIN/MAX)            | PT1_I1          |     44 |      1 |     9   (0)|     1 |     4 |     27 |00:00:00.01 |     134 |
|  37 |    RECURSIVE WITH PUMP                      |                 |     10 |        |            |       |       |     11 |00:00:00.01 |       0 |
------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("NAMESPACE" IS NOT NULL)
  11 - access("STATUS"=)
  23 - access("PT1"."STATUS">"BOUNCE1"."STATUS")
  28 - access("PT1"."STATUS"=)
  32 - access("PT2"."STATUS">:B1)
  36 - access("STATUS"=:B1 AND "T"."NAMESPACE">:B2)

Job done. We’ve found the distinct set of pairs without having to scan the entire index. We’ve found 11 pairs at a total cost of 266 buffer gets. For comparitive purposes the query totalled 56 buffer visits when I recreated the table as a non-partitioned table (again updating a few rows to status = ‘MISSING’).

It’s important to note that this query can only work this efficiently in 12.2 (and possibly in a suitably patched 11.2.0.4) because of the optimizer’s ability to use the min/max operation for queries like: “select max(col2) where col1 = (select max()…))”. When I ran the final query on 12.1.0.2 the execution plan changed around lines 11 and 28 where 12.2.0.1 could use the aggregate subquery to drive the min/max scan 12.1.0.2 did a real range scan with aggregate (which was extremely expensive at one point).

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                   | Name            | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------------------------
|  23 |       VIEW                                  | VW_LAT_C2D92EFA |      3 |      1 |  1206   (2)|       |       |      3 |00:00:05.43 |    2414 |
|  24 |        SORT AGGREGATE                       |                 |      3 |      1 |            |       |       |      3 |00:00:05.43 |    2414 |
|  25 |         PARTITION HASH ALL                  |                 |      3 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:05.84 |    2414 |
|* 26 |          INDEX RANGE SCAN                   | PT1_I1          |     12 |    422K|  1206   (2)|     1 |     4 |    845K|00:00:02.03 |    2414 |
|  27 |           SORT AGGREGATE                    |                 |      3 |      1 |            |       |       |      3 |00:00:00.01 |      36 |
|  28 |            PARTITION HASH ALL               |                 |      3 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
|  29 |             FIRST ROW                       |                 |     12 |      1 |     9   (0)|       |       |      8 |00:00:00.01 |      36 |
|* 30 |              INDEX RANGE SCAN (MIN/MAX)     | PT1_I1          |     12 |      1 |     9   (0)|     1 |     4 |      8 |00:00:00.01 |      36 |
------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, this makes a dramatic difference to the work Oracle has to do – in this case 2,414 buffer gets and 845K rows examined. As I said in yestrday’s post – there’s a patch for 11.2.0.4, so there could be a patch for 12.1.0.2 if you ask for it, but it looks like no-one has done so yet.

<h3>Footnote:</h3>

I could have used a lateral() view in the first half of bounce1 to reduce the reported number of probes of pt1_i1 in the plan – but it made the code extremely messy, I had to include a /*+ no_decorrelate */ hint in it, and it increased the number of buffer visits slightly because the optimizer seemed to lose the option for a min/max scan in this particular lateral join.

 

May 31, 2018

Min/Max upgrade

Filed under: 12c,Indexing,Oracle,Partitioning,Performance — Jonathan Lewis @ 2:13 pm BST May 31,2018

Here’s a nice little optimizer enhancement that appeared in 12.2 to make min/max range scans (and full scans) available in more circumstances. Rather than talk through it, here’s a little demonstration:

rem
rem     Script:         122_minmax.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1        Good path
rem             12.1.0.2        Bad path

create table pt1 (
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
)
nologging
partition by hash (object_id) partitions 4
as
select
        object_id,
        owner,
        object_type,
        object_name,
        status,
        namespace
from
        (select * from all_objects),
        (select rownum n1 from dual connect by level <= 10) -- > comment to avoid format wordpress issue
;

alter table pt1 modify(status not null);

execute dbms_stats.gather_table_stats(null,'pt1',granularity=>'ALL',method_opt=>'for all columns size 1')

create index pt1_i1 on pt1(status, namespace) nologging local;

alter session set statistics_level = all;
set serveroutput off
set linesize 156
set pagesize 60
set trimspool on

select  min(status) from pt1;
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = 'INVALID';
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

select  min(namespace) from pt1 where status = (select min(status) from pt1);
select * from table(dbms_xplan.display_cursor(null,null,'allstats last cost partition'));

The basic “min/max” optimisation allows Oracle to avoid a massive sort aggregate – Oracle doesn’t need to acquire a lot of data and sort it when it knows that the “left hand” end of an index is the low values and the “right hand” is the high values so, for example, in the first query above the optimizer could simply walk down the index branches to the left hand leaf and look at the single lowest entry in the leaf block to determine the lowest value for status … if the index had been a global index.

Things get a little messy, though, when the index is locally partitioned and your query isn’t about the partition key and there’s no suitable global index. Once upon a time (IIRC) Oracle would simply have to do an index fast full scan across all index partitions to handle such a query, but some time ago it got a lot cleverer and was enhanced to do a min/max scan on each partition in turn getting one value per partition very efficiently, then aggregating across those values to find the global minimum.

Here are the three execution plans (with rowsource execution stats pulled from memory) taken from 12.1.0.2 for the queries above:


-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   3 |    INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
-----------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      12 |
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   2 |   PARTITION HASH ALL          |        |      1 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
|   3 |    FIRST ROW                  |        |      4 |      1 |     9   (0)|       |       |      1 |00:00:00.01 |      12 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      1 |00:00:00.01 |      12 |
-------------------------------------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |
-----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |        |      1 |        |   337 (100)|       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   1 |  SORT AGGREGATE                |        |      1 |      1 |            |       |       |      1 |00:00:00.07 |    2402 |   2242 |
|   2 |   PARTITION HASH ALL           |        |      1 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|*  3 |    INDEX FAST FULL SCAN        | PT1_I1 |      4 |    422K|   328  (10)|     1 |     4 |     10 |00:00:00.07 |    2402 |   2242 |
|   4 |     SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |      0 |
|   5 |      PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
|   6 |       INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |      0 |
-----------------------------------------------------------------------------------------------------------------------------------------

In the first plan Oracle has done an “index full scan (min/max)” across each of the four partitions in turn to return one row very cheaply from each, then aggregated to find the overall minimum.

In the second plan Oracle has done an “index range scan (min/max)” in exactly the same way, since it was able to find the start point in the index for the status ‘INVALID’ very efficiently.

In the third plan Oracle has been able to find the minimum value for the status (‘INVALID’) very efficiently in the subquery, and has passed that single value up to the main query, which has then used a brute force approach to search the whole of every partition of the index for every occurrence (all 10 of them) of the value ‘INVALID’ and then aggregated them to find the minimum namespace. Despite “knowing”, by the time the main query runs, that there will be a single value to probe for the status, the optimizer has not anticipated the fact that the final query will effectively become the same as the preceding one. As a result we’ve read 2,242 data blocks into the cache.

Turn, then, to the execution plan from 12.2.0.1 for this last query:


---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name   | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |        |      1 |        |     9 (100)|       |       |      1 |00:00:00.01 |      24 |
|   1 |  SORT AGGREGATE                 |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      24 |
|   2 |   PARTITION HASH ALL            |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   3 |    FIRST ROW                    |        |      4 |      1 |     9   (0)|       |       |      4 |00:00:00.01 |      24 |
|*  4 |     INDEX RANGE SCAN (MIN/MAX)  | PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      24 |
|   5 |      SORT AGGREGATE             |        |      1 |      1 |            |       |       |      1 |00:00:00.01 |      12 |
|   6 |       PARTITION HASH ALL        |        |      1 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
|   7 |        INDEX FULL SCAN (MIN/MAX)| PT1_I1 |      4 |      1 |     9   (0)|     1 |     4 |      4 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------------------------------------------------

In 12.2 you can see that the main query is now doing an “index range scan (min/max)” on each index partition in turn, based on the incoming (though unknown at parse time) single value from the subquery. As a result the total work done is a mere 24 buffer visits.

There have been a couple of occasions in the past where I’ve had to write some PL/SQL to work around little details like this. It’s nice to know simple tables and partitioned tables with local indexes can now behave the same way. I also wonder whether there may be sites that could drop (or drop columns from, or make local) some indexes that they’ve previously created to  handle queries of the “most recent occurrence” type.

If, for any reason, you need to disable this enhancement, it’s controlled by fix_control (v$system_fix_control) “18915345 Allow MIN/MAX optimization for pred having single row subquery” which can be set in the startup file, at the system level, or in the session.

Update

Checking MoS for the bug number I found that the limitation had been reported for 11.2.0.3, with “Fixed in product version” reported as 12.2; but there are patches for various releases of 11.2.0.4, though none yet for 12.1.0.2 – but if you think you need it you can always try raising an SR.

 

March 19, 2018

Reference Costs

Filed under: Oracle,Partitioning,Performance — Jonathan Lewis @ 7:44 am BST Mar 19,2018

The partitioning option “partition by reference” is a very convenient option which keeps acquiring more cute little features, such as cascading truncates and cascading splits, as time passes – but what does it cost and would you use it if you don’t really need to.

When reference partitioning came into existence many years ago, I had already seen several performance disasters created by people’s enthusiasm for surrogate keys and the difficulties this introduced for partition elimination; so my first thought was that this was a mechanism that would have a hugely beneficial effect on systems which (in 20:20 – or 6:6 if you’re European – hindsight) had been badly designed and would otherwise need a lot of re-engineering to use partitioning effectively.

(Side note: Imagine you have partitioned an orders table on colX which is a column in the real (business-oriented) candidate key, but you’ve created a surrogate key which is used as the target for a foreign key from the order_lines tables – how do you get partition-wise joins between orders and order_lines if you haven’t got the partitioning column in the order_lines table ?)

So ref partitioning was a good way to workaround a big existing problem and, whatever overheads it introduced, the benefit was potentially so  huge that you wouldn’t care (or, probably, notice) that your system was less efficient than it ought to be. But what if you’re working on a new project and still have control of the physical design – how does that change the cost/benefit analysis.

It’s actually taken me several years to get round to producing a little demonstration to highlight one of the obvious costs of reference partitioning – even though it’s a very simple demo raising the obvious question: ‘how much work does Oracle have to do to find the right partition when inserting a “child” row ?’ If you chose to implement reference partitioning without asking that simple question you may be using a lot more machine resources than you really need to, although you may not actually be heading for a disastrous performance problem.

As a demonstration of the issue I’m going to set up something that approximates an order/order_lines model in two ways, one using reference partitioning and one using a copied column, to see what differences show up when you start loading data.

rem
rem     Script:         pt_ref.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Mar 2018
rem     Purpose:
rem
rem     Last tested
rem             12.2.0.1
rem             12.1.0.2
rem

create table orders (
        id              number(10,0) not null,
        id_cust         number(10,0) not null,
        date_ordered    date not null,
        padding         varchar2(150)
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
);

create unique index ord_pk on orders (id);
alter table orders add constraint ord_pk primary key(id);

create table order_lines (
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by reference (orl_fk_ord)
;

create unique index orl_pk on order_lines (id_ord, line_number);
alter table order_lines add constraint orl_pk primary key (id_ord, line_number);

create table order_lines_2 (
        date_ordered    date,
        id_ord          number(10,0) not null,
        line_number     number(4,0)  not null,
        id_product      number(6,0)  not null,
        qty             number(6,0)  not null,
        value           number(10,2) not null,
        padding         varchar2(150),
        constraint orl2_fk_ord foreign key (id_ord) references orders
                on delete cascade
)
partition by range (date_ordered)
(
        partition p201801       values less than (to_date('01-Feb-2018')),
        partition p201802       values less than (to_date('01-Mar-2018')),
        partition p201803       values less than (to_date('01-Apr-2018')),
        partition p201804       values less than (to_date('01-May-2018')),
        partition p201805       values less than (to_date('01-Jun-2018')),
        partition p201806       values less than (to_date('01-Jul-2018')),
        partition p201807       values less than (to_date('01-Aug-2018'))
)
;

create unique index orl2_pk on order_lines_2 (id_ord, line_number);
alter table order_lines_2 add constraint orl2_pk primary key (id_ord, line_number);

It’s a bit of a bodge job as far as modelling goes, but that’s to keep workload comparisons easy and make a point without writing too much code. All I’ve got is an orders table partitioned by date and an order_lines table that I want partitioned the same way. I’ve handled the requirement for partitioning order_lines in two ways, one is partition by reference and the other is to copy down the partitioning column from the orders table. (In my view the “real” key for an orders table should be (customer identifier, order date, counter) and if I thought efficient partitioning was going to be a necessary feature for scalability I would copy down all three columns. Depending on the nature of the business I would compress the primary key index on orders on one or two of the columns, and the foreign key index on order_lines on one, two, or three of its columns)

Now all I have to do is load some data into the tables. First the orders table:

insert into orders(
        id, id_cust, date_ordered, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        rownum                                  id,
        trunc(dbms_random.value(10000,20000))   id_cust,
        to_date('01-Jan-2018') +
                trunc((rownum-1)/100)           date_ordered,
        rpad('x',40)                            padding
from
        g,g
where
        rownum <= 2e4
;

commit;
execute dbms_stats.gather_table_stats(user,'orders')

This produces 100 orders per day, for 200 days which fits within the seven months of pre-declared partitions. I’ve gathered table stats on the table because that’s probably the best way to deal with any requirements for block cleanout after the insert. (Note: I’m avoiding interval partitioning in this example because that’s just another complication to add to the comparison and, as I reported a few days ago, introduces another massive inefficiency on data loading.)

Now I’ll insert some order_lines rows at 5 lines per order into the two versions of the order_lines tables. One of them, of course, has to have a date generated using the same algorithm that I used for the orders table. Note that I’ve made a call to dbms_random.seed(0) before each insert to guarantee that the same “random” values will be inserted in both table.

execute dbms_random.seed(0)

insert into order_lines_2(
        date_ordered, id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        to_date('01-Jan-2018') +
                trunc((rownum-1)/500)           date_ordered,
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;


execute dbms_random.seed(0)

insert into order_lines(
        id_ord, line_number, id_product, qty, value, padding
)
with g as (
        select rownum id from dual
        connect by level <= 1e4
)
select
        1 + trunc((rownum-1)/5)                 id_ord,
        1 + mod(rownum,5)                       line_number,
        trunc(dbms_random.value(10000,20000))   id_product,
        1 qty,
        1 value,
        rpad('x',80)                            padding
from
        g,g
where
        rownum <= 10e4
;

commit;

What I haven’t shown in the code is the snapshot wrapping I used to check the session stats, system latch activity and system rowcache activity – which I thought would give me the best indication of any variation in workload. In fact, of course, the first and simplest variation was the elapsed time: 4.5 seconds for the ref partitioned table, 2.5 seconds for the explicitly created table (regardless of which insert I did first), and it was nearly all pure CPU time.

It turned out that the rowcache stats showed virtually no variation, and the latch stats only showed significant variation in the latches that I could have predicted from the session stats, and here are the most significant session stats that highlight and explain the difference in times:

Explicitly Created
------------------
CPU used by this session                                                   231
DB time                                                                    242
db block gets                                                          219,471
db block changes                                                        27,190
redo entries                                                            15,483
redo size                                                           24,790,224
HSC Heap Segment Block Changes                                           2,944

Ref partitioned
---------------
CPU used by this session                                                   515
DB time                                                                    532
db block gets                                                          615,979
db block changes                                                       418,025
redo entries                                                           209,918
redo size                                                           70,043,676
HSC Heap Segment Block Changes                                         100,048

These results were from 12.1.0.2, but the figures from 12.2.0.1 and 11.2.0.4 were similar though the CPU time dropped as the version number went up: what you’re seeing is the effect of turning an array insert (for the precreated table) into single row processing for the ref partitioned table. Basically it seems that for every row inserted Oracle has to do something to work out which partition the row should go into, and while it does that work it release any pins of buffers it would have been holding from the previous row’s insert; in other words, various optimisations relating to array inserts are not taking place.

  • Looking in more detail at the figures for the ref partition insert:
  • The 100,000 “HSC heap Segment Block Changes” equate to the 100,000 rows inserted into the table
  • Add the single row index updates to the primary key and you get 200,000 redo entries.
  • For every individual row inserted Oracle has to do a current mode (db block gets) check against the primary key of the orders table – but when array processing the root block can be pinned.

We can get a closer look at the differences by taking snapshots of v$segstat (or v$segment_statistics), to see the following (pre-created table on the left):


ORD_PK                                |   ORD_PK
  logical reads          199,440      |     logical reads          300,432
                                      |
ORDER_LINES_2 - P201801               |   ORDER_LINES - P201801
  logical reads            2,112      |     logical reads           16,960
  db block changes         1,280      |     db block changes        16,944
                                      |
ORDER_LINES_2 - P201802               |   ORDER_LINES - P201802
  logical reads            2,256      |     logical reads           16,144
  db block changes         1,248      |     db block changes        15,088
                                      |
ORDER_LINES_2 - P201803               |   ORDER_LINES - P201803
  logical reads            2,288      |     logical reads           17,264
  db block changes         1,376      |     db block changes        16,560
                                      |
ORDER_LINES_2 - P201804               |   ORDER_LINES - P201804
  logical reads            2,672      |     logical reads           16,768
  db block changes         1,280      |     db block changes        16,144
                                      |
ORDER_LINES_2 - P201805               |   ORDER_LINES - P201805
  logical reads            2,224      |     logical reads           17,472
  db block changes         1,264      |     db block changes        16,528
                                      |
ORDER_LINES_2 - P201806               |   ORDER_LINES - P201806
  logical reads            2,624      |     logical reads           16,800
  db block changes         1,328      |     db block changes        16,160
                                      |
ORDER_LINES_2 - P201807               |   ORDER_LINES - P201807
  logical reads            1,376      |     logical reads           10,368
  db block changes           864      |     db block changes        10,752
                                      |
ORL2_PK                               |   ORL_PK
  logical reads           10,640      |     logical reads          206,352
  db block changes         7,024      |     db block changes       104,656

The right hand data set does an extra 100,000 logical reads on the ORD_PK index (top set of lines) which I think are the 100,000 gets on the root block that was pinned for the table on the left – the numbers don’t quite add up, so there’s some extra complexity that I haven’t guessed correctly.

The insert into the ORL[2]_PK index (lines) is single row processed for the right hand table – with, I think, the logical reads recording two current gets per insert.

Every partition of the table, except the last, shows 15,000 db block changes, totalling a difference of about 100,000 db block changes corresponding to the single rows being inserted. Then ORL[2]_PK shows another 100,000 db block changes, giving us the 200,000 we saw as redo entries and 400,000 (when doubled up to allow for the undo) db block changes that we saw in total.

Finally we need to explain the difference of 400,000 db block gets between the two sets of session stats – and I think this is the extra 100,000 for ORD_PK, the 100,000 for the table inserts, and 200,000 for the ORL[2]_PK index, which I think might be explained as 100,000 as a current get that checks for “duplicate key” and 100,000 gets to do the actual insert.

Bottom Line, though – if you use reference partitioning every array insert seems to turn into single row processing with the attendant increase in buffer gets, undo and redo generated,  latch activity, and CPU used as Oracle checks for every single row which partition it should go into: and there doesn’t seem to be any optimisation that caters for “this row belongs in the same partition as the previous row”. You may decide that this extra cost due to reference partitioning is worth it for the benefits that reference partitioning supplies – it’s all down to what your application does, especially in terms of aging data perhaps – but it’s nice to know that this cost is there so that you can do a better cost/benefit analysis.

Footnote:

Interested readers might like to extend this test to a multi-layered set of ref-partitioned tables to see if the increase in overheads is linear or geometric.

 

March 15, 2018

Keeping Intervals

Filed under: 12c,Oracle,Partitioning — Jonathan Lewis @ 8:03 am BST Mar 15,2018

I’ve recently been reminded of a blog post I wrote a couple of years ago that discussed the issue of running into the hard limit of 2^20 -1 as the number of segments for a (composite) partitioned table – a problem that could arise in a relatively short time if you used a large number of hash subpartitions in an interval/hash composite partitioned table (you get about 2 years and 10 months of daily partitions at 1,024 subpartitions per day, for example).

A natural follow-on from that article is to think through a strategy for dropping old partitions sufficiently early that you don’t hit the limit as new partitions are created. This, of course, pretty much defeats the point of interval partitioning – instead of planning to add partitions “just in time” you now have to eliminate them “just in time”. Amongst other issues, we’re going to find that interval partitioning manages to re-introduce a problem with range partitioning that Oracle got rid of in Oracle 10g.

So let’s test the obvious option: drop the oldest partition(s) in time to keep head-room for new partitions; for convenience we’ll start with a simple interval partitioned table with a few pre-declared range partitions and a few automatically generated interval partitions. All the examples here were run under 12.1.0.2:


rem
rem     Script:         pt_merge.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

create table t1(id, v1, padding)
partition by range (id) interval (1e4)
(
        partition p10000 values less than (1e4),
        partition p20000 values less than (2e4),
        partition p30000 values less than (3e4),
        partition p40000 values less than (4e4),
        partition p50000 values less than (5e4)
)
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4 -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > comment to avoid WordPress format issue
;


begin
        dbms_stats.gather_table_stats(
                ownname     => user,
                tabname     => 'T1',
                method_opt  => 'for all columns size 1'
        );
end;
/

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.


I’ve created 100,000 rows and since the partitions I’ve pre-declared have an (unreachable) upper bound of only 50,000 Oracle will have added a further 6 partitions to the table to hold the data for values up to 110,000 (with just one row in the last partition). For testing purposes I’ve created the table in an otherwise empty tablespace so when I check the block address of each segment I can see the location (and size) of the segments so far. So here’s the list of names and locations:

SEGMENT_NAME              PARTITION_NAME         HEADER_BLOCK     BLOCKS
------------------------- ---------------------- ------------ ----------
T1                        P10000                          128        256
T1                        P20000                          384        256
T1                        P30000                          640        256
T1                        P40000                          896        256
T1                        P50000                         1152        256
T1                        SYS_P69838                     1408        256
T1                        SYS_P69839                     1664        256
T1                        SYS_P69840                     1920        256
T1                        SYS_P69841                     2176        256
T1                        SYS_P69842                     2432        256
T1                        SYS_P69843                     2688        128

11 rows selected.

No surprises so far. So let’s pretend we know the dreaded ORA-14299 or ORA-14300 will be arriving soon and try to drop the first 5 partitions to keep the partition count below the limit. Here’s a cut-n-paste from an SQL*Plus session that tries to do that one partition at a time:

SQL> alter table t1 drop partition p10000;

Table altered.

SQL> alter table t1 drop partition p20000;

Table altered.

SQL> alter table t1 drop partition p30000;

Table altered.

SQL> alter table t1 drop partition p40000;

Table altered.

SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped

We can’t drop partition p50000 – it’s the highest partition that wasn’t created automatically, and we have to leave an “anchor” partition in place for interval partitioning to work from. By querying user_tab_partitions we can even see that this partition is flagged a little differently from the others:


select
        partition_name, interval, high_value 
from
        user_tab_partitions
where
        table_name = 'T1'
order by
        partition_position
;


PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P50000                 NO  5e4
SYS_P69844             YES 60000
SYS_P69845             YES 70000
SYS_P69846             YES 80000
SYS_P69847             YES 90000
SYS_P69848             YES 100000
SYS_P69849             YES 110000

7 rows selected.

So, at first sight, we’re stuck. If we’re dropping old partitions we will eventually get to a point where there’s only one “real” range partition at the bottom and then we can’t drop any more historic partitions. There are two solutions to this problem, explained a long time ago here and here by Harald van Breederode.

Option 1

Convert the interval partitioned table to a range partitioned table and back again, and if you know the interval (and you can always look it up in the data dictionary) there’s a quick and dirty way of doing that. Here’s a cut-n-paste demonstrating the method and effect:


SQL> alter table t1 set interval (10000);

1Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position ; 

PARTITION_NAME         INT HIGH_VALUE
---------------------- --- --------------------------
P10000                 NO  1e4
P20000                 NO  2e4
P30000                 NO  3e4
P40000                 NO  4e4
P50000                 NO  5e4
SYS_P69850             NO  60000
SYS_P69851             NO  70000
SYS_P69852             NO  80000
SYS_P69853             NO  90000
SYS_P69854             NO  100000
SYS_P69855             NO  110000

11 rows selected.

SQL> select table_name, partitioning_type, interval from user_part_tables;

TABLE_NAME           PARTITION INTERVAL
-------------------- --------- --------------------
T1                   RANGE     1E4

1 row selected.

Every single partition has just become a range-based partition, but the table is still interval partitioned. This is a tidy solution, but there’s one obvious, generic, drawback to the method.  The “theory” of interval partitioning is that you don’t have to pre-create partitions in anticipation of the data arriving – so what will happen if a (possibly bad) row arrives weeks ahead of schedule and you find that Oracle has created (say) partition 85,001 with a gap of 12,000 partitions between the current high partition and the new one. If you use this “convert to range and back” trick then you’ll have a single partition covering the entire range where you were expecting (eventually) to have 12,000 partitions. Every time you convert from interval to range and back you’d better have code that checks if there are any gaps first, and then does loads of “split partition” –  or comes up with some other strategy – to address the side effects.

Option 2

When you’ve got just one range partition left, merge the bottom two partitions – this makes the next partition up a range partition without affecting any other partitions. After recreating the original table and dropping the first 4 partitions this is how things go:


SQL> alter table t1 drop partition p50000;
alter table t1 drop partition p50000
                              *
ERROR at line 1:
ORA-14758: Last partition in the range section cannot be dropped


SQL> alter table t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select partition_name, interval, high_value from user_tab_partitions where table_name = 'T1' order by partition_position;

PARTITION_NAME         INTERVAL             HIGH_VALUE
---------------------- -------------------- --------------------------
P_LOW                  NO                   60000
SYS_P69863             YES                  70000
SYS_P69864             YES                  80000
SYS_P69865             YES                  90000
SYS_P69866             YES                  100000
SYS_P69867             YES                  110000

6 rows selected.

Is this too good to be true ? Of course it is, but you may have to pause for a moment to think why. When you merge two partitions Oracle copies the contents of the two segments into a new segment – always; even if one of the two segments is empty. When you do a “split partition” Oracle runs a check to see if the split would leave all the data in a single segment and if it would then Oracle doesn’t do any copying but simply plays clever games in the data dictionary – unfortunately Oracle doesn’t use the same sort of trick to optimise a merge.

So the merge partition mechanism carries less risk than the “interval/range/interval”, but you either pay the cost of the merge or you carefully code the mechanism so that the bottom two partitions are always empty when you merge: for example you might always leave the bottom (range) partition empty and use your scheduled code to truncate (or exchange out) the lowest interval partition, then do the merge.

The good news

When you upgrade to 12.2.0.1 you can drop the lowest partition – and Oracle will simply turn the lowest interval partition currently in existence into a range partition. (That may be a bit of a nuisance if there’s a gap between the range partition and the current lowest interval partition.)

The Bad News

It doesn’t really matter which strategy you use to deal with this problem (even if you’ve upgraded to 12.2) – you still pay one other penalty for both mechanisms. And that’s the bit which re-introduces a problem that last existed in 9i.

Ask youself “How does Oracle know which interval a partition is for and what the limit is on the partitioning key ?” Then look at the data dictionary, or maybe build a very simple model and trace what happens when you use either of the methods above – but in your model create a significant number or partitions first. I’m going to take the data dictionary method – starting from the point where I’ve created and populated the table. Again this is cut-n-paste, and do note that I switch to the sys account after creating the table:


SQL> select object_id, object_name, subobject_name from user_objects;

 OBJECT_ID OBJECT_NAME          SUBOBJECT_NAME
---------- -------------------- ----------------------
    185164 T1
    185165 T1                   P10000
    185166 T1                   P20000
    185167 T1                   P30000
    185168 T1                   P40000
    185169 T1                   P50000
    185170 T1                   SYS_P69868
    185171 T1                   SYS_P69869
    185172 T1                   SYS_P69870
    185173 T1                   SYS_P69871
    185174 T1                   SYS_P69872
    185175 T1                   SYS_P69873

12 rows selected.

SQL> connect / as sysdba
Connected.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185165     185165         10
    185166     185166         20
    185167     185167         30
    185168     185168         40
    185169     185169         50
    185170     185170 2147483648
    185171     185171 2147483649
    185172     185172 2147483650
    185173     185173 2147483651
    185174     185174 2147483652
    185175     185175 2147483653

11 rows selected.

I’ve queried user_objects to find the object_id of the table then used that as the “base object number” (bo#) to query tabpart$, which holds the table partition definitions. Note how there are 5 partitions where the partition number goes up 10 at a time, and 6 where it goes up one at a time. Prior to 10g (and interval partitions, of course) the stored partition number would increase in steps of 1 but if you wanted to do a split, merge or drop partition (and the last of the three was the most significant one) every single partition position about the split/merge/drop point would have to be renumbered, and that was done by a single row update to the data dictionary to keep the numbering intact. The steps of 10 were introduced in 10g to deal with the inherent performance problems – particularly the shared pool catastrophe that this could cause.

The steps of 1 for interval partitions allows Oracle to keep track (easily) of what high_value each partition partition represents, and the highest legal partition. Try inserting the values 1,000,000 into the table and re-run the query against tabpart$ and you’ll see Oracle adding part# = 2147483743. So what do you think is going to happen if you try to apply the two mechanisms ?

If you do the interval/range/interval switch every interval part# will be renumbered so to follow the “increment by 10” pattern. If you drop partitions p10000 to p40000 nothing happens to the existing part# values until you get to the command to merge p50000 with the next partition up and then you see this:


SQL> alter table test_user.t1 merge partitions for (45000), for (55000) into partition p_low;

Table altered.

SQL> select obj#, dataobj#, part# from tabpart$ where bo# = 185164 order by part#;

      OBJ#   DATAOBJ#      PART#
---------- ---------- ----------
    185177     185177         10
    185171     185171 2147483648
    185172     185172 2147483649
    185173     185173 2147483650
    185174     185174 2147483651
    185175     185175 2147483652
    185176     185176 2147483742

7 rows selected.


The newly merged partition is a new object, of course, so has a completely new obj# and dataobj#, and it’s been given the part# of 10 (the lowest value for a clean range-partitioned object). Every single interval partition has had its part# decreased by one. The lowest possible interval partition is always given the part# of 2147483648 (0x80000000) and the partition numbering increments by 1 from there onwards. (The numbering gets a little more subtle when you have composite partitioning but a similar approach takes place in tabcompart$).

Pause for thought – if you’re thinking of creating an interval partitioned table that could get close to a running level of 1 million partitions and you start to get rid of old partitions in any version of Oracle then each “drop/merge” partition will update about 1 million rows in the data dictionary – and that’s assuming you don’t have any local indexes that will need to be renumbered in the same way!

Here’s a critical part of the output from tkprof when I recreated the table with 1,000,000 rows – which means 101 partitions – and created a local index on it, before dropping the first 4 partitions and then enabled tracing just before merging the bottom interval partition with the anchor range partition.


update indpart$ set dataobj# = :1, part# = :2, flags = :3, ts# = :4, file# =
  :5, block# = :6, pctfree$ = :7, initrans = :8, maxtrans = :9, analyzetime =
  :10, samplesize = :11, rowcnt = :12, blevel = :13, leafcnt = :14, distkey =
  :15, lblkkey = :16, dblkkey = :17, clufac = :18, pctthres$ = :19
where
 obj# = :20


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.01          0         94        480          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.01       0.01          0         94        480          94


update tabpart$ set dataobj# = :1, part# = :2, ts# = :3, file# = :4, block# =
  :5, pctfree$ = :6, pctused$ = :7, initrans = :8, maxtrans = :9, flags = :10,
   analyzetime = :11, samplesize = :12, rowcnt = :13, blkcnt = :14, empcnt =
  :15, avgspc = :16, chncnt = :17, avgrln = :18
where
 obj# = :19


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse       94      0.00       0.00          0          0          0           0
Execute     94      0.00       0.00          0        188        489          94
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      188      0.00       0.00          0        188        489          94

That’s not a lot of work for my little example with less than 100 partitions – but when you’ve got a million of them, with a handful of indexes, and the partitions have been created over time that’s going to turn into a lot of work that’s going to disrupt the shared pool for a long time, generate a lot of redo, and do a lot of disk reads and writes.

So be cautious with interval partitioning – even in 12.2 (and 18.1, possibly) the ease of use may disappear if you realise too late that you’re going to get into a cycle of partition maintenance.

Footnote for composite partitioning – the limits of 2^20-1 segments (hence subpartitions) still applies, but the necessary update is relevant only at the partition level, not at the subpartition level. The objects updated are tabcompart$ and indcompart$.

Update (included for ironic effect)

The day I posted this note my “Oracle Support Hot Topics” email with a report of the following bug:

Bug 19294302 : DBMS_REDEFINITION DOES NOT WORK WITH INTERVAL PARTITIONS

This was reported for 11.2.0.4, fixed in 12.2. The rediscovery information is:

ORA-14024 during copy_table_dep when the interim table is interval partitioned.

The problem arises if you change a table from simple range partitioned to range with interval – so might be relevant if you have a strategy of doing the interval/range/interval trick.

 

 

February 21, 2018

Interval Partition Problem

Filed under: Bugs,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 8:40 am BST Feb 21,2018

Assume you’ve got a huge temporary tablespace, there’s plenty of space in your favourite tablespace, you’ve got a very boring, simple table you want to copy and partition, and no-one and nothing is using the system. Would you really expect a (fairly) ordinary “create table t2 as select * from t1” to end with an Oracle error “ORA-1652: unable to extend temp segment by 128 in tablespace TEMP” . That’s the temporary tablespace that’s out of space, not the target tablespace for the copy.

Here’s a sample data set (tested on 11.2.0.4 and 12.1.0.2) to demonstrate the surprise – you’ll need about 900MB of space by the time the entire model has run to completion:

rem
rem     Script:         pt_interval_threat_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2018
rem

column today new_value m_today
select to_char(sysdate,'dd-mon-yyyy') today from dual;

create table t1
as
with g as (
        select rownum id
        from dual
        connect by level <= 2e3
)
select
        rownum id,
        trunc(sysdate) + g2.id  created,
        rpad('x',50)            padding
from
        g g1,
        g g2
where
        rownum <= 4e6 --> comment to avoid WordPress format issue
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

I’ve created a table of 4 million rows, covering 2,000 dates out into the future starting from sysdate+1 (tomorrow). As you can see there’s nothing in the slightest bit interesting, unusual, or exciting about the data types and content of the table.

I said my “create table as select” was fairly ordinary – but it’s actually a little bit out of the way because it’s going to create a partitioned copy of this table.


execute snap_my_stats.start_snap

create table t2
partition by range(created)
interval(numtodsinterval(7, 'day'))
(
        partition p_start       values less than (to_date('&m_today','dd-mon-yyyy'))
)
storage(initial 1M)
nologging
as
select
        *
from
        t1
;

set serveroutput on
execute snap_my_stats.end_snap

I’ve created the table as a range-partitioned table with an interval() declared. Conveniently I need only mention the partitioning column by name in the declaration, rather than listing all the columns with their types, and I’ve only specified a single starting partition. Since the interval is 7 days and the data spans 2,000 days I’m going to end up with nearly 290 partitions added.

There’s no guarantee that you will see the ORA-01652 error when you run this test – the data size is rather small and your machine may have sufficient other resources to hide the problem even when you’re looking for it – but the person who reported the problem on the OTN/ODC database forum was copying a table of 2.5 Billion rows using about 200 GB of storage, so size is probably important, hence the 4 million rows as a starting point on my small system.

Of course, hitting an ORA-01652 on TEMP when doing a simple “create as select” is such an unlikely sounding error that you don’t necessarily have to see it actually happen; all you need to see (at least as a starting point in a small model) is TEMP being used unexpectedly so, for my first test (on 11.2.0.4), I’ve included some code to calculate and report changes in the session stats – that’s the calls to the package snap_my_stats. Here are some of the more interesting results:


---------------------------------
Session stats - 20-Feb 16:58:24
Interval:-  14 seconds
---------------------------------
Name                                                                     Value
----                                                                     -----
table scan rows gotten                                               4,000,004
table scan blocks gotten                                                38,741

session pga memory max                                             181,338,112

sorts (rows)                                                         2,238,833

physical reads direct temporary tablespace                              23,313
physical writes direct temporary tablespace                             23,313

The first couple of numbers show the 4,000,000 rows being scanned from 38,741 table blocks – and that’s not a surprise. But for a simple copy the 181MB of PGA memory we’ve acquired is a little surprising, though less so when we see that we’ve sorted 2.2M rows, and then ended up spilling 23,313 blocks to the temporary tablespace. But why are we sorting anything – what are those rows ?

My first thought was that there was a bug in some recursive SQL that was trying to define or identify dynamically created partitions, or maybe something in the space management code trying to find free space, so the obvious step was to enable extended tracing and look for any recursive statements that were running a large number of times or doing a lot of work. There weren’t any – and the trace file (particularly the detailed wait events) suggested the problem really was purely to do with the CTAS itself; so I ran the code again enabling events 10032 and 10033 (the sort traces) and found the following:


---- Sort Statistics ------------------------------
Initial runs                              1
Input records                             2140000
Output records                            2140000
Disk blocks 1st pass                      22292
Total disk blocks used                    22294
Total number of comparisons performed     0
Temp segments allocated                   1
Extents allocated                         175
Uses version 1 sort
Uses asynchronous IO

One single operation had resulted in Oracle sorting 2.14 million rows (but not making any comparisons!) – and the only table in the entire system with enough rows to do that was my source table! Oracle seems to be sorting a large fraction of the data for no obvious reason before inserting it.

  • Why, and why only 2.14M out of 4M ?
  • Does it do the same on 12.1.0.2 (yes), what about 12.2.0.1 (no – but I still saw a large PGA allocation, possibly stats gathering).
  • Is there any clue about this on MoS (yes Bug 17655392 – though that one is erroneously, I think, flagged as “closed not a bug”)
  • Is there a workaround ? (Yes – I think so).

Playing around and trying to work out what’s happening the obvious pointers are the large memory allocation and the “incomplete” spill to disc – what would happen if I fiddled around with workarea sizing – switching it to manual, say, or setting the pga_aggregate_target to a low value. At one point I got results showing 19M rows (that’s not a typo, it really was close to 5 times the number of rows in the table) sorted with a couple of hundred thousand blocks of TEMP used – the 10033 trace showed 9 consecutive passes (that I can’t explain) as the code executed from which I’ve extract the row counts, temp blocks used, and number of comparisons made:


Input records                             3988000
Total disk blocks used                    41544
Total number of comparisons performed     0

Input records                             3554000
Total disk blocks used                    37023
Total number of comparisons performed     0

Input records                             3120000
Total disk blocks used                    32502
Total number of comparisons performed     0

Input records                             2672000
Total disk blocks used                    27836
Total number of comparisons performed     0

Input records                             2224000
Total disk blocks used                    23169
Total number of comparisons performed     0

Input records                             1762000
Total disk blocks used                    18357
Total number of comparisons performed     0

Input records                             1300000
Total disk blocks used                    13544
Total number of comparisons performed     0

Input records                             838000
Total disk blocks used                    8732
Total number of comparisons performed     0

Input records                             376000
Total disk blocks used                    3919
Total number of comparisons performed     0

There really doesn’t seem to be any good reason why Oracle should do any sorting of the data (and maybe it wasn’t given the total number of comparisons performed in this case) – except, perhaps, to allow it to do bulk inserts into each partition in turn or, possibly, to avoid creating an entire new partition at exactly the moment it finds just the first row that needs to go into a new partition. Thinking along these lines I decided to pre-create all the necessary partitions just in case this made any difference – the code is at the end of the blog note. Another idea was to create the table empty (with, and without, pre-created partitions), then do an “insert /*+ append */” of the data.

Nothing changed (much – though the number of rows sorted kept varying).

And then — it all started working perfectly with virtually no rows reported sorted and no I/O to the temporary tablespace !

Fortunately I thought of looking at v$memory_resize_ops and found that the automatic memory management had switched a lot of memory to the PGA, allowing Oracle to do whatever it needed to do completely in memory without reporting any sorting (and a couple of checks on v$process_memory showed that the amount of memory assigned to category “Other” – rather and “SQL”, interestingly – had exceeded the current value of the pga_aggregate_target). A quick re-start of the instance fixed that “internal workaround”.

Still struggling with finding a reasonable workaround I decided to see if the same anomaly would appear if the table were range partitioned but didn’t have an interval clause. This meant I had to precreate all the necessary partitions, of course – which I did by starting with an interval partitioned table, letting Oracle figure out which partitions to create, then disabling the interval feature – again, see the code at the end of this note.

The results: no rows sorted on the insert, no writes to temp. Unless it’s just a question of needing even more data to reproduce the problem with simple range partitioned tables, it looks as if there’s a problem somewhere in the code for interval partitioned tables and all you have to do to work around it is precreate loads of partitions, disable intervals, load, then re-enable the intervals.

Footnote:

Here’s the “quick and dirty” code I used to generate the t2 table with precreated partitions:


create table t2
partition by range(created)
interval(numtodsinterval(7, 'day'))
(
        partition p_start values less than (to_date('&m_today','dd-mon-yyyy'))
)
storage(initial 1M)
nologging
monitoring
as
select
        *
from
        t1
where
        rownum <= 0
;


<<expand>>
declare
        m_max_date      date;
begin
        select  max(created)
        into    expand.m_max_date
        from    t1
        ;

        <<loop>>
        for i in 1..expand.m_max_date - trunc(sysdate) loop
                dbms_output.put(
                        to_char(trunc(sysdate) + loop.i,'dd-mon-yyyy') || chr(9)
                );
                execute immediate
                        'lock table t2 partition for ('''  ||
                        to_char(trunc(sysdate) + loop.i,'dd-mon-yyyy') ||
                        ''') in exclusive mode'
                ;
        end loop;
        dbms_output.new_line();
end;
/

prompt  ========================
prompt  How to disable intervals
prompt  ========================

alter table t2 set interval();

The code causes partitions to be created by locking the relevant partition for each date between the minimum and maximum dates in the t1 table – locking the partition is enough to create it if it doesn’t already exists. The code is a little wasteful since it locks each partition 7 times as we walk through the dates, but it’s only a quick demo for a model and for copying a very large table wastage would probably be very small compared to the work of doing the actual data copy. Obviously one could be more sophisticated and limit the code to locking and creating only the partitions needed, and only locking them once each.

Update 28th Feb 2018

In comment #2 below Alexander Chervinskiy supplies another workaround. If your load is going to go to one partition and you know which one it is then SQL that specifies the target partition does not result in a massive sort – e.g. if you know you data is for 10th March 2018 then the following SQL would behave nicely:


insert
         /*+ append */
 into
        t2 partition for ('10-Mar-2018')
select
        *
from
        t1
/

Note, particularly, the use of the “for ({value})” clause that makes it possible to identify the correct partition without knowing its name.

August 3, 2017

Rebuilding Indexes

Filed under: Indexing,Infrastructure,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:00 pm BST Aug 3,2017

One of the special events that can make it necessary to rebuild an index is the case of the “massive DML”, typically a bulk delete that purges old data from a table. You may even find cases where it’s a good idea to mark a couple of your indexes as unusable before doing a massive delete and then rebuild them after the delete.

Despite the fact that a massive delete is an obvious special case it’s still not necessary in many cases to worry about a rebuild afterwards because the space made free by the delete will be smoothly reused over time with very little variation in performance. There is, however, one particular feature that increases the probability of a rebuild becoming necessary – global (or globally partitioned) indexes on partitioned tables. The problem (and the absence of problem in non-partitioned tables) is in the nature of the rowid.

For non-partitioned tables, and partitioned tables with local indexes, the rowid stored in an index is (assuming we’re thinking only of heap tables) stored as a sequence of 6 bytes consisting, in order, of: (tablespace relative file number, block number within file, row number within block). If the table is non-partitioned, or if this is an index segment from a locally partitioned index, all the index entries will be pointing to the same table segment and Oracle knows which segment that is from the data dictionary information – so Oracle can derive the data_object_id of the table segment and convert the tablespace relative file number into the absolute file number to navigate to the right row in the table.

When the index is global or globally partitioned any index entry may point to any of the table’s segments, so the rowid that is stored in the index is expanded by a further 4 bytes to hold the data_object_id of the table segment it is pointing to – and the data_object_id is the leading component: (data_object_id, tablespace relative file number, block number within file, row number within block). Think about what this means when you start to drop “old” partitions and add new partitions. Compare this with what happens when you simply delete a large volume of old data from a table and starting inserting new data. There’s an important difference to the way in which indexes will evolve.

Purging data

When you delete a large volume of data from a (simple, heap) table you will create a lot of empty space in a lot of existing table blocks. If that delete is supposed to get rid of “old” data (and to keep the description simple we’ll assume it’s the first time you’ve done this) it’s very likely that the delete will result in lots of empty blocks near the start of the table – rows that were inserted at the same time will tend to be physically close to each other in the table. This means that future inserts will soon start to reuse those table blocks. Think about what this means for index entries – especially for non-unique keys.

Assume you have 100 rows with value ‘XXX’ for an indexed column. Breaking the rowid into its component parts the index entries will be (‘XXX’,{file_id, block_id, row_number}).  Now assume you delete the oldest 10 rows then, over time, insert 10 more rows with the same key value. You’ve deleted the 10 index entries with the lowest values for (file_id, block_id) but the space that’s become available in the table will be in and around exactly that range of blocks – so the new index entries will probably end up looking very similar to the deleted index entries and inserted in and around the existing index entries for value ‘XXX’, so over time the index is unlikely to allocate much new space.

Now think about what happens when your table it partitioned but the index is global; your index entries are (‘XXX’,{data_object_id, file_id, block_id, row_number}). When you drop the oldest partition you will probably[1] delete all the index entries with the lowest data_object_id. When you start inserting new rows for ‘XXX’ the new table partition will have a data_object_id that is going to be higher than any previous data_object_id – which means you’re going to be inserting rows into the right-hand (high-value) edge of this section of the index. In some cases – typically those where you have a couple of leaf blocks per key value – the index may end up growing significantly because the insertion point for rows in the new partition isn’t in the leaf block with the available space, and it won’t be until you’ve done a few more bulk deletes and the leaf blocks relating to the oldest table partitions become completely empty that the space can be reused.

An example of this type of behaviour probably appeared on the OTN database forum quite recently.  Of course, there are various reasons why indexes can become inefficient, and the degree of inefficiency may only become noticeable over a fairly long period of time; moreover there are various reasons why global indexes are a little problematic, and various reasons why a bulk delete (which is what executing “alter table drop partition” does to a global index) has unpleasant side effects dependent somewhat on the number (and size) of the partitions and on how many you try to drop in one go.

There’s not  a lot you can do about this quirk of global indexes, but it’s always worth taking extra care with partitioned tables and focusing even more carefully on a strategic review of indexes:

  • Does this index really need to exist at all
  • Could this index be replaced by a selective function-based index
  • Does this index really need to be global / globally partitioned
  • How big is this index compared to the size it ought to be
  • Should this index be (basic) compressed
  • Is this index likely to be disrupted by a historic purge – or is there another reason for its undesirable behaviour

 

[1] probably delete entries with the lowest data_object_id” – I have to say this because if you’ve executed a “move partition” at any time a new data_object_id will have been generated for the partition, so the oldest partition could, in principal, have the highest data_object_id. The issue of changing data_object_ids brings a whole new level of complexity to global indexes – but only in a few special cases, fortunately.

 

 

June 9, 2017

12.2 Partitions

Filed under: 12c,Indexing,Oracle,Partitioning,Upgrades — Jonathan Lewis @ 10:13 am BST Jun 9,2017

At the end of my presentation to the UKOUG Database SIG yesterday I summed up (most of) the points I’d made with a slide making the claim:

In 12.2 you can: convert a simple heap table to partitioned with multi-column automatic list partitions, partially indexed, with read only segments, filtering out unwanted data, online in one operation.

 

Last night I decided I ought to demonstrate the claim – so here’s a little code, first creating a simple heap table:


rem
rem     Script:         122_features.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2017
rem
rem     Last tested
rem             12.2.0.1
rem

create table t1(
        date_start      not null,
        date_end        not null,
        id              not null,
        client_id,
        resort_code,
        uk_flag,
        v1,
        padding,
        constraint t1_range_ck check ((date_end - date_start) in (7, 14, 21))
)
segment creation immediate
nologging
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        trunc(sysdate,'yyyy') + 7 *  mod(rownum, 8)                                     date_start,
        trunc(sysdate,'yyyy') + 7 * (mod(rownum, 8) + trunc(dbms_random.value(1,4)))    date_end,
        rownum                                          id,
        trunc(dbms_random.value(1e5,2e5))               client_id,
        trunc(dbms_random.value(1e4,2e4))               resort_code,
        case when mod(rownum,275) = 0 then 1 end        uk_flag,
        lpad(rownum,10,'0')                             v1,
        lpad('x',100,'x')                               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e5 -- > "GT" inserted to avoid WordPress formatting issue
;

create index t1_client_idx on t1(client_id);
create index t1_resort_idx on t1(resort_code);
create index t1_ukflag_idx on t1(uk_flag);

alter table t1 add constraint t1_pk primary key(id);

I’ve got a table which models a travel company that arranges holidays that last one, two, or three weeks and (for convenience) they all start on the same day for the week. So I generate a start and end date for each row, making sure the start date is a multiple of seven days from a base date while the end date is 7, 14, or 21 days later. I’ve got a few indexes on the data, and a primary key constraint. There’s a special flag column on the table for holidays in the UK, which is a small parcentage of the holidays booked.

Eventually, when the data gets too big, I decide that I want to partition this data, and the obvious partitioning idea that springs to mind is to partition it so that holidays with the same start date and duration are all in the same partition and each partition holds a single start/duration.

I’ve also decided that I’m going to make old data read-only, and I’m not interested in the UK holidays once they’ve gone into history so I’m going to get rid of some of them.

The index protecting the primary key will have to be global since it won’t contain the partition key; since the index on uk_flag covers a small amount of data I’m going to keep that global as well, but I want the other two indexes to be local – except for the older data I’m not really interested in keeping the index on client id.

And I don’t want to stop the application while I’m restructuring the data.

So here’s my one SQL statement:


alter table t1 modify 
partition by list (date_start, date_end) automatic (
        partition p11 values (to_date('01-Jan-2017'),to_date('08-Jan-2017')) indexing off read only,
        partition p12 values (to_date('01-Jan-2017'),to_date('15-Jan-2017')) indexing off read only,
        partition p13 values (to_date('01-Jan-2017'),to_date('22-Jan-2017')) indexing off read only,
        partition p21 values (to_date('08-Jan-2017'),to_date('15-Jan-2017')) indexing off read only,
        partition p22 values (to_date('08-Jan-2017'),to_date('22-Jan-2017')) indexing off read only,
        partition p23 values (to_date('08-Jan-2017'),to_date('29-Jan-2017')) indexing off read only,
        partition p31 values (to_date('15-Jan-2017'),to_date('22-Jan-2017')) indexing off read only,
        partition p32 values (to_date('15-Jan-2017'),to_date('29-Jan-2017')) indexing off read only,
        partition p33 values (to_date('15-Jan-2017'),to_date('05-Feb-2017')) indexing off read only
)
including rows where uk_flag is null or (date_start > to_date('01-feb-2017','dd-mon-yyyy'))
online
update indexes (
        t1_client_idx local indexing partial,
        t1_resort_idx local,
        t1_ukflag_idx indexing partial
)
;

Key Points

  • partition by list (date_start, date_end) — partitioned by a multi-column list
  • automatic — if data arrives for which there is no existing partition a new one will be created
  • indexing off — some of my partitions (the pre-defined (oldest) ones) will be subject to partial indexing
  • read only — some of my partitions (the pre-defined (oldest) ones) will be made read only
  • including rows where — some of my rows will disappear during copying [1]
  • online — Oracle will be journalling the data while I copy and will apply the journal after the copying completes
  • update indexes – specify some details about indexes [2]
  • local — some of the rebuilt indexes will be local
  • indexing partial — some of the rebuilt indexes will not hold data (viz: for the partitions declared “indexing off”)

I’ve footnoted a couple of the entries:

[1] – the copy is done read-consistently so data inserted while the copy takes place will still appear in the final table, even if it looks as if it should have failed the including rows clause.

[2] – indexes which include the partition key will automatically be created as local indexes (but you can declare them here as global, or globally partitioned, if you want to). The manual has an error on this point; it suggests that prefixed indexes will be created as local indexes but then defines “prefixed” to mean contains the partition key” rather than the usual starts with the partition key”.

Job done – except for

  • the exhaustive tests that it’s been done correctly
  • the load test to see how it behaves when lots of new holidays are being booked and current ones being modified
  • a little bit of clearing up of “surprise” partitions that shouldn’t be ther
  • changing some of the automatically generated table partitions to be “indexing off” (if and when necessary)

Here are a few queries – with results – showing the effects this one statement had:


select count(*) from t1;

/*
  COUNT(*)
----------
     99773

-- some rows (old UK) have disappeared from the original 100,000
*/


select
        index_name, partitioned, status, leaf_blocks, num_rows , indexing, orphaned_entries
from
        user_indexes
where   table_name = 'T1'
order by
        partitioned, index_name
;

/*
INDEX_NAME           PAR STATUS   LEAF_BLOCKS   NUM_ROWS INDEXIN ORP
-------------------- --- -------- ----------- ---------- ------- ---
T1_PK                NO  VALID            263      99773 FULL    NO
T1_UKFLAG_IDX        NO  VALID              1        136 PARTIAL NO
T1_CLIENT_IDX        YES N/A              149      62409 PARTIAL NO
T1_RESORT_IDX        YES N/A              239      99773 FULL    NO

-- Indexes: Local or global, full or partial.
*/

select
        segment_type, segment_name, count(*)
from
        user_segments
group by
        segment_type, segment_name
order by
        segment_type desc, segment_name
;

/*
SEGMENT_TYPE       SEGMENT_NAME                COUNT(*)
------------------ ------------------------- ----------
TABLE PARTITION    T1                                24
INDEX PARTITION    T1_CLIENT_IDX                     15
INDEX PARTITION    T1_RESORT_IDX                     24
INDEX              T1_PK                              1
INDEX              T1_UKFLAG_IDX                      1

-- One local index has fewer segments than the other
*/

set linesize 180
set trimspool on

column high_value format a85
break on index_name skip 1
set pagesize 200

select
        index_name, status, leaf_blocks, num_rows, partition_name, high_value
from
        user_ind_partitions
where
        index_name = 'T1_CLIENT_IDX'
--      index_name like 'T1%'
order by
        index_name, partition_position
;

/*
INDEX_NAME           STATUS   LEAF_BLOCKS   NUM_ROWS PARTITION_NAME         HIGH_VALUE
-------------------- -------- ----------- ---------- ---------------------- -------------------------------------------------------------------------------------
T1_CLIENT_IDX        UNUSABLE           0          0 P11                    ( TO_DATE(' 2017-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-08 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P12                    ( TO_DATE(' 2017-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-15 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P13                    ( TO_DATE(' 2017-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-22 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P21                    ( TO_DATE(' 2017-01-08 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-15 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P22                    ( TO_DATE(' 2017-01-08 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-22 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P23                    ( TO_DATE(' 2017-01-08 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-29 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P31                    ( TO_DATE(' 2017-01-15 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-22 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P32                    ( TO_DATE(' 2017-01-15 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-01-29 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     UNUSABLE           0          0 P33                    ( TO_DATE(' 2017-01-15 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                            , TO_DATE(' 2017-02-05 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN')
                                                                             )

                     USABLE            10       4126 SYS_P1528              ( TO_DATE(' 2017-01-22 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4198 SYS_P1529              ( TO_DATE(' 2017-01-29 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4211 SYS_P1530              ( TO_DATE(' 2017-02-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4214 SYS_P1531              ( TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-26 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4195 SYS_P1532              ( TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-03-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4113 SYS_P1533              ( TO_DATE(' 2017-01-22 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-01-29 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE             9       4027 SYS_P1534              ( TO_DATE(' 2017-01-29 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4217 SYS_P1535              ( TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4167 SYS_P1536              ( TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-03-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4230 SYS_P1537              ( TO_DATE(' 2017-01-29 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4113 SYS_P1538              ( TO_DATE(' 2017-02-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-26 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4069 SYS_P1539              ( TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-03-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4215 SYS_P1540              ( TO_DATE(' 2017-01-22 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4138 SYS_P1541              ( TO_DATE(' 2017-02-19 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-26 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )

                     USABLE            10       4176 SYS_P1542              ( TO_DATE(' 2017-02-05 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                            , TO_DATE(' 2017-02-12 00:00:00', 'syyyy-mm-dd hh24:mi:ss', 'nls_calendar=gregorian')
                                                                             )


*/

I’ve limited the index partition output to the index with partial indexing enabled to show that its pre-defined partitions are marked as unusable and, as you can infer from the segement summary, those unusable index partition don’t have any segment space allocated to them.

Stress tests are left to the interested reader.

July 8, 2016

DML and Bloom

Filed under: Bugs,Oracle,Partitioning,Troubleshooting — Jonathan Lewis @ 1:01 pm BST Jul 8,2016

One of the comments on my recent posting about “Why use pl/sql bulk strategies over simple SQL” pointed out that it’s not just distributed queries that can change plans dramatically when you change from a simple select to “insert into … select …”; there’s a similar problem with queries that use Bloom filters – the filter disappears when you change from the query to the DML.

This seemed a little bizarre, so I did a quick search on MoS (using the terms “insert select Bloom Filter”) to check for known bugs and then tried to run up a quick demo. Here’s a summary of the related bugs that I found through my first simple search:

Bug 20112932 : BLOOM FILTER IS NOT CHOOSED WHEN EXECUTING INSERT SELECT STATEMENT
11.2.0.4 - Nov 2014: "not a bug".
This cites three more documents:
  Bug 13801198  BLOOM PRUNING/FILTER NOT USED IN DML STATEMENTS 
    11.2.0.3 - March 2012 - Duplicate, base bug: 18949550
    18949550: MERGE QUERY NOT USING BLOOM FILTER
    11.2.0.3 - June 2014: fixed in 12.2 (no patches yet)

  Bug 14325392  BLOOM FILTER NOT SUPPORTED IN PARTITION WISE JOIN 
    11.2.0.3 - July 2012 - fixed in 12.1 
    (but I think this is a PX filter, not a partition filter)

  Bug 17716301  BLOOM FILTER NOT USED FOR OUTER JOIN</pre>

    Aug 2014 - hidden bug note. Patches available for 11.2.0.3
    Note 1919508.1 describes the problem.

And here’s some code to demonstrate the problem – running 11.2.0.4 (with similar results on 12.1.0.2).


rem
rem     Script:         dml_bloom_bug.sql
rem     Author:         Jonathan Lewis
rem     Dated:          June 2016
rem

create table pt_hash (
        id,
        grp,
        small_vc,
        padding
)
nologging
pctfree 90 pctused 10
partition by hash(grp)
(
        partition p1,
        partition p2,
        partition p3,
        partition p4
)
as
with generator as (
        select  --+ materialize
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        rownum                          id,
        trunc(rownum/50)                grp,
        to_char(trunc(rownum/20))       small_vc,
        rpad('x',100)                   padding
from
        generator, generator
where
        rownum <= 1e5
;

create table t1
as
select
        rownum  id,
        rownum  n1,
        lpad(rownum,10) v1,
        rpad('x',100) padding
from
        dual
connect by
        rownum <= 4
;

create table target (
        n1      number,
        id      number
)
;

-- gather stats on t1 and pt_hash (for all columns size 1, otherwise default)

There’s nothing fancy in this code, just a hash partitioned table with 100,000 rows (and a lot of wasted space), and a “driving” table with 4 rows that I can use in a very simple join to the partitioned table. I’ve also got an empty table ready for an insert statement.

So here’s a query with its execution plan (with rowsource execution statistics) that (based on the content of the t1 table) should select from at most two of the four partitions. I’ve hinted a hash join which is where Bloom filtering can be very effective with partitioned table joins:


set serveroutput off
alter session set statistics_level = all;

select
        /*+
                leading(t1 h1)
                use_hash(h1)    no_swap_join_inputs(h1)
        */
        t1.n1,
        h1.id
from
        t1,
        pt_hash h1
where
        t1.id between 2 and 3
and     h1.grp = t1.n1
and     h1.id <= 100
;

select * from table(dbms_xplan.display_cursor(null,null,'outline allstats last cost partition'));

------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |  2115 (100)|       |       |      1 |00:00:00.16 |    8374 |   8367 |       |       |          |
|*  1 |  HASH JOIN                  |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.16 |    8374 |   8367 |  2440K|  2440K|  777K (0)|
|   2 |   PART JOIN FILTER CREATE   | :BF0000 |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |      0 |       |       |          |
|*  3 |    TABLE ACCESS FULL        | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |      0 |       |       |          |
|   4 |   PARTITION HASH JOIN-FILTER|         |      1 |    100 |  2113   (2)|:BF0000|:BF0000|     51 |00:00:00.16 |    8372 |   8367 |       |       |          |
|*  5 |    TABLE ACCESS FULL        | PT_HASH |      2 |    100 |  2113   (2)|:BF0000|:BF0000|     51 |00:00:00.16 |    8372 |   8367 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

-- Predicates and Outline redacted

The most significant points to notice are:

  • Operation 2 – Part Join Filter Created: we’ve created a Bloom filter
  • Operation 4 – Partition Hash Join – Filter: we’ve used the Bloom filter, Pstart/Pstop = :BF0000
  • Operations 5 – Table Access Full: starts twice, the two partitions we need

And now we use the same SQL statement but insert the result set into the holding table:

insert into target(n1, id)
select
        /*+
                ordered
                use_hash(h1)    no_swap_join_inputs(h1)
        */
        t1.id,
        h1.id
from
        t1,
        pt_hash h1
where
        t1.id between 2 and 3
and     h1.grp = t1.n1
and     h1.id <= 100
;

select * from table(dbms_xplan.display_cursor(null,null,'outline allstats last cost partition'));

------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT         |         |      1 |        |  2115 (100)|       |       |      0 |00:00:00.02 |   16698 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL |         |      1 |        |            |       |       |      0 |00:00:00.02 |   16698 |       |       |          |
|*  2 |   HASH JOIN              |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.02 |   16689 |  1969K|  1969K|  723K (0)|
|*  3 |    TABLE ACCESS FULL     | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |       |       |          |
|   4 |    PARTITION HASH ALL    |         |      1 |    100 |  2113   (2)|     1 |     4 |    100 |00:00:00.02 |   16687 |       |       |          |
|*  5 |     TABLE ACCESS FULL    | PT_HASH |      4 |    100 |  2113   (2)|     1 |     4 |    100 |00:00:00.02 |   16687 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------

-- Predicate and Outline sections redacted

In this case we don’t see any operation to create a Bloom filter, and operation 5 reports 4 starts, one for each partition, with the Pstart/Pstop showing 1 – 4. As a cross-check you can also see that the buffers visited has doubled and the time (though don’t trust it too much on such a small test) has increased.

To my surprise the outline information for the first plan didn’t report a Bloom filter hint (px_join_filter), but I tried adding one to the insert statement anyway – and nothing changed; however there is a “subquery pruning” mechanism that’s been available to the optimizer for a very long time now so I decided to try hinting that for both queries /*+ subquery_pruning(@sel$1 h1@sel$1 partition) */. This had no effect on the select statement, but produced the following plan for the insert:


-------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name    | Starts | E-Rows | Cost (%CPU)| Pstart| Pstop | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT          |         |      1 |        |  2115 (100)|       |       |      0 |00:00:00.01 |    8390 |       |       |          |
|   1 |  LOAD TABLE CONVENTIONAL  |         |      1 |        |            |       |       |      0 |00:00:00.01 |    8390 |       |       |          |
|*  2 |   HASH JOIN               |         |      1 |      3 |  2115   (2)|       |       |      1 |00:00:00.01 |    8381 |  1969K|  1969K|  829K (0)|
|*  3 |    TABLE ACCESS FULL      | T1      |      1 |      3 |     2   (0)|       |       |      2 |00:00:00.01 |       2 |       |       |          |
|   4 |    PARTITION HASH SUBQUERY|         |      1 |    100 |  2113   (2)|KEY(SQ)|KEY(SQ)|     51 |00:00:00.01 |    8379 |       |       |          |
|*  5 |     TABLE ACCESS FULL     | PT_HASH |      2 |    100 |  2113   (2)|KEY(SQ)|KEY(SQ)|     51 |00:00:00.01 |    8377 |       |       |          |
-------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see operation 4 now reports Partition Hash Subquery – Oracle has “pre-queried” the driving table (you can see the statement in the 10046 trace file if you enable SQL trace) to discover which partitions it will have to visit – and operation 5 starts twice with a Pstart/Pstop of KEY(SQ).

Interestingly the hint didn’t end up in the Outline section of the execution plan, so this is an example of a query which you can’t fix by capturing the SQL Plan Baseline for one statement against the text for another – though you can hack the hint into an SQL Profile or (my preference in this case, I think) an SQL Patch.

Footnote:

After publishing this note I started to browse through the comments on the article I’d linked to reference subquery pruning, and discovered that an example of this insert/select issue appeared there with a surprising workaround, which was to use the “multi-table insert” (with just a single table). At the moment, though I haven’t managed to get this working with this example.

Footnote 2:

I’ve just been reminded by email of another detail mentioned in the comments of the earlier post – this problem isn’t present for direct path (i.e. /*+ append */) inserts – and I have run a quick test on 12.1.0.2 to check that this is indeed the case (though the fix isn’t present in 11.2.0.4).  Don’t forget to check the listed Bug notes on MoS, in case there’s a back-port available for your version of Oracle.

 

May 23, 2016

Virtual Partitions

Filed under: 12c,CBO,Infrastructure,Oracle,Partitioning — Jonathan Lewis @ 1:16 pm BST May 23,2016

Here’s a story of (my) failure prompted by a recent OTN posting.

The OP wants to use composite partitioning based on two different date columns – the table should be partitioned by range on the first date and subpartitioned by month on the second date. Here’s the (slightly modified) table creation script he supplied:


rem
rem     Script: virtual_partition.sql
rem     Dated:  May 2016
rem

CREATE TABLE M_DTX
(
        R_ID    NUMBER(3),
        R_AMT   NUMBER(5),
        DATE1   DATE,
        DATE2   DATE,
        VC GENERATED ALWAYS AS (EXTRACT(MONTH FROM DATE2))
)
PARTITION BY RANGE (DATE1) interval (numtoyminterval(1,'MONTH'))
SUBPARTITION BY LIST (VC)
        SUBPARTITION TEMPLATE (
                SUBPARTITION M1 VALUES (1),
                SUBPARTITION M2 VALUES (2),
                SUBPARTITION M3 VALUES (3),
                SUBPARTITION M4 VALUES (4),
                SUBPARTITION M5 VALUES (5),
                SUBPARTITION M6 VALUES (6),
                SUBPARTITION M7 VALUES (7),
                SUBPARTITION M8 VALUES (8),
                SUBPARTITION M9 VALUES (9),
                SUBPARTITION M10 VALUES (10),
                SUBPARTITION M11 VALUES (11),
                SUBPARTITION M12 VALUES (12)
        )
        (
        PARTITION M_DTX_2015060100 VALUES LESS THAN (TO_DATE('2015-06-01 00:00:01', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
        )
;

There’s nothing particularly exciting about this – until you get to the query requirement – the user wants to query on date1 and date2, and doesn’t know about the virtual month column, e.g. (and, I know that there should be a to_date() or ANSI equivalent here):

SELECT * FROM m_dtx WHERE date1 = trunc(sysdate) AND date2 = '01-Jun-2016';

Now, as a general rule, you don’t expect partition elimination to occur unless the partitioning column appears with a predicate that make elimination possible, so your first response to this query is that it could eliminate on date1, but can’t possibly eliminiate on vc because vc isn’t in the where clause. However it’s possible that the partitioning code might be coded to recognise that the subpartition is on a virtual column that is derived from date2, so perhaps it could generate a new predicate before optimising, for example:

date2 = '01-Jun-2016'  => vc = 6

Unfortunately, your first response is correct – the optimizer doesn’t get this clever, and doesn’t do the sub-partition elimination. Here’s the execution plan from 12.1.0.2 for the sample query, followed by the execution plan when I explicitly add the predicate vc = 6.


SQL_ID  8vk1a05uv16mb, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy')

Plan hash value: 3104206240

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST ALL   |       |     1 |    57 |    15   (7)| 00:00:01 |     1 |    12 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))



SQL_ID  33q012bdhjrpn, child number 0
-------------------------------------
SELECT /*+ dynamic_sampling(0) */  * FROM m_dtx WHERE date1 =
trunc(sysdate) AND date2 = to_date('01-Jun-2016','dd-mon-yyyy') and vc
= 6

Plan hash value: 938710559

------------------------------------------------------------------------------------------------
| Id  | Operation              | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |       |       |       |    15 (100)|          |       |       |
|   1 |  PARTITION RANGE SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
|   2 |   PARTITION LIST SINGLE|       |     1 |    57 |    15   (7)| 00:00:01 |     6 |     6 |
|*  3 |    TABLE ACCESS FULL   | M_DTX |     1 |    57 |    15   (7)| 00:00:01 |   KEY |   KEY |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(("DATE2"=TO_DATE(' 2016-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "DATE1"=TRUNC(SYSDATE@!)))


Note how the predicate vc = 6  doesn’t show up in the predicate section in either case, but the execution plan shows PARTITION LIST ALL at operation 2 when we omit the predicate and PARTITION LIST SINGE when we include it (with suitable values also appearing for Pstart and Pstop). (The cost, by the way, is the cost of scanning a whole (range)partition whether or not the optimizer expects to restrict that scan to just one sub-partition.)

So the optimizer isn’t quite clever enough (yet). BUT … the optimizer can be very clever with constraints, combining constraints with predicates and applying transitive closure to produce new predicates – so maybe we could get the optimizer to do this if we helped it a little bit. Given the table definition supplied I’m going to assume that the date2 column is supposed to be non-null, so let’s add some truthful constraints/declarations to the table definition:


alter table m_dtx modify date2 not null;
alter table m_dtx modify vc  not null;
alter table m_dtx add constraint md_ck_vc check (vc = extract(month from date2));

Alas, this didn’t make any difference to the execution plan. But it did do something surprising to my attempts to load data into the table:


insert into m_dtx (r_id, r_amt, date1, date2)
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4
)
select
        mod(rownum, 1000),
        rownum,
        trunc(sysdate,'yyyy') + dbms_random.value(0,365),
        trunc(sysdate,'yyyy') + dbms_random.value(0,365)
from
        generator       v1,
        generator       v2
where
        rownum <= 1e4
;

insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-01400: cannot insert NULL into (???)

So the array insert with the virtual column doesn’t like the NOT NULL constraint on the virtual column because vc is, presumably, still null when the constraint is checked (though there’s no problem with single row inserts with the values() clause – I wonder what happens with the PL/SQL “FORALL” clause) – so let’s remove the not null constraint on vc and see what happens.


insert into m_dtx (r_id, r_amt, date1, date2)
*
ERROR at line 1:
ORA-02290: check constraint (TEST_USER.MD_CK_VC) violated

Unsurprisingly, given the fact that Oracle didn’t like the not null constraint, the critical check constraint also fails. This, by the way, is odd because a check constraint should accept a row when the constraint doesn’t evaluate to FALSE, so (a) vc can’t have been evaluated at this point or the constraint would evaluate to TRUE – which is not FALSE, and (b) vc at this point can no longer be null or the constraint would evaluate to NULL – which is not FALSE: so what “value” has vc got that makes the constraint check return FALSE ?

Bottom line:

I can see some scope for an optimizer enhancement that tries to find eliminating predicates from virtual columns; and I think there’s a need for ensuring that we can safely add constraints to virtual columns – after all we might want to create an index on a virtual column and sometimes we need a NOT NULL declaration to ensure that an index-only execution path can be found. Unfortunately I have to end this blog without finding an immediate solution for the OP.

Despite this failure, though, there are cases (as I showed a couple of years ago) where the optimizer in 12c can get clever enough to recognize the connection between a queried date column and the virtual partitioning column based on that date column.

Next Page »

Powered by WordPress.com.